Text Data Mining: Scholarly Journals

Scholarly Journals Sources

arXiv Bulk Data

Full-text and metadata bulk downloads for open access scholarship in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.

BioMed Central

Peer-reviewed BioMed Central articles are available for text and data mining through the BMC API.

Company of Biologists

Article full text, metadata, and citations may be crawled for electronic analysis without special permission or registration, on the condition that the use is non-commercial. Any reuse of content for this purpose is subject to the publisher's stated restrictions.

CORE: Open Access Research Papers

CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. They also provide full datasets by request.

Elsevier (ScienceDirect)

Researchers can text mine UCSB-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes, and query the API from UCSB IP addresses to ensure full access.

The General Index

The General Index consists of three tables derived from 107,233,728 journal articles. The first is a table of n-grams, ranging from unigrams to 5-grams, extracted using spaCy. Each of its 355,279,820,087 rows pairs an n-gram with a journal article id. The second table is built using YAKE and consists of 19,740,906,314 rows, each with a keyword and an article id. The third table associates each article id with metadata.
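To make the shape of the n-gram table concrete, the following is a minimal Python sketch that produces (n-gram, article id) rows for n from 1 to 5. It is an illustration only: it uses plain whitespace tokenization rather than spaCy, and the function name `ngram_rows` and the sample article id are hypothetical, so the General Index's actual pipeline may well differ.

```python
from typing import Iterator, Tuple


def ngram_rows(text: str, article_id: str, max_n: int = 5) -> Iterator[Tuple[str, str]]:
    """Yield (n-gram, article_id) pairs for every n from 1 to max_n."""
    # Assumption: lowercased whitespace tokenization stands in for a real tokenizer.
    tokens = text.lower().split()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n]), article_id


# Hypothetical article id, used purely for illustration.
rows = list(ngram_rows("Text mining of scholarly journals", "article-0001"))
```

Each row pairs one n-gram with the id of the article it came from, which is why the table's row count grows so much faster than the article count.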

IEEE Xplore Digital Library

The IEEE Xplore Metadata API provides access to metadata for millions of documents available in the IEEE Xplore Digital Library including IEEE journals, conferences, books/ebooks, courses, and standards; accessible through the IEEE Xplore API Portal.

Internet Archive Scholar

Internet Archive Scholar includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth-century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web. The collection can be searched through its API.

JSTOR Data for Research

Data for Research allows you to download word frequencies, citations, key terms, and ngrams for up to 25,000 JSTOR articles at a time, or to easily submit requests for larger sets of articles. See also:

JSTOR DfR on GitHub - A number of Python and R packages to work with JSTOR DfR data.

JSTOR's Text Analyzer - A reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials.

PLOS (Public Library of Science)

A Python tool for downloading, updating, and maintaining a repository of all PLOS XML article files. Use this program to download all PLOS XML article files instead of web scraping. See also the PLOS API (http://api.plos.org/solr/faq/), which can be used to query content from the seven open-access peer-reviewed journals from the Public Library of Science using any of the 23 terms in the PLOS Search.
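As a rough illustration of a fielded query against the PLOS API, the sketch below only builds a request URL and does not send it. The endpoint `http://api.plos.org/search` and the Solr-style parameters `q`, `fl`, `rows`, and `wt` are assumptions here and should be checked against the FAQ linked above; the helper name `build_plos_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: confirm the endpoint and parameter names in the PLOS API FAQ
# before relying on them.
PLOS_SEARCH_ENDPOINT = "http://api.plos.org/search"


def build_plos_query(field: str, term: str, rows: int = 10) -> str:
    """Return a search URL that matches `term` within a single search field."""
    params = {
        "q": f'{field}:"{term}"',  # fielded query, e.g. title:"text mining"
        "fl": "id,title",          # fields to return for each hit
        "rows": rows,              # number of results to request
        "wt": "json",              # response format
    }
    return f"{PLOS_SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_plos_query("title", "text mining")
```

Building the URL separately from sending it makes it easy to inspect the query before any request reaches the service.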

Royal Society

Members of subscribing institutions have permission to mine journal content. Before carrying out data or text mining, follow the instructions on this page to contact the Royal Society. In addition, Royal Society journals require authors to deposit their data, code, and research materials in public repositories.

Springer Digital Content (SpringerLink)

Individual researchers can download subscription (and open access) journal articles and books for TDM purposes directly from Springer Nature’s content platforms. They are asked to limit this to 1 request per second. Articles can be selected using existing search methods and tools, such as PubMed, Web of Science, or Springer Nature’s Metadata API, among others. Researchers who want to use Springer Nature’s TDM APIs can request an API key. The API provides additional query parameters and a higher rate limit for content requests (150 requests per minute).
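One way to respect a limit such as 1 request per second is to space out the start of each request. The following is a minimal sketch under stated assumptions: `fetch_politely` is a hypothetical helper, and the `fetch` callable is supplied by the caller (for example, a function that downloads one URL), so nothing here represents Springer Nature's own tooling.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[T]:
    """Call fetch(url) for each URL, starting at most one call per min_interval seconds."""
    results: List[T] = []
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the interval since the previous call began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results
```

With the default `min_interval=1.0` this matches the 1 request per second guideline; for the 150 requests per minute allowed with an API key, `min_interval=0.4` would be the equivalent spacing. The `sleep` and `clock` parameters exist only so the pacing can be tested without real delays.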

Wiley

Wiley supports TDM on Wiley content. Access to content for TDM purposes takes place through the Crossref Text API. Visit Text and Data Mining for Researchers for details.