Text Data Mining: Exploring Data Sources

A

Adam Matthew Collection: Empire Online

The Adam Matthew API can be used to return metadata and full text from documents, images, and sections for Adam Matthew Collections under a current UCSB Library license. Currently, API access is available for Empire Online. If you require offline data access, please send an email describing your research project and data needs to tdm@library.ucsb.edu.

arXiv Bulk Data

Full-text and metadata bulk downloads for open access scholarship in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.

Awesome Public Datasets

Check out the Natural Language category for a list of text corpora and ngrams for text analysis.

B

BioMed Central

Peer-reviewed BioMed Central articles are available for text and data mining through the BMC API.

C

Caselaw Access Project

The Caselaw Access Project (“CAP”) expands public access to U.S. law and contains over 360 years (going back to 1658) of published U.S. court decisions, digitized from the collection of the Harvard Law Library.

Chronicling America (Library of Congress)

The Chronicling America API provides access to information about historic U.S. newspapers and millions of digitized newspaper pages, and their OCR data is available for bulk download. See the complete list of digitized newspaper titles (1836-1922) for more information.

Company of Biologists

Article full text, metadata, and citations may be crawled for the purpose of an electronic analysis without special permission or registration, on the condition that it is non-commercial. Any reuse of content for this purpose shall be subject to the restrictions set out in this table.

CORE: Open Access Research Papers

CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. They also provide full datasets by request.

Corpus Resource Database (CoRD)

CoRD provides links to and descriptions of a large number of corpora, subcorpora, and databases. 

CourtListener APIs and Bulk Legal Data

Opinions, docket files, and more from 420 courts.

COVID-19 Open Research Dataset (CORD-19)

A free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses.

D

Data.gov

Catalog of hundreds of thousands of public data sets created at the city, state, and federal levels.

Delpher (Dutch language resources)

Dutch newspapers, books, journals, and radio bulletins are available in full-text, along with rich datasets, APIs, and other digital humanities tools for interaction.

Digital Public Library of America (DPLA)

DPLA’s API provides programmatic search and access to every item in the DPLA catalog. Use the API to power an app, to wire DPLA into your portal, or to retrieve data.

Documenting the American South Digital Collections

Multiple collections of digitized primary sources related to southern history, literature, and culture. Some collections offer plain-text downloads in their entirety: The Church in the Southern Black Community, First-Person Narratives of the American South, Library of Southern Literature, North American Slave Narratives.

E

Elsevier (ScienceDirect)

Researchers can text mine UCSB-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes, and make sure to query the API from UCSB IPs to ensure full access.

English-Corpora.org

English-Corpora.org is the most widely used collection of corpora (highly searchable collections of texts) anywhere in the world. The corpora have been used as the basis of thousands of academic articles, theses, and dissertations, and they form the backbone of courses on language and linguistics throughout the world, at all levels of instruction. Virtually every book on “teaching English with corpora” in the last 5-10 years has focused primarily on these corpora (which are also sometimes called the “BYU Corpora”, for the university where they were created). Since the first corpora were released in 2005, a total of seventeen corpora have been created.

F

FDSys: Bulk Data

Bulk data downloads of major US Government publications including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE), and more.

Folger Shakespeare Library

This site offers free downloadable files of the Folger Shakespeare texts in six digital formats: PDF, DOC (for Microsoft Word, Apple Pages, Apache Open Office, etc.), HTML, TXT (i.e., plain text), XML, and TEI Simple. These files are free to use for all non-commercial purposes.

G

Google Books

The Google Books Ngram Viewer allows you to graph word frequency across the corpus of Google Books.

H

HathiTrust Research Center (HTRC)

Nearly 14 million books from the HathiTrust Library are currently available for analysis, offering various levels of immediate access. 

Note: Berkeley has a guide to computational research with the HathiTrust Research Center.

I

IEEE Xplore Digital Library

The IEEE Xplore Metadata API provides access to metadata for millions of documents available in the IEEE Xplore Digital Library, including IEEE journals, conferences, books/ebooks, courses, and standards. It is accessible through the IEEE Xplore API Portal.

Inter-University Consortium for Political and Social Research (ICPSR)

ICPSR receives, processes, and distributes data on social phenomena in countries across the world. ICPSR maintains a data archive on topics in the social and behavioral sciences, including specialized collections in education, aging, criminal justice, substance abuse, terrorism, and other fields. The archive includes survey data, census records, election returns, economic data, and legislative records.

Internet Archive Scholar

Internet Archive Scholar includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth-century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web. You can search the collection through its API.

J

JSTOR Data for Research

Data for Research allows you to download word frequencies, citations, key terms, and ngrams for up to 25,000 JSTOR articles at a time, or to easily submit requests for larger sets of articles. See also:

  • JSTOR DfR on GitHub - A number of Python and R packages to work with JSTOR DfR data.
  • JSTOR's Text Analyzer - A reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials.

L

Library of Congress: 25 million bibliographic metadata records

The Library of Congress has released 25 million MARC records for free bulk download. MARC (Machine-Readable Cataloging) is an international metadata standard for the representation and communication of bibliographic and related information.

N

New York Times

The Article Search API provides access to headlines, abstracts, lead paragraphs, and more (but NOT full-text articles) from the New York Times, 1851 to present. A New York Times online subscription is available to all UCSB students, faculty, and staff; see this guide for access information and details.
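As a rough illustration of what an Article Search request looks like, the sketch below only builds the request URL and does not contact the service. The endpoint and the `q`, `begin_date`, `end_date`, `page`, and `api-key` parameters reflect our understanding of the public Article Search API; the function name, the search term, and the `YOUR_KEY` placeholder are assumptions made for this example, so check the NYT developer documentation before relying on it.

```python
from urllib.parse import urlencode

# Endpoint as we understand it from the NYT developer portal (verify before use).
NYT_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def article_search_url(query, api_key, begin_date=None, end_date=None, page=0):
    """Build an Article Search URL; dates are strings in YYYYMMDD form."""
    params = {"q": query, "page": page, "api-key": api_key}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    return f"{NYT_SEARCH}?{urlencode(params)}"


# "YOUR_KEY" is a placeholder; a real key comes from your own developer account.
url = article_search_url("suffrage", "YOUR_KEY",
                         begin_date="18510101", end_date="18601231")
```

Each response page holds a limited number of results, so larger harvests would need to step through the `page` parameter and respect the service's rate limits.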

O

Old Bailey Online

The Proceedings of the Old Bailey (1674-1913) and the Ordinary of Newgate's Accounts (1676-1772) contain records from 197,745 criminal trials held at London's central criminal court, along with biographical details of approximately 2,500 men and women executed at Tyburn. Use the site API or download XML files.

Open Academic Graph

Downloadable datasets for citations drawn from two large academic graphs: Microsoft Academic Graph (MAG) and AMiner.

Open American National Corpus

15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities.

Oxford Text Archive

A digital text repository for literary and linguistic data.

P

PLOS (Public Library of Science)

Use this Python program to download, update, and maintain a repository of all PLOS XML article files instead of web scraping. You can also query content from the seven open-access peer-reviewed journals from the Public Library of Science using any of the 23 terms in the PLOS Search.

Programmable Web API Directory

Search over 15,000 APIs, or browse by categories.

Project Gutenberg (Mirror sites)

Project Gutenberg hosts over 50k ebooks, most of which are older books in the public domain. If you want to download more than 100 books/day, use one of the mirror sites listed from the link above.

PubMed and NLM: Data Guide

A guide to using this API, called E-Utilities, to access citation data for medical journal literature in PubMed and other NCBI databases, including the National Library of Medicine Catalog, MeSH, Gene, and PMC (PubMed Central).
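To give a sense of how an E-Utilities search is addressed, here is a minimal sketch that builds an ESearch URL without sending it. The base URL and the `db`, `term`, `retmax`, and `retmode` parameters follow the NCBI E-utilities conventions as we understand them; the function name and the search term are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities, per NCBI's documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL that asks for matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query; swap in your own search term and database.
url = esearch_url("text mining")
```

Fetching that URL should return a JSON document listing matching record IDs, which can then be passed to other E-utilities to retrieve citation details. NCBI asks users to keep request rates low and to register for an API key for heavier use.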

R

Re3data.org

Browse and search thousands of disciplinary, institutional, and generalist data repositories that include textual data.

Reddit Datasets

A subreddit for sharing and discussing datasets.

Royal Society

Members of subscribing institutions have permission to mine journal content. Before carrying out data or text mining, follow the instructions on this page to contact the Royal Society. In addition, it is a requirement of Royal Society journals that authors deposit their data, code, and research materials in public repositories.

S

Scottish Corpus of Text & Speech (1945-present)

The Scottish Corpora project has created large electronic corpora (over 4.5 million words) of written and spoken texts for the languages of Scotland. See also the Helsinki Corpus of Older Scots (1450-1700) and the Corpus of Modern Scottish Writing (1700-1945).

Springer Digital Content (SpringerLink)

Individual researchers can download subscription (and open access) journal articles and books for TDM purposes directly from Springer Nature’s content platforms. They are requested to limit this to 1 request per second. The selection of desired articles can be conducted by using existing search methods and tools, such as PubMed, Web of Science, or Springer Nature’s Metadata API, among others. An API key can be requested for researchers who want to use Springer Nature’s TDM APIs. The use of the API provides additional querying parameters and a higher bandwidth for content requests (150 requests per minute).
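One way to stay within the request limits described above is to space calls out on the client side. The sketch below is a simple, assumed approach rather than anything Springer Nature prescribes: the `RateLimiter` class and its parameter names are hypothetical, and only the two limits (1 request per second without a key, 150 requests per minute with an API key) come from the description above.

```python
import time


class RateLimiter:
    """Space out calls so they are at least `min_interval` seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 1 request per second for direct downloads; 150 per minute with an API key.
no_key_limiter = RateLimiter(requests_per_minute=60)
api_key_limiter = RateLimiter(requests_per_minute=150)
```

Calling `no_key_limiter.wait()` immediately before each download keeps successive requests at least one second apart; the API-key limiter allows one request every 0.4 seconds.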

Stanford Cable TV Analyzer

Write queries that compute the amount of time people appear and the number of times words are heard in cable TV news. Data was compiled from the Internet Archive's collection of 24-7 recordings of CNN, Fox News, and MSNBC from January 1, 2010, to the present, and updates daily (with a 24-36 hour lag from the original air date).

T

Text Creation Partnership

The Text Creation Partnership has produced thousands of accurate, searchable, full-text transcriptions of early print books. It provides the full text of Early English Books Online, Eighteenth Century Collections Online, and Evans Early American Imprints.

The General Index

The General Index consists of 3 tables derived from 107,233,728 journal articles. A table of n-grams, ranging from unigrams to 5-grams, is extracted using spaCy. Each of the 355,279,820,087 rows of the n-gram table consists of an n-gram coupled with a journal article id. A second table is constructed using YAKE and consists of 19,740,906,314 rows, each with a keyword and an article id. A third table associates an article id with metadata.
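To make the shape of the n-gram table concrete, the sketch below produces rows that pair each n-gram (unigram through 5-gram) with an article id. It is only an approximation of the idea: it splits on whitespace instead of using spaCy, and the function name, the article id, and the exact row layout are assumptions for illustration, not the General Index's actual schema.

```python
def ngram_rows(article_id, text, max_n=5):
    """Return (ngram, article_id) rows for every 1- to max_n-gram in the text."""
    # Whitespace tokenization is a stand-in for a real tokenizer such as spaCy.
    tokens = text.lower().split()
    rows = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            rows.append((" ".join(tokens[i:i + n]), article_id))
    return rows


# "article-001" is a made-up identifier used only for this example.
rows = ngram_rows("article-001", "Text data mining")
```

A three-word text yields three unigrams, two bigrams, and one trigram, which suggests how quickly the row count grows with article length.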

W

Wiley

Wiley supports TDM on Wiley content. Access to content for TDM purposes takes place through the Crossref Text API. Visit Text and Data Mining for Researchers for details.
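Crossref metadata records can carry full-text links that publishers flag for text mining. The sketch below picks those links out of a record that has already been retrieved and parsed. The `link`, `URL`, and `intended-application` field names follow the Crossref REST API as we understand it; the function name and the sample record (DOI and URLs) are hypothetical, and Wiley may impose additional requirements such as an access token, so consult their Text and Data Mining page.

```python
def text_mining_links(work):
    """Return full-text URLs that a Crossref work record flags for text mining."""
    return [
        link["URL"]
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


# Hypothetical record, shaped like the "message" object of a Crossref works response.
sample_work = {
    "DOI": "10.1000/example",
    "link": [
        {
            "URL": "https://example.org/fulltext.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://example.org/landing",
            "content-type": "text/html",
            "intended-application": "similarity-checking",
        },
    ],
}

tdm_urls = text_mining_links(sample_work)
```

Records with no `link` entries simply yield an empty list, which is a useful signal that the publisher has not exposed a text-mining link for that item.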

Available Sources for Text Data Mining

We have listed a number of sources from which you may consider extracting or obtaining text corpora. Please be advised that this is not a comprehensive list of all existing sources, and that permissions may change rapidly and be highly context-dependent. We encourage you to check the terms and agreements expressed on the websites and reach out to us if you have questions.

Please navigate the menu on the left-hand side for the main categories based on data types, disciplinary topics, and provenance, or the A-Z list below:

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z

If you have any questions about access and permissions to the listed sources or those not listed, please contact us: tdm@library.ucsb.edu