Adam Matthew Collection: Empire Online
The Adam Matthew API can be used to return metadata and full text from documents, images, and sections in Adam Matthew collections covered by a current UCSB Library license. Currently, API access is available for Empire Online. If you require offline data access, please send an email describing your research project and data needs to tdm@library.ucsb.edu.
Full-text and metadata bulk downloads for open access scholarship in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.
Check out the Natural Language category for a list of text corpora and ngrams for text analysis.
The Caselaw Access Project (“CAP”) expands public access to U.S. law and contains over 360 years (going back to 1658) of published U.S. court decisions, digitized from the collection of the Harvard Law Library.
Chronicling America (Library of Congress)
The Chronicling America API provides access to information about historic U.S. newspapers and millions of digitized newspaper pages, and their OCR data is available for bulk download. See the complete list of digitized newspaper titles (1836-1922) for more information.
Article full text, metadata, and citations may be crawled for the purpose of an electronic analysis without special permission or registration, on the condition that it is non-commercial. Any reuse of content for this purpose shall be subject to the restrictions set out in this table.
CORE: Open Access Research Papers
CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. They also provide full datasets by request.
Corpus Resource Database (CoRD)
CoRD provides links to and descriptions of a large number of corpora, subcorpora, and databases.
CourtListener APIs and Bulk Legal Data
Opinions, docket files, and more from 420 courts.
COVID-19 Open Research Dataset (CORD-19)
A free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses.
Catalog of hundreds of thousands of public data sets created at the city, state, and federal levels.
Delpher (Dutch language resources)
Dutch newspapers, books, journals, and radio bulletins are available in full-text, along with rich datasets, APIs, and other digital humanities tools for interaction.
Digital Public Library of America (DPLA)
DPLA’s API provides programmatic search and access to every item in the DPLA catalog. Use the API to power an app, to wire DPLA into your portal, or retrieve data.
Documenting the American South Digital Collections
Multiple collections of digitized primary sources related to southern history, literature, and culture. Some collections offer plain-text downloads in their entirety: The Church in the Southern Black Community, First-Person Narratives of the American South, Library of Southern Literature, North American Slave Narratives.
Researchers can text mine UCSB-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes, and make sure to query the API from UCSB IPs to ensure full access.
English-Corpora.org is the most widely used collection of corpora (highly searchable collections of texts) anywhere in the world. The corpora have been used as the basis of thousands of academic articles, theses, and dissertations, and they form the backbone of courses on language and linguistics throughout the world, at all levels of instruction. Virtually every book on “teaching English with corpora” in the last 5-10 years has focused primarily on these corpora (which are also sometimes called the “BYU Corpora”, for the university where they were created). Since the first corpora were released in 2005, a total of seventeen corpora have been created.
Bulk data downloads of major US Government publications including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE), and more.
This site offers free downloadable files of the Folger Shakespeare texts in six digital formats: PDF, DOC (for Microsoft Word, Apple Pages, Apache Open Office, etc.), HTML, TXT (i.e., plain text), XML, and TEI Simple. These files are free to use for all non-commercial purposes.
The Google Ngram Viewer allows you to graph word frequency across the corpus of Google Books.
HathiTrust Research Center (HTRC)
Nearly 14 million books from the HathiTrust Library are currently available for analysis, offering various levels of immediate access.
Note: Berkeley has a guide on computational research with the HathiTrust Research Center.
The IEEE Xplore Metadata API provides access to metadata for millions of documents available in the IEEE Xplore Digital Library including IEEE journals, conferences, books/ebooks, courses, and standards; accessible through the IEEE Xplore API Portal.
Inter-University Consortium for Political and Social Research (ICPSR)
ICPSR receives, processes, and distributes data on social phenomena in countries across the world. ICPSR maintains a data archive on topics in the social and behavioral sciences, including specialized collections in education, aging, criminal justice, substance abuse, terrorism, and other fields. Includes survey data, census records, election returns, economic data, and legislative records.
The Internet Archive Scholar includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth-century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web. Search on its API.
Data for Research allows you to download word frequencies, citations, key terms, and ngrams for up to 25,000 JSTOR articles at a time, or to submit requests for larger sets of articles.
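As a rough illustration of working within the 25,000-article limit, the sketch below splits a longer list of article identifiers into batches no larger than that limit. The identifiers and the `batches` helper are hypothetical and are not part of any JSTOR tool; how a batch is actually submitted depends on the service's current interface.

```python
from typing import Iterator, List, Sequence

MAX_ARTICLES_PER_REQUEST = 25_000  # limit stated in the description above


def batches(ids: Sequence[str], size: int = MAX_ARTICLES_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Hypothetical identifiers, for illustration only.
article_ids = [f"article-{n}" for n in range(60_000)]
batch_sizes = [len(batch) for batch in batches(article_ids)]
print(batch_sizes)
```

Each batch can then be requested separately, while sets much larger than the limit are probably better handled through the request process mentioned above.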
Library of Congress: 25 million bibliographic metadata records
The Library of Congress has released 25 million MARC records for free bulk download. MARC (Machine-Readable Cataloging) is an international metadata standard for the representation and communication of bibliographic and related information.
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present. A New York Times online subscription is available to all UCSB students, faculty, and staff; see this guide for access information and details.
The Proceedings of the Old Bailey (1674-1913) and the Ordinary of Newgate's Accounts (1676-1772) contain records from 197,745 criminal trials held at London's central criminal court, along with biographical details of approximately 2,500 men and women executed at Tyburn. Use the site API or download XML files.
Downloadable datasets for citations drawn from two large academic graphs: Microsoft Academic Graph (MAG) and AMiner.
15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities.
A digital text repository for literary and linguistic data.
PLOS (Public Library of Science)
Use Python for downloading/updating/maintaining a repository of all PLOS XML article files. Use this program to download all PLOS XML article files instead of doing web scraping. Query content from the seven open-access peer-reviewed journals from the Public Library of Science using any of the 23 terms in the PLOS Search.
Programmable Web API Directory
Search over 15,000 APIs, or browse by categories.
Project Gutenberg (Mirror sites)
Project Gutenberg hosts over 50k ebooks, most of which are older books in the public domain. If you want to download more than 100 books/day, use one of the mirror sites listed from the link above.
A guide to using this API, called E-Utilities, to access citation data for medical journal literature in PubMed and other NCBI databases, including the National Library of Medicine Catalog, MeSH, Gene, and PMC (PubMed Central).
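A minimal sketch of what an E-Utilities request looks like, assuming the public ESearch endpoint and its `db`, `term`, `retmax`, and `retmode` parameters. The `esearch_url` helper and the search term are illustrative only; the code builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-Utilities endpoints.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Build an ESearch URL that asks for matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query; swap in your own search term and database.
url = esearch_url("text mining")
print(url)
```

The resulting URL can be fetched with any HTTP client. Check NCBI's usage guidelines before sending requests in bulk.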
Browse and search thousands of disciplinary, institutional, and generalist data repositories that include textual data.
A subreddit for sharing and discussing datasets.
Members of subscribing institutions have permission to mine journal content. Before carrying out data or text mining, follow the instructions on this page to contact the Royal Society. In addition, it is a requirement of Royal Society journals that authors deposit their data, code, and research materials in public repositories.
Scottish Corpus of Text & Speech (1945-present)
The Scottish Corpora project has created large electronic corpora (over 4.5 million words) of written and spoken texts for the languages of Scotland. See also the Helsinki Corpus of Older Scots (1450-1700) and the Corpus of Modern Scottish Writing (1700-1945).
Springer Digital Content (SpringerLink)
Individual researchers can download subscription (and open access) journal articles and books for TDM purposes directly from Springer Nature’s content platforms. They are requested to limit this to 1 request per second. The selection of desired articles can be conducted by using existing search methods and tools, such as PubMed, Web of Science, or Springer Nature’s Metadata API, among others. An API key can be requested for researchers who want to use Springer Nature’s TDM APIs. The use of the API provides additional querying parameters and a higher bandwidth for content requests (150 requests per minute).
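One way to respect the request limits described above is to space calls out on the client side. The sketch below is an assumption about how a script might do this, not Springer Nature code: the `MinIntervalLimiter` class is hypothetical, and the download call itself is left out.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Block so that successive calls are at least `interval` seconds apart."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous permitted call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# 1 request per second for direct platform downloads;
# 150 requests per minute (0.4 seconds apart) when using an API key.
platform_limiter = MinIntervalLimiter(interval=1.0)
api_limiter = MinIntervalLimiter(interval=60 / 150)
```

A script would call `platform_limiter.wait()` (or `api_limiter.wait()`) immediately before each request, so the pacing holds no matter how quickly the rest of the loop runs.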
Write queries that compute the amount of time people appear and the number of times words are heard in cable TV news. Data is compiled from the Internet Archive's collection of 24-7 recordings of CNN, Fox News, and MSNBC from January 1, 2010, to the present, and is updated daily (with a 24-36 hour lag from the original air date).
The Text Creation Partnership has produced thousands of accurate, searchable, full-text transcriptions of early print books. It provides the full text of Early English Books Online, Eighteenth Century Collections Online, and Evans Early American Imprints.
The General Index consists of 3 tables derived from 107,233,728 journal articles. A table of n-grams, ranging from unigrams to 5-grams, is extracted using spaCy. Each of the 355,279,820,087 rows of the n-gram table consists of an n-gram coupled with a journal article id. A second table is constructed using YAKE and consists of 19,740,906,314 rows, each with a keyword and an article id. A third table associates an article id with metadata.
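To make the shape of the n-gram table concrete, the sketch below produces rows that pair an n-gram (unigram through 5-gram) with an article id. It is an illustration only: it uses a naive lowercase whitespace tokenizer rather than spaCy, and the `ngram_rows` helper, sample sentence, and article id are hypothetical.

```python
from typing import Iterator, List, Tuple


def ngram_rows(text: str, article_id: str, max_n: int = 5) -> Iterator[Tuple[str, str]]:
    """Yield (n-gram, article_id) rows for n = 1..max_n."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokens: List[str] = text.lower().split()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), article_id


# Hypothetical sentence and article id, for illustration only.
rows = list(ngram_rows("Text mining supports new research", "article-001"))
print(len(rows))
print(rows[:3])
```

A five-token sentence yields 5 + 4 + 3 + 2 + 1 = 15 rows, which gives a sense of why the full table runs to hundreds of billions of rows.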
Wiley supports TDM on Wiley content. Access to content for TDM purposes takes place through the Crossref Text API. Visit Text and Data Mining for Researchers for details.
We have listed a number of sources from which you may consider extracting or obtaining text corpora. Please be advised that this is not a comprehensive list of all existing sources, and that permissions might change rapidly and be highly context-dependent. We encourage you to check the terms and agreements expressed on the websites and reach out to us if you have questions.
Please navigate the menu on the left-hand side for the main categories based on data types, disciplinary topics, and provenance, or the A-Z list below:
If you have any questions about access and permissions to the listed sources or those not listed, please contact us: tdm@library.ucsb.edu