
Text Data Mining: Literature

Literature Sources

Delpher (Dutch language resources)

Dutch newspapers, books, journals, and radio bulletins are available in full text, along with rich datasets, APIs, and other digital humanities tools for working with the material.

Digital Public Library of America (DPLA)

DPLA’s API provides programmatic search and access to every item in the DPLA catalog. Use the API to power an app, to wire DPLA into your portal, or to retrieve data.
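As a rough illustration of programmatic search, the sketch below builds a keyword query URL and, optionally, fetches the JSON response. It assumes the v2 items endpoint at `https://api.dp.la/v2/items` with `q`, `page_size`, and `api_key` parameters, and that you have requested your own API key; the function names are hypothetical, so check the DPLA API documentation before relying on any of it.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint for item search; confirm against the DPLA API documentation.
DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"


def build_dpla_search_url(query, api_key, page_size=10):
    """Return a search URL for a keyword query (hypothetical helper)."""
    params = {"q": query, "page_size": page_size, "api_key": api_key}
    return f"{DPLA_ITEMS_ENDPOINT}?{urlencode(params)}"


def search_dpla(query, api_key, page_size=10):
    """Fetch one page of results as a Python dict (requires network access)."""
    url = build_dpla_search_url(query, api_key, page_size)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `build_dpla_search_url("slave narratives", "YOUR_KEY")` only constructs the URL, which is a convenient way to inspect a query before sending it.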

Documenting the American South Digital Collections

Multiple collections of digitized primary sources related to southern history, literature, and culture. Some collections offer plain-text downloads in their entirety: The Church in the Southern Black Community, First-Person Narratives of the American South, Library of Southern Literature, North American Slave Narratives.

Folger Shakespeare Library

This site offers free downloadable files of the Folger Shakespeare texts in six digital formats: PDF, DOC (for Microsoft Word, Apple Pages, Apache OpenOffice, etc.), HTML, TXT (i.e., plain text), XML, and TEI Simple. These files are free to use for all non-commercial purposes.

Google Books

The Google Books Ngram Viewer lets you graph word frequency over time across the Google Books corpus.
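To make "word frequency across a corpus" concrete, here is a toy sketch that computes a word's relative frequency per year from a small dictionary of texts. It is not how Google computes its figures; the tokenizer, the function name, and the sample data are all assumptions made for illustration.

```python
import re
from collections import Counter


def relative_frequency(texts_by_year, word):
    """Map each year to the share of tokens in that year's text equal to `word`.

    Hypothetical helper: tokens are lowercase runs of letters and apostrophes.
    """
    word = word.lower()
    result = {}
    for year, text in sorted(texts_by_year.items()):
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(tokens)
        # Guard against empty texts to avoid dividing by zero.
        result[year] = counts[word] / len(tokens) if tokens else 0.0
    return result


# Made-up sample data, one short "corpus" per year.
sample = {1850: "the whale the sea", 1900: "the city"}
print(relative_frequency(sample, "whale"))
```

Plotting the resulting values by year gives the same kind of curve, on a much smaller scale.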

HathiTrust Research Center (HTRC)

Nearly 14 million books from the HathiTrust Library are currently available for analysis, with varying levels of immediate access.

Oxford Text Archive

A digital text repository for literary and linguistic data. 

Project Gutenberg (Mirror sites)

Project Gutenberg hosts over 50,000 ebooks, most of which are older books in the public domain. If you want to download more than about 100 books per day, use one of the mirror sites listed in the link above.

Text Creation Partnership

The Text Creation Partnership has produced thousands of accurate, searchable, full-text transcriptions of early print books. It provides the full text of Early English Books Online, Eighteenth Century Collections Online, and Evans Early American Imprints.