Text Data Mining: Historical & Archival Collections

Historical & Archival Collections Sources

Adam Matthew Collection: Empire Online

The Adam Matthew API can be used to return metadata and full text from documents, images, and sections for Adam Matthew Collections under a current UCSB Library license. Currently, API access is available for Empire Online. If you require offline data access, please send an email describing your research project and data needs to tdm@library.ucsb.edu.

Chronicling America (Library of Congress)

The Chronicling America API provides access to information about historic U.S. newspapers and millions of digitized newspaper pages. The OCR data (https://chroniclingamerica.loc.gov/ocr/) is also available for bulk download. See the complete list of digitized newspaper titles (https://chroniclingamerica.loc.gov/search/titles/#tab=tab_newspapers), covering 1836-1922, for more information.
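
As a rough illustration of querying the API, the sketch below builds a full-text page search URL and, optionally, fetches one page of results. It assumes the search endpoint at /search/pages/results/ and the query parameters andtext, date1, date2, dateFilterType, rows, page, and format=json; it also assumes the JSON response lists matching pages under an "items" key. The function names are hypothetical, and the endpoint may change, so check the current API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed search endpoint for digitized newspaper pages.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, year_from=1836, year_to=1922, rows=20, page=1):
    """Build a full-text search URL that asks for JSON results.

    The parameter names are assumptions based on the public search form.
    """
    params = {
        "andtext": terms,              # all of these words must appear
        "date1": year_from,            # first year of the range
        "date2": year_to,              # last year of the range
        "dateFilterType": "yearRange",
        "rows": rows,                  # results per page
        "page": page,                  # 1-based page of results
        "format": "json",
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_page_records(url, timeout=30):
    """Download one page of results and return the list of page records.

    Assumes the response is a JSON object with an "items" list; returns an
    empty list if that key is absent.
    """
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("items", [])


if __name__ == "__main__":
    url = build_search_url("railroad strike", year_from=1890, year_to=1900)
    print(url)
    # Uncomment to make a live request (be considerate of rate limits):
    # records = fetch_page_records(url)
    # print(len(records), "page records returned")
```

For large corpora, the bulk OCR download is likely a better fit than paging through search results one request at a time.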

Documenting the American South Digital Collections

Multiple collections of digitized primary sources related to southern history, literature, and culture. Some collections offer plain-text downloads in their entirety: The Church in the Southern Black Community, First-Person Narratives of the American South, Library of Southern Literature, North American Slave Narratives.

HathiTrust Research Center (HTRC)

Nearly 14 million books from the HathiTrust Library are currently available for analysis, offering various levels of immediate access. Sign up for an account at https://analytics.hathitrust.org/signuppage [Note: Berkeley has a guide on HTRC at https://guides.lib.berkeley.edu/c.php?g=491766&p=3381443]

Old Bailey Online

The Proceedings of the Old Bailey (1674-1913) and the Ordinary of Newgate's Accounts (1676-1772) contain records of 197,745 criminal trials held at London's central criminal court, along with biographical details of approximately 2,500 men and women executed at Tyburn. Use the site API or download the XML files.

Project Gutenberg (Mirror sites)

Project Gutenberg hosts over 50,000 ebooks, most of which are older books in the public domain. If you want to download more than 100 books per day, use one of the mirror sites listed at the link above.