Text Data Mining: Linguistic Corpora

Linguistic Corpora Sources

Corpus Resource Database (CoRD)

CoRD provides links to and descriptions of a large number of corpora, subcorpora, and databases. (University of Helsinki)

English-Corpora.org

English-Corpora.org is the most widely used collection of corpora (highly searchable collections of texts) anywhere in the world. The corpora have been used as the basis of thousands of academic articles, theses, and dissertations, and they form the backbone of courses on language and linguistics throughout the world, at all levels of instruction. Virtually every book on “teaching English with corpora” in the last 5-10 years has focused primarily on these corpora (which are also sometimes called the “BYU Corpora”, for the university where they were created). Since the first corpora were released in 2005, a total of seventeen corpora have been created.

Kaggle NLP Datasets

Kaggle, a data science competition and learning platform, hosts many user-contributed datasets that are open for reanalysis.

Open American National Corpus

15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities.

Oxford Text Archive

A digital text repository for literary and linguistic data.

Scottish Corpus of Text & Speech (1945-present)

The Scottish Corpora project has created large electronic corpora (over 4.5 million words) of written and spoken texts for the languages of Scotland. See also the Helsinki Corpus of Older Scots (1450-1700) and the Corpus of Modern Scottish Writing (1700-1945).