Text Data Mining: Social Media & Internet Sources

Social Media & Internet Sources

Blogger Corpus (2004)

The corpus contains 681,288 posts by 19,320 bloggers, gathered from blogger.com in August 2004.

Internet Archive

How to download files from archive.org in an automated way using wget.
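
As a rough illustration of scripted downloading, the sketch below builds download links for one item from its metadata record. It assumes the public archive.org URL patterns `/metadata/<identifier>` and `/download/<identifier>/<filename>`; the function names and the identifier `example-item` are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote


def metadata_url(identifier):
    """URL of the JSON metadata record for an item (assumed public pattern)."""
    return f"https://archive.org/metadata/{quote(identifier)}"


def download_urls(identifier, metadata, suffix=".txt"):
    """Build download links for every file in the item whose name ends with suffix.

    `metadata` is the parsed JSON from metadata_url(); it is assumed to hold a
    "files" list whose entries each carry a "name" key.
    """
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.endswith(suffix):
            # quote() escapes spaces and other unsafe characters but keeps "/"
            urls.append(
                f"https://archive.org/download/{quote(identifier)}/{quote(name)}"
            )
    return urls


# Hypothetical metadata record, shaped like the response described above.
sample_metadata = {
    "files": [{"name": "a.txt"}, {"name": "b.pdf"}, {"name": "dir/c d.txt"}]
}
links = download_urls("example-item", sample_metadata)
```

Each link in `links` can then be passed to whatever download tool is in use.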

Reddit APIs

Access data from posts, threads, comments, users, and more from Reddit and subreddits.
Historical Reddit data has been collected as monthly CSV downloads.
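
To give a sense of what the post data looks like, here is a minimal sketch that builds a listing URL for a subreddit and flattens a listing response into simple records. It assumes Reddit's JSON listing layout (`data` → `children` → `data`); the helper names and the sample payload are hypothetical, and live requests may require OAuth credentials and are subject to Reddit's rate limits and terms.

```python
import json
from urllib.parse import urlencode


def listing_url(subreddit, sort="new", limit=25):
    """Build the URL of a subreddit's JSON listing (assumed public pattern)."""
    query = urlencode({"limit": limit})
    return f"https://www.reddit.com/r/{subreddit}/{sort}.json?{query}"


def extract_posts(listing):
    """Flatten a listing into one small dict per post.

    Missing fields come back as None rather than raising an error.
    """
    posts = []
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        posts.append(
            {
                "id": data.get("id"),
                "title": data.get("title"),
                "author": data.get("author"),
                "num_comments": data.get("num_comments"),
            }
        )
    return posts


# Hypothetical payload shaped like a listing response; not real Reddit data.
sample = json.loads(
    '{"kind": "Listing", "data": {"children": ['
    '{"kind": "t3", "data": {"id": "abc123", "title": "Example post", '
    '"author": "example_user", "num_comments": 4}}]}}'
)
posts = extract_posts(sample)
```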

Stanford Large Network Dataset Collection

The SNAP library has collected data on large social and information networks since 2004.

Twitter Data

The DREAM Lab supports research using social media data sources, with a focus on access to and use of Twitter data. They provide consultation and instruction on a variety of tools and techniques, including Brandwatch (formerly Crimson Hexagon) and NCapture.

Wikipedia Data Dumps

Monthly database backups of all Wikimedia wikis in various formats.

Yelp API

Access to business data, including location, photos, Yelp rating, price levels, hours of operation, and types of transactions. Also includes a Review API, which returns up to 3 review excerpts for a business.
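
As an illustration of the Review API, the sketch below prepares an authenticated request for one business's reviews and pulls the rating and excerpt out of a response. It assumes the Yelp Fusion v3 endpoint `/businesses/<id>/reviews` with a bearer-token header and a response containing a `reviews` list; the function names, business ID, and API key are placeholders, and the request is built but never sent.

```python
from urllib.parse import quote
from urllib.request import Request

YELP_BASE = "https://api.yelp.com/v3"  # assumed base URL for Yelp Fusion


def reviews_request(business_id, api_key):
    """Prepare (but do not send) a request for a business's review excerpts."""
    url = f"{YELP_BASE}/businesses/{quote(business_id, safe='')}/reviews"
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


def review_excerpts(payload):
    """Return (rating, text) pairs from a parsed reviews response."""
    return [(r.get("rating"), r.get("text", "")) for r in payload.get("reviews", [])]


# Placeholder ID and key; substitute real values before sending the request,
# for example with urllib.request.urlopen(req).
req = reviews_request("some-cafe-city", "YOUR_API_KEY")

# Hypothetical response body, shaped as assumed above.
excerpts = review_excerpts({"reviews": [{"rating": 5, "text": "Great coffee..."}]})
```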