LibGuides: Text Data Mining: Obtaining Data

Methods for Obtaining Data

You may run into obstacles around copyright, technical knowledge, or licensing agreements when you attempt to access specific text corpora, so we highly encourage you to contact us at tdm@library.ucsb.edu before you begin your project. We also suggest that you review the "Getting Started" tab for key questions about data acquisition and gathering, and consult “Exploring Data Sources” for a list of websites that allow access to their materials.

You may obtain your text corpora through one of these methods: 1) manual download of materials, 2) web scraping, 3) API calls, and 4) character recognition. The first three methods apply to born-digital materials (i.e., those that were created in digital format, not later converted to digital), which are more often than not hosted on the web. While not impossible, text mining non-digital materials is a more complex task that requires an additional step to produce the corpora through Optical Character Recognition (OCR) or Intelligent Character Recognition (ICR). You may use the Library's free scanners, which have built-in OCR programs that convert text into digital form. Please refer to the boxes below for more information on each of these methods.

Please be advised that most vendor licenses do not permit mass downloading of data from the UCSB Library's subscription content. Unauthorized data scraping violates many of the UCSB Library's licenses and will result in the vendor(s) shutting down access to content for the IP address where the downloading is being done. If this happens, the entire UCSB community will be denied access to the specific databases where the mass downloads occurred. Please read the bottom of the page for more information on copyright issues and text data mining.

To learn more about these different approaches to text data gathering, please consult with us: tdm@library.ucsb.edu

Application Programming Interfaces (APIs)

There are different types of APIs. If a website provides access to an API, you can reach it in one of three ways:

Use a web browser to access it
Use Python or another programming language to access it
Use a third-party app with an interface to access it

One important note about accessing APIs is that most require a “key.” You can usually obtain a key by filling out an application or completing some kind of identity verification. Not all sites provide API access, and some APIs are completely open, but because most require a key, make sure to build the application process into your research timeline.
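As a minimal sketch of the second option (accessing an API from Python), the example below builds a request URL that carries a key and shows how a JSON response could be retrieved. The endpoint `https://api.example.org/v1/articles` and the parameter names `q`, `page`, and `api_key` are hypothetical placeholders, not any real provider's API; each provider documents its own endpoint, parameters, and way of passing the key (some expect it in a request header rather than in the URL).

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the one in your provider's documentation.
BASE_URL = "https://api.example.org/v1/articles"


def build_request_url(base_url, api_key, query, page=1):
    """Return the request URL with the query, page number, and key URL-encoded.

    The parameter names ("q", "page", "api_key") are assumptions for
    illustration; real APIs define their own.
    """
    params = {"q": query, "page": page, "api_key": api_key}
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON body of the response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access, so it is safe to try as-is.
example_url = build_request_url(BASE_URL, "YOUR_KEY_HERE", "climate change")
print(example_url)

# Once you have a real endpoint and key, you could retrieve results with:
# results = fetch_json(example_url)
```

Keeping the key in a variable rather than typing it into each URL makes it easier to avoid sharing it by accident when you publish your code.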

Manual Download of Materials

While this is often the most time-consuming method, it requires the least amount of technical knowledge and carries the fewest legal concerns. There are three common ways you may be able to manually download your research materials.

Bulk download: The vendor, website host, or owner of the data provides it to you. This could be because it’s available for free as a dataset, or you ask and they provide it to you. You may also be asked to pay for access.
Batch download: You use a search interface such as EBSCO to find a set of papers you wish to use for your research. Some databases allow you to download results in batches, such as 100 or 500 papers at once; in others, you may have to download papers one at a time.
Download one-by-one: In the case that the vendor or owner of the data will not or cannot provide bulk download or when there isn’t a search interface that allows batch download, you may need to resort to downloading the documents you're interested in individually.

Web Scraping

There are a variety of ways to scrape a website to extract information for reuse. In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this can be impractical if there is a large amount of data to be extracted, or if it is spread over a large number of pages. Instead, specialized tools and techniques can be used to automate this process. Automating web scraping also allows you to run the process at regular intervals and capture changes in the data. There are three ways to obtain data from web resources through web scraping:

Browser extensions (e.g. Scraper)
Scraping programmatically (e.g., Python, R)
Third-party services

Please refer to the Web Scraping Workshop for more information on the first two approaches to web scraping.
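To give a sense of what scraping programmatically involves, here is a minimal sketch that extracts paragraph text from a page's HTML using only Python's standard library. It works on a small sample string rather than a live website, so it can be run without downloading anything; the class name `ParagraphExtractor` and the sample HTML are made up for illustration. Real pages are usually messier, and dedicated scraping libraries handle more cases than this sketch does.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements of an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a paragraph opens.
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        # When the paragraph closes, normalize whitespace and store the text.
        if tag == "p" and self._in_paragraph:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return a list with the text of each <p> element in the HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample page used in place of a downloaded one.
SAMPLE_HTML = """
<html><body>
<h1>Sample page</h1>
<p>First   paragraph with <b>bold</b> text.</p>
<p>Second paragraph.</p>
</body></html>
"""

print(extract_paragraphs(SAMPLE_HTML))
# ['First paragraph with bold text.', 'Second paragraph.']
```

Before pointing a script like this at a real site, check the site's terms of use and the licensing restrictions described at the bottom of this page.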

OCR/ICR (Non-Digital Materials)

You may have to work with text that is available in print only. To produce the text corpora that you will analyze, you must first convert the analog materials to digital form. Optical Character Recognition (OCR) is the electronic conversion of images of typed, handwritten, or printed text into machine-encoded text. OCR translates a document into an editable format, and some database programs may accept input directly from the OCR reader. A more robust version of the OCR technology is Intelligent Character Recognition (ICR), which allows fonts and different handwriting styles to be learned by a computer during processing, thus improving accuracy and pattern recognition levels. Most scanners come with OCR software, and there are many free options on the market. Regardless of how robust your character recognition software is, plan to check and verify the text before performing any analyses.
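One way to support that checking step is to flag tokens that look like typical OCR misreads so that you know where to look first. The sketch below is a simple heuristic under stated assumptions, not a substitute for proofreading: it flags words that mix letters and digits (such as "qu1ck") or contain symbols that rarely appear in running prose. The patterns and the function name `flag_suspicious_tokens` are illustrative choices, and they will miss misreads that happen to form plausible-looking words.

```python
import re

# Heuristic patterns (assumptions for illustration, not an exhaustive list).
# A token containing both a letter and a digit, e.g. "qu1ck":
MIXED_ALNUM = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")
# Any character other than letters, digits, whitespace, and common punctuation:
ODD_SYMBOLS = re.compile(r"[^\w\s.,;:!?'\"()\-]")


def flag_suspicious_tokens(text):
    """Return the tokens in text that match one of the heuristic patterns."""
    flagged = []
    for token in text.split():
        # Ignore ordinary punctuation attached to the start or end of a word.
        core = token.strip(".,;:!?'\"()")
        if not core:
            continue
        if MIXED_ALNUM.search(core) or ODD_SYMBOLS.search(core):
            flagged.append(token)
    return flagged


# Made-up sample of OCR output with two deliberate errors.
sample = "The qu1ck brown fox jumps over the la~y dog in 1998."
print(flag_suspicious_tokens(sample))
# ['qu1ck', 'la~y']
```

Pure numbers such as years are not flagged, so dates in your corpus will not flood the results.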

Library Support & TDM Restrictions with Licensed Electronic Resources

TDM is frequently classified as fair use under US copyright law. However, the Terms of Use for Library resources are governed by license agreements with publishers and other entities. It may be possible to perform limited scraping or crawling on databases without infringing the licensing terms, but proceeding without understanding and following a vendor's unique protocol can jeopardize your research as well as access to the resource for the entire UCSB campus (or the entire UC system). Unauthorized activity, such as scraping, bulk downloading, crawling, or the use of bots, will cause vendors to revoke access for the entire campus when detected.

The Library is here to help facilitate your research. For all TDM projects that use content from a Library-licensed resource, or from other content sources where you are uncertain of rights or permissions, please contact the Library at tdm@library.ucsb.edu.

The UCSB Library and the California Digital Library (CDL) continue to mediate or attempt to negotiate TDM rights with vendors or third-party aggregators on a case-by-case basis. The UCSB Library or CDL Licensing Team can:

Provide help and support to understand the existing licensing terms for the content already licensed by UCSB or CDL.
Explore negotiating the TDM rights for the content that is already licensed by UCSB or CDL.
Purchase or license content for TDM activities that is not yet owned or licensed by UCSB or CDL.

The UCSB Library will consider purchasing or licensing data and datasets that meet the following criteria:

Utility and usability: The Library will prioritize purchases that support current research, teaching, and learning needs, or that have broad appeal and can be applied in instructional and research settings at UCSB. In addition, the Library will only purchase data that is accessible to all UCSB users: students, staff, and faculty.
Scope: Preference is given to datasets that are one-time purchases and do not require regular updates. Any time period, geographic area, or language will be considered. Numeric, quantitative, geospatial, and textual data will be considered. Data that contains confidential or personally identifiable information (PII) will not be considered.
Quality: Data must be from a credible source that has the right to sell the information. Preference will be given to data and datasets with robust documentation and metadata.
Format and access: Preference will be given to data in non-proprietary file formats, hosted online, and accessed remotely. Preference will be given to vendors who comply with the Americans with Disabilities Act (ADA) by providing data compliant with the latest version of Web Content Accessibility Guidelines (WCAG).
Licensing: The Library will work to license datasets that allow text and data mining (TDM) and enable users to share the results of TDM in their scholarly work. The Library will attempt to have additional terms included in the license, such as scholarly sharing, creation of derivative works, and the ability of the Library to maintain a backup copy of the data. Licenses for datasets that require patrons to sign individual use agreements are strongly discouraged.

For licensed content from third-party vendors that doesn't meet the above criteria, faculty and students can pursue their own license or rights to use the data with the assistance of the Technology & Industry Alliances Office.