Text Data Mining: Archiving & Sharing the Data

What should you save and preserve?

Not all text corpora are easily re-obtainable, and the ephemeral nature of the web means that a resource or database can become temporarily unavailable, permanently restricted, or even discontinued. A data management plan (DMP) that includes strategies for documenting, storing, and promoting the longevity of the data is critical to the success of your project. But what should be saved and preserved? Below are some tips for organizing your data while making it easier to share and reuse:

 

What? Raw Data
Why? Allows for backtracking
How? Keep a copy of the original data, as downloaded or scraped from the web resource, in its native file format.

What? Processed Data
Why? Supports understandability and transparency
How? Keep a version of the intermediate data that reflects all processes (manipulations and transformations) performed on the raw data, along with a codebook or data dictionary that documents all actions performed (cleaning, sorting, rearranging, etc.). If the original/raw data was in a proprietary format, make sure to create an open equivalent of the file.

What? Analysis Data
Why? Assists repeatability and inspectability
How? Keep an "analysis-ready" copy of your data. This should be the exact same file you would use to re-create visualizations or repeat the analysis if needed.

What? Code/Scripts
Why? Supports reusability
How? Record all code and scripts used for processing and analyzing the data and for producing visualizations. Make sure that your code and scripts are easy to understand, efficient when run, and well-documented.

What? README File
Why? Promotes understandability and reusability
How? Create a simple text master file that includes information and metadata about all files in your project directory. It should describe the underlying data of your project and how files are logically associated with one another.

If you have questions about DMP preparation, data documentation, archiving, or preservation, contact Research Data Services at rds@library.ucsb.edu.

Organizing Your Project Folder

We advise you to keep your project folder organized by scaffolding it into separate locations for data, scripts, code, results, and other research-related materials, including the README.txt file.
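
As a minimal sketch of such a scaffold, the Python snippet below creates one possible layout. The folder names (data/raw, data/processed, data/analysis, scripts, results) and the function name scaffold_project are illustrative assumptions, not a required structure; adapt them to your project.

```python
from pathlib import Path

# Hypothetical folder names; adapt them to your own project.
SUBFOLDERS = [
    "data/raw",        # original data, exactly as downloaded or scraped
    "data/processed",  # intermediate data plus codebook or data dictionary
    "data/analysis",   # analysis-ready copy of the data
    "scripts",         # code for processing, analysis, and visualization
    "results",         # figures, tables, and other outputs
]


def scaffold_project(root):
    """Create the project folders and an empty README.txt under root."""
    root = Path(root)
    for sub in SUBFOLDERS:
        # exist_ok=True makes the function safe to run more than once.
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text("", encoding="utf-8")
    return root
```

Calling scaffold_project("my_tdm_project") would create the folders under a directory of that name. Running it again leaves existing files untouched.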

Documenting Your Code

If your TDM project uses a programming approach (e.g., R or Python) to collect the data, we advise you to document your code following the recommendations and rules outlined in this infographic.
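
The infographic itself is not reproduced here. As a hedged illustration of common documentation practice (a header stating the script's purpose, a docstring describing inputs and outputs, and inline comments explaining why each step exists), consider the short Python sketch below. The function normalize_text and its cleaning steps are hypothetical examples, not rules taken from the infographic.

```python
"""Clean raw text before tokenization (illustrative example only).

Input:  text exactly as downloaded or scraped.
Output: lowercased text with whitespace runs collapsed.
"""
import re


def normalize_text(raw):
    """Return raw text lowercased, with whitespace runs collapsed.

    Parameters
    ----------
    raw : str
        Text exactly as it was downloaded or scraped.

    Returns
    -------
    str
        Cleaned text, ready for tokenization.
    """
    # Lowercase first so that later counts treat "Data" and "data" alike.
    text = raw.lower()
    # Collapse whitespace runs (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Drop the leading and trailing space left by the previous step.
    return text.strip()
```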

Creating the README File

When working on your TDM project, make sure to create a README.txt file that contains information about the provenance of the text corpora you are using, the files and their relationships, your methods for obtaining and processing the data, and any licenses and restrictions governing the data.
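
The items above can be turned into a starting template. The sketch below is one possible way to do so in Python: it writes placeholder sections and lists every file in the project folder so that each can be described by hand. The section headings, the placeholder wording, and the function names build_readme and write_readme are assumptions made for illustration, not a prescribed README format.

```python
from pathlib import Path

# Headings mirror the items listed above; the exact wording is illustrative.
SECTIONS = [
    "PROVENANCE OF THE TEXT CORPORA",
    "METHODS FOR OBTAINING AND PROCESSING THE DATA",
    "LICENSES AND RESTRICTIONS",
]


def build_readme(project_dir, title):
    """Return README text with placeholder sections and a file inventory."""
    project_dir = Path(project_dir)
    lines = [title, "=" * len(title), ""]
    for heading in SECTIONS:
        lines += [heading, "[describe here]", ""]
    lines.append("FILES AND THEIR RELATIONSHIPS")
    files = sorted(p for p in project_dir.rglob("*") if p.is_file())
    for path in files:
        rel = path.relative_to(project_dir).as_posix()
        if rel == "README.txt":
            continue  # the README does not need to describe itself
        lines.append(f"{rel}: [describe contents and related files]")
    return "\n".join(lines) + "\n"


def write_readme(project_dir, title):
    """Write README.txt into project_dir unless a non-empty one exists."""
    target = Path(project_dir) / "README.txt"
    if target.exists() and target.stat().st_size > 0:
        raise FileExistsError(f"{target} already has content; edit it by hand")
    target.write_text(build_readme(project_dir, title), encoding="utf-8")
    return target
```

The generated file is only a skeleton: the bracketed placeholders still need to be replaced with real descriptions of each file and of the data's provenance and terms of use.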

Where to Archive and Share the Data?

Most journals and funding agencies require that you share your research data. If you do not have permission to share the raw data, you may share the processed or annotated data that you have created, and attribute the original data source. We advise you to select an open certified digital repository or archive in which to share your data. 

At UCSB we support Dryad, but many other options that may be suitable to house and preserve your research data and its associated code and documentation are listed in the Registry of Research Data Repositories (Re3data.org). When in doubt, check the recommendations below for "Choosing a Data Repository" or contact us at tdm@library.ucsb.edu.

Choosing a Data Repository

See these tips and recommendations for choosing a data repository in which to store and preserve your research data.