Text Data Mining: Methods

Methods

TDM draws on a group of distinct yet related disciplines, including artificial intelligence and machine learning, data mining, statistics, library and information sciences, computational linguistics, and databases. According to Miner et al. (2012), the intersection of these fields produces seven practice areas within TDM:

  1. Search and information retrieval (IR): Storage and retrieval of text documents, including search engines and keyword search.
  2. Document clustering: Grouping and categorizing terms, snippets, paragraphs, or documents, using data mining clustering methods.
  3. Document classification: Grouping and categorizing snippets, paragraphs, or documents, using data mining classification methods, based on models trained on labeled examples.
  4. Web mining: Data and text mining on the Internet, with a specific focus on the scale and interconnectedness of the web.
  5. Information extraction (IE): Identification and extraction of relevant facts and relationships from unstructured text; the process of making structured data from unstructured and semi-structured text.
  6. Natural language processing (NLP): Low-level language processing and understanding tasks (e.g., tagging part of speech); often used synonymously with computational linguistics.
  7. Concept extraction: Grouping of words and phrases into semantically similar groups.

Each of these practices can be broken down into different techniques, including, but not limited to:

  • Word frequencies: Computing word frequencies is a basic building block of higher-level textual analysis algorithms. This method can include raw word counts, or calculating each word's share of all the words in a text or set of texts and comparing that share across texts or over time. Frequencies can also be counted for "n-grams," or phrases with a certain number (n) of words.
  • Topic modeling: A form of machine learning, topic modeling is a way of identifying patterns and themes in a body of text. It relies on statistical algorithms, such as Latent Dirichlet Allocation (LDA), which group words into "topics" based on which words frequently co-occur in a text.
  • Network analysis: A method for finding connections between nodes representing people, concepts, sources, and more. These networks are usually visualized as graphs that show how the nodes are interconnected.
  • Citation analysis: An approach to discovering connections and relationships between documents based on how they cite one another. The resulting citation links are often visualized.
  • Visualization: Text mining visualization can help researchers see relationships between certain concepts. Examples are word clouds, graphs, maps, and other graphics that produce a visual depiction of the data.
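
As a concrete illustration of the word frequency technique listed above, here is a minimal Python sketch that counts words, computes each word's share of the text, and counts two-word n-grams (bigrams). The sample sentence, the simple tokenizer, and the function names are assumptions made for this example; a real project would likely need more careful tokenization and might remove common stop words first.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The cat slept."

tokens = tokenize(sample)
word_counts = Counter(tokens)               # raw word counts
shares = relative_frequencies(tokens)       # each word's share of the text
bigram_counts = Counter(ngrams(tokens, 2))  # counts of two-word phrases

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Running the same functions over several texts, or over texts from different years, would give the comparison across texts or time described above.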

The method you choose will depend on your research question(s). When choosing a method to use, first consider what you expect to learn from your research and what form you would like your results to take. You can combine various methods in different ways during the course of your research project. For example, natural language processing algorithms might reveal the names of people in your text, to which you could apply network analysis to study how the actors are connected.
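
The combination described above can be sketched in a few lines of Python. This example assumes the names have already been extracted by a natural language processing step; the document labels and names below are hypothetical placeholders, not real data. It links two people whenever they appear in the same document, then counts how many distinct people each person is connected to.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical output of a name-extraction step: the people mentioned in each document.
people_by_document = {
    "letter_1": ["Ada", "Charles", "Mary"],
    "letter_2": ["Ada", "Charles"],
    "letter_3": ["Mary", "Augustus"],
}


def cooccurrence_edges(names_by_doc):
    """Count how many documents mention each pair of people together."""
    edges = Counter()
    for names in names_by_doc.values():
        # Sort the unique names so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


def degrees(edges):
    """Count the distinct people each person is connected to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {name: len(others) for name, others in neighbors.items()}


edges = cooccurrence_edges(people_by_document)
degree = degrees(edges)

for (a, b), weight in sorted(edges.items()):
    print(f"{a} and {b}: mentioned together in {weight} document(s)")
print(degree)
```

Appearing in the same document is only one possible definition of a connection. Depending on the research question, a link could instead require appearing in the same sentence or paragraph.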