API (Application Programming Interface): an interface that allows applications to talk to one another and can be used to facilitate accessing and downloading large amounts of data from a website.
API Wrapper: a library or tool that provides access to an API through a particular programming language or interface, streamlining the process of making API calls.
Association: associations measure how often a word co-occurs with other words. The more often words occur close to each other when compared to their general frequency, the higher their association will be (see Collocation).
Automatic term recognition: a term used in natural language processing to describe automatic detection of phrasal terms.
Bag of words (BoW): in the context of text data mining, words are treated as single tokens. BoW is a particular representation model used to simplify the contents of a selection of text. The bag of words model omits grammar and word order but is interested in the number of occurrences of words within the text.
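As an illustration of the bag of words idea, the following minimal Python sketch counts word occurrences while discarding order. The function name `bag_of_words` is hypothetical, and lowercasing plus whitespace splitting is a simplifying assumption rather than a standard tokenizer.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count how often each word occurs, ignoring grammar and word order."""
    # Assumption: lowercase the text and split on whitespace only.
    return Counter(text.lower().split())

bow = bag_of_words("the cat sat on the mat")
print(bow["the"])  # 2

# Word order does not matter: both texts yield the same bag.
print(bag_of_words("cat the sat") == bag_of_words("the cat sat"))  # True
```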
Deep learning: a subcategory of machine learning in which most models are based on artificial neural networks.
Categorization: text categorization is the assignment of labels, typically from a pre-defined set, to a text document. This assignment can be done based on hand-coding or machine learning.
Classification: objects are assigned to pre-defined classes based on similarity. Similar objects are assigned to the same class. The function defining similarity is given by examples for the assignment. These are objects which have been assigned to a class before. The algorithm needs to learn a function that reflects the class definition as determined by the learning examples.
Clustering: a process that groups objects based on similarities. Each cluster contains objects which are more similar to each other than objects in other clusters.
Collocation: a series of words or terms that co-occur more often than would be expected by chance.
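One common way to quantify whether two adjacent words co-occur more often than chance would predict is pointwise mutual information (PMI). The sketch below is one possible measure, not the only one; the function name `pmi` and the sample sentence are made up for illustration.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams[(w1, w2)] / (len(tokens) - 1)
    if p_pair == 0:
        # The pair never occurs side by side in this text.
        return float("-inf")
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    # Positive values mean the pair occurs more often than its
    # general word frequencies alone would suggest.
    return math.log2(p_pair / (p_w1 * p_w2))

tokens = "new york is a big city and new york is busy".split()
print(pmi(tokens, "new", "york") > 0)  # True
```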
Concepts: Meaning is defined beyond a word. A concept is a semantic entity that can be expressed by several words or a group of words.
Concordance: In text mining, concordance tools are used to view words or phrases in context.
Corpus: refers to a collection of written texts, particularly the entire body of work on a subject or by a specific creator; a collection of written or spoken material in machine-readable form, assembled to study linguistic structures, frequencies, etc. Such collections may consist of texts in a single language or can span multiple languages. There are numerous reasons for which multilingual corpora (the plural of corpus) may be helpful. Corpora may also consist of themed texts (historical, Biblical, etc.). Corpora are generally used for statistical linguistic analysis and hypothesis testing.
Information Retrieval (IR): the process of accessing and retrieving the most appropriate information from text based on a particular query, using context-based indexing or metadata. IR is concerned with the representation of knowledge and the subsequent search for relevant information within these knowledge sources. It provides the technology behind search engines.
Lemmatization: related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma. In other words, it is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.
Lexical analysis: the first phase of a compiler, which converts a sequence of characters into a sequence of tokens. The lexical analyzer breaks the source text into a series of tokens and removes any extra whitespace or comments written in the source code.
Machine learning: a subfield of artificial intelligence that uses statistical techniques to give computers the ability to learn from data without being explicitly programmed to do so; it can be supervised or unsupervised.
Metadata: Data describing other data. Metadata provides information about one or more aspects of data, such as type, date, creator, location, and so on. Most often encountered in library and archival contexts, metadata facilitates the organization, discovery, and use of a wide range of resources.
N-gram: In linguistics, represents a sequence of n items from a given sequence of text or speech. N-grams can be any combination of letters, phonemes, syllables, or words. In text mining, sequences of words are generated based on how many words, or n, are specified by the user. As opposed to the orderless representation of a bag of words, n-gram modeling is interested in preserving contiguous sequences of n items from the text selection.
Named-entity recognition (NER) (entity extraction): seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
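The following short Python sketch shows how contiguous word n-grams could be generated from a list of tokens. The function name `ngrams` is hypothetical, and the input is assumed to be already tokenized.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n items, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "to be or not".split()
print(ngrams(words, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not')]
```

Unlike a bag of words, the output keeps track of which words were adjacent in the original text.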
Natural Language Processing (NLP): concerns the interaction between natural human languages and computing devices. NLP is a major aspect of computational linguistics, and also falls within the realms of computer science and artificial intelligence.
Normalization: before further processing, the text needs to be normalized. Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing and allows processing to proceed uniformly.
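A few of the normalization tasks listed above can be sketched in Python as follows. The contraction table is a tiny hypothetical example, not a complete resource, and the function name `normalize` is chosen only for illustration.

```python
import string

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def normalize(text: str) -> str:
    """Lowercase, expand known contractions, and strip ASCII punctuation."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any runs of whitespace left behind.
    return " ".join(text.split())

print(normalize("Don't PANIC, it's fine!"))  # do not panic it is fine
```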
Optical Character Recognition (OCR): Use of computer technologies to convert scanned images of typewritten, printed, or handwritten text into machine-readable text. This conversion allows for the computerization of material texts into formats for digital storage, search, and display. Adobe Acrobat Professional supports OCR processes, as does Microsoft Office for Windows. OCR accuracy depends on the font and style of the original document.
Part-of-speech (POS) tagging: the process of marking up the words in a text based on their relationship to adjacent and related words in a phrase, sentence, or paragraph. It consists of assigning a category tag to the tokenized parts of a sentence. The most common form of POS tagging identifies words as nouns, verbs, adjectives, etc.
Precision: in search strategy development, refers to the number of included studies retrieved (true positives) divided by the total number of studies retrieved (sum of true positives and false positives); also known as the positive predictive value in diagnostic testing.
Qualitative data analysis software: allows you to manually code and annotate documents (e.g., QDA Miner, NVivo, Atlas.ti)
Regular Expressions: often abbreviated regex or regexp, refer to the tried and true method of concisely describing patterns of text. A regular expression is represented as a special text string itself and is meant for developing search patterns on selections of text. Regular expressions can be thought of as an expanded set of rules beyond the wildcard characters of “?” and “*” and constitute powerful text searching tools.
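As a small illustration, the Python sketch below uses the standard `re` module to find every word beginning with "hospital", regardless of case. The sample sentence is invented for the example.

```python
import re

# \b anchors the match at a word boundary; \w* allows any word ending.
pattern = re.compile(r"\bhospital\w*", re.IGNORECASE)

text = "Hospitalized patients left the hospital; two hospitals reported."
print(pattern.findall(text))
# ['Hospitalized', 'hospital', 'hospitals']
```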
Representational State Transfer (REST): REST is a software architectural style that defines a set of constraints to be used for creating Web services. Web services that conform to the REST architectural style, termed RESTful Web services (RWS), provide interoperability between computer systems on the internet.
Semantic Analysis: also known as meaning generation, semantic analysis is interested in determining the meaning of text selections (either character or word sequences). After an input selection of text is read and parsed (analyzed syntactically), the text selection can then be interpreted for meaning. Simply put, syntactic analysis is concerned with what words a text selection was made up of, while semantic analysis wants to know what the collection of words actually means. The topic of semantic analysis is both broad and deep, with a wide variety of tools and techniques at the researcher's disposal.
Sensitivity: called recall in computer science and information studies, but sensitivity is the preferred term in medical librarianship given its use in medical diagnostics. It refers to the number of relevant cases/records retrieved by a search engine or diagnostic tool (true positives) divided by the total number of relevant cases/records (the sum of true positives and false negatives). It requires a denominator, which in search filter development is often established by hand-searching a set of results from a pre-selected list of journals and years and then classifying them manually as relevant or non-relevant. Denominators may also be referred to as the quasi-gold standard or gold standard.
Sentiment Analysis: the process of evaluating and determining the sentiment captured in a selection of text, with sentiment defined as feeling or emotion. This sentiment can be simply positive (happy), negative (sad or angry), or neutral, or can be some more precise measurement along a scale, with neutral in the middle, and positive and negative increasing in either direction. The subjective parts of a text are identified by text mining methods and separated from the objective parts.
Specificity: in search filter development and diagnostics, refers to the percentage of true negatives (true negatives divided by the sum of true negatives and false positives); the more false positives, the worse both specificity and precision are, although the two measures are calculated differently.
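The definitions of precision, sensitivity, and specificity given in this glossary can be written out as a short Python sketch. The counts used below are hypothetical and serve only to show the arithmetic.

```python
def precision(tp: int, fp: int) -> float:
    """True positives divided by everything retrieved."""
    return tp / (tp + fp)

def sensitivity(tp: int, fn: int) -> float:
    """True positives divided by everything relevant (also called recall)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negatives divided by everything non-relevant."""
    return tn / (tn + fp)

# Hypothetical search result: 80 relevant records retrieved, 20 irrelevant
# records retrieved, 20 relevant records missed, 880 irrelevant records
# correctly left out.
tp, fp, fn, tn = 80, 20, 20, 880
print(round(precision(tp, fp), 3))    # 0.8
print(round(sensitivity(tp, fn), 3))  # 0.8
print(round(specificity(tn, fp), 3))  # 0.978
```

Note that false positives appear in the denominators of both precision and specificity, which is why more false positives lower both measures even though they are computed differently.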
Statistical Language Modeling: the process of building a statistical language model which is meant to provide an estimate of a natural language. For a sequence of input words, the model would assign a probability to the entire sequence, which contributes to the estimated likelihood of various possible sequences. This can be especially useful for NLP applications that generate text.
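A very small bigram model gives a sense of how a probability can be assigned to a word sequence. This is a sketch under strong assumptions: it uses unsmoothed maximum-likelihood estimates, it scores a sequence conditional on its first word, and the function names and training text are invented for illustration.

```python
from collections import Counter

def train_bigram(tokens):
    """Count how often each word appears as a context and each word pair occurs."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sequence_probability(seq, contexts, bigrams):
    """Probability of seq[1:] given seq[0], as a product of bigram estimates."""
    prob = 1.0
    for prev, word in zip(seq, seq[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context; no smoothing in this sketch
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

contexts, bigrams = train_bigram("the cat sat the cat ran".split())
print(sequence_probability(["the", "cat", "sat"], contexts, bigrams))  # 0.5
print(sequence_probability(["the", "dog"], contexts, bigrams))         # 0.0
```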
Stemming: refers to the mapping of word forms to stems or basic word forms. This technique is used to reduce words to their root form by removing their endings (e.g., searching for hospital* to retrieve records containing the words hospital, hospitalized, hospitalised, hospitals, etc.). It eliminates affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.
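The idea of stripping endings can be shown with a deliberately naive Python sketch. This is not an established stemming algorithm such as Porter's; the suffix list and the function name `naive_stem` are assumptions made for illustration.

```python
# Hypothetical suffix list; longer suffixes are checked first.
SUFFIXES = ("ized", "ised", "ing", "ed", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["hospital", "hospitalized", "hospitalised", "hospitals"]:
    print(naive_stem(w))  # prints "hospital" each time
```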
Stop Words: stop words are those words that are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content.
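Filtering stop words can be sketched in a few lines of Python. The stop list below contains only the three words mentioned in the definition above; lists used in practice are considerably longer, and the function name is hypothetical.

```python
# Assumption: a three-word stop list taken from the definition above.
STOP_WORDS = {"the", "and", "a"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list, ignoring case."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat and a dog".split()))  # ['cat', 'dog']
```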
Support vector machine (SVM): Machine learning algorithm used for classification tasks.
Syntactic Analysis: also known as parsing, refers to the task of analyzing strings as symbols and ensuring their conformance to an established set of grammatical rules. This step must, out of necessity, come before any further analysis (semantic, sentiment, etc.) that attempts to extract insight from the text by treating it as something beyond symbols.
Text mining: Sometimes used interchangeably with text analytics; involves the process of analyzing unstructured text data to identify actionable insights. It often involves data preparation/preprocessing/cleaning to differing degrees depending on the tool being used.
Textual mining software: Supports quantitative analysis of unstructured text; generally supports preprocessing and data cleaning (e.g., quanteda, tm package in R, WordStat)
Term Frequency-Inverse Document Frequency (TF-IDF): Numerical statistic that is intended to reflect how important a word is to a document in a corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
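One common formulation of this statistic multiplies a word's relative frequency in a document by the logarithm of the inverse fraction of documents containing it. The Python sketch below uses that variant; other weightings exist, and the function name and toy corpus are assumptions for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of term in doc, where doc and each corpus entry are token lists."""
    if not doc:
        return 0.0
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
print(tf_idf("the", corpus[0], corpus))  # 0.0, since "the" is in every document
print(tf_idf("sat", corpus[0], corpus) > tf_idf("cat", corpus[0], corpus))  # True
```

A word that appears in every document scores zero, while a word unique to one document scores highest, which matches the offsetting behavior described above.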
Tokenization: generally, an early step in the NLP process, a step that splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized.
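The two levels of tokenization described above, text into sentences and sentences into words, can be sketched with Python's standard `re` module. The patterns are simplifying assumptions: they will mis-split abbreviations such as "e.g." and are not a substitute for a dedicated tokenizer.

```python
import re

def sentence_tokenize(text: str):
    """Split after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence: str):
    """Extract runs of word characters, keeping simple contractions together."""
    return re.findall(r"\w+(?:'\w+)?", sentence)

print(sentence_tokenize("It works. Does it? Yes!"))
# ['It works.', 'Does it?', 'Yes!']
print(word_tokenize("Don't stop."))
# ["Don't", 'stop']
```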
This glossary provides a fairly thorough list of terms associated with text data mining (TDM). If you identify a term that is not on the list and would like it to be added, please contact us to make a suggestion.