By Bella Ratmelia, Librarian, Data Services
For people who are unfamiliar with it, the mere mention of “text and data mining (TDM)” can be intimidating. On top of that, many of the “what is TDM?” resources and guides out there are quite technical in nature, which may add to the “intimidating” factor. In this article, I will give a brief, novice-friendly overview of TDM, including the common tasks done in academic studies that utilise TDM techniques.
Text and data mining (TDM), sometimes referred to as “text mining”, is the process of finding new or valuable insights in unstructured, and often large, amounts of text.
These texts are typically retrieved from electronic sources or documents, either via web scraping or through an API. Getting the raw text is only the first step: it needs to be cleaned up (commonly referred to as pre-processing) before it can be analysed using Natural Language Processing (NLP) techniques.
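As a rough illustration of the retrieval step, here is a minimal sketch using Python’s requests library. The endpoint URL, query parameters, and response shape are hypothetical placeholders; a real content provider’s API will document its own, along with any authentication it requires.

```python
import requests

# Hypothetical endpoint and parameters; substitute the provider's real API.
response = requests.get(
    "https://api.example.org/v1/articles",
    params={"query": "text mining", "limit": 5},
    timeout=30,
)
response.raise_for_status()

# Hypothetical response shape: a JSON object with a list of articles,
# each carrying a "text" field of raw text to be pre-processed.
for article in response.json()["results"]:
    print(article["text"][:100])
```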
Pre-processing
Pre-processing is the work of cleaning up the text before it can be used for analysis. Common pre-processing steps include:
- Expanding contractions. In this step, words such as “don’t” or “I’m” are expanded to “do not” and “I am”.
- Tokenisation. Tokenisation is simply breaking sentences down into individual words or “tokens”. So, the sentence “good morning” will be broken down into the tokens “good” and “morning”.
- Converting to lowercase. This step ensures all words/tokens are in lowercase, as some analysis tools/models are case-sensitive.
- Removing stop words and punctuation. Stop words are a list of words to be removed from the text, typically consisting of common function words such as “an”, “the”, “of”, and so on. There is no universal or standardised list of stop words, and researchers often customise the stop words used in their project. Punctuation and numeric characters are removed during this step as well.
- Lemmatising or stemming. Lemmatising transforms a word into its base form (the “lemma”, or dictionary form) by taking into consideration the context of its use in the sentence. For example, “studies” will be lemmatised into “study”. Stemming, by contrast, simply reduces a word to its root (the “stem”) by trimming off its suffixes, so “studies” will be stemmed into “studi” with the “es” suffix trimmed away. Whether to choose lemmatising or stemming depends on the objective of your study and other factors that, for brevity, will not be covered in this article.
While there are tools that help with various aspects of text analysis, or even NLP, without coding, such as Voyant and AntConc, their capabilities are limited compared to coding libraries. Common NLP toolkits such as the Python libraries Natural Language ToolKit (NLTK) and spaCy are equipped with features that make the entire pre-processing endeavour much easier for researchers; the sketch below shows what this can look like. Once pre-processing is done, the texts are ready to be analysed via NLP techniques.
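Here is a minimal pre-processing sketch using NLTK, covering tokenisation, lowercasing, stop word and punctuation removal, and lemmatisation. Contraction expansion is omitted, as it is usually handled by a separate helper or package, and the example sentence is made up for illustration.

```python
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# The NLTK data packages are needed once, e.g. nltk.download("punkt"),
# nltk.download("stopwords"), and nltk.download("wordnet").

text = "The students do not enjoy studying for their examinations."

# Tokenise the sentence into individual words/tokens.
tokens = word_tokenize(text)

# Convert every token to lowercase.
tokens = [t.lower() for t in tokens]

# Remove stop words and punctuation.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]

# Lemmatise each remaining token into its dictionary form.
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

print(tokens)  # ['student', 'enjoy', 'studying', 'examination']
```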
NLP Techniques
Some of the NLP techniques more commonly used in academic research include (but are not limited to) the following; short code sketches illustrating each one follow the list:
- Part-of-speech tagging is the process of tagging words with their role/context in a sentence, e.g. as a noun, adjective, adverb, punctuation, etc. Closely related is Named-entity recognition (NER), the process of tagging entities such as organisations, countries, person names, cities, etc. that are mentioned in the text.
- Sentiment analysis is used to identify the sentiment/emotional tone in a given text. The extracted sentiment can then be classified into categories such as positive, neutral, or negative. Examples of sentiment analysis can be found in studies from InK, SMU’s institutional repository.
- Text classification is used to organise textual data into a pre-defined set of categories using a classification model/algorithm. Somewhat related is topic modelling, which is the task of discovering/extracting “topics” from a set of documents; applications of this technique can likewise be found on InK.
- Keyword extraction is the process of identifying the most relevant words or phrases in a text, to quickly get an overview of what the text is about.
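For part-of-speech tagging and NER, a minimal sketch with spaCy might look like the following. The example sentence is hypothetical, and the small English pipeline is assumed to be installed via python -m spacy download en_core_web_sm.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Singapore Management University opened a new research centre in 2019.")

# Part-of-speech tagging: each token is labelled with its grammatical role.
for token in doc:
    print(token.text, token.pos_)  # e.g. "opened VERB"

# Named-entity recognition: spans are labelled as organisations, dates, etc.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Singapore Management University ORG"
```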
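For sentiment analysis, NLTK ships with the VADER model, which scores a text without any training on your part. The example sentence is made up, and the lexicon data must be downloaded once via nltk.download("vader_lexicon").

```python
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The new library services are excellent!")
print(scores)  # a dict with 'neg', 'neu', 'pos', and 'compound' scores

# A common rule of thumb: compound >= 0.05 is positive,
# compound <= -0.05 is negative, and anything in between is neutral.
```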
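For topic modelling, here is a minimal sketch using scikit-learn’s LDA implementation. The four toy “documents” stand in for a real corpus, which would normally contain hundreds or thousands of texts.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stock market returns and investor sentiment",
    "market volatility affects stock prices",
    "students use library resources for research",
    "library research guides help students",
]

# Turn the documents into a document-term count matrix.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(docs)

# Fit an LDA model that looks for two topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Print the top words for each discovered topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:4]]
    print(f"Topic {i}: {top}")
```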
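And keyword extraction, in its simplest form, can be a frequency count over pre-processed tokens; more sophisticated approaches weight terms by how distinctive they are (e.g. TF-IDF). The token list below is a made-up stand-in for the output of a pre-processing pipeline like the NLTK sketch above.

```python
from collections import Counter

tokens = ["library", "data", "mining", "data", "text", "mining", "data"]
print(Counter(tokens).most_common(3))  # [('data', 3), ('mining', 2), ('library', 1)]
```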
Depending on the techniques used, some parts of pre-processing are not required. For example, researchers who would like to use part-of-speech tagging should skip stop word and punctuation removal, as those elements are necessary to give context to a sentence. It is also common to see several techniques used together in a single research project.
It is important to note that while the techniques are similar across fields, they often need to be customised to the domain of research. For example, the field of Finance and Accounting has jargon of its own, and if you apply a generic pre-trained sentiment analysis model that has not been customised, the results might not be accurate. Take the word “liability”, which carries negative sentiment in most dictionaries but is mostly a neutral term in Finance.
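One lightweight way to adapt a lexicon-based model like VADER is to override individual word scores. The sketch below neutralises “liability” for finance text; the 0.0 score is an illustrative assumption, not an established finance lexicon value.

```python
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sentence = "The firm's total liability decreased this quarter."
print(sia.polarity_scores(sentence))

# Treat "liability" as neutral for finance text; if VADER's lexicon
# scores it negatively, this override removes that effect.
sia.lexicon["liability"] = 0.0
print(sia.polarity_scores(sentence))
```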
Challenges
Just like any other research effort, TDM is not without its challenges. A common one is typically encountered during the information retrieval process, in which the data that researchers need is often locked behind restrictive licences (or expensive fees) that limit access for text mining. Fortunately, the new Singapore Copyright Act provides an exception for TDM, whereby researchers are allowed to conduct TDM on resources that they have legal access to. This law has been in effect since November 2021 (check out SMU OLGA’s guidelines).
Another common challenge is technical in nature. Not all content providers have an API that can be used to retrieve the required information, and some present their content in formats that are not TDM-friendly, such as PDF files.
When it comes to library resources, some of our databases do in fact have APIs that you can use. The requirements and details for each API are compiled in this API Research Guide.
In Summary (TL;DR version)
The activities in TDM research can be roughly summarised in three steps: retrieving the text data, cleaning it up for analysis (pre-processing), and then the actual analysis. There are quite a few steps involved in pre-processing, but toolkits such as NLTK or spaCy should be able to make that easier.
The common challenges are restrictive licensing terms imposed by content providers and a lack of technical infrastructure (i.e. APIs) to support TDM-friendly information retrieval. As for SMU Libraries, the API Research Guide contains information on which library resources have APIs that you can use.
As always, if you have any comments, questions, or feedback, please do not hesitate to contact me at bellar@smu.edu.sg.