Text mining framework and methods

There are many different methods to use in text mining. The goal here is to provide a basic framework to apply to such an endeavor. This framework does not include every possible method, but it covers those that are probably the most important for the vast majority of projects you will work on. Additionally, I will discuss the modeling methods in as succinct and clear a manner as possible, because they can get quite complicated. Gathering and compiling text data is a topic that could take up several chapters. One approach I prefer and will put forward here is the tidy framework. It allows us to use tibbles and data frames for most of the steps, and the tidytext functions make for an easy transition to other text mining structures, such as a corpus.

The first task is to put the text files into a data frame. With that created, the data preparation can begin with the text transformations.
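Here is a minimal sketch of one way to do this, assuming your documents live as plain-text files in a single directory (the data/texts path and the text_df name are hypothetical, for illustration only):

    library(tibble)
    library(readr)

    # Hypothetical directory of .txt files; adjust the path to your project
    files <- list.files("data/texts", pattern = "\\.txt$", full.names = TRUE)

    # One row per document: a document identifier and its full text
    text_df <- tibble(
      document = basename(files),
      text     = vapply(files, read_file, character(1))
    )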

The following list covers some of the most common and useful transformations for text files:

  • Change capital letters to lowercase
  • Remove numbers
  • Remove punctuation
  • Remove stop words
  • Remove excess whitespace characters
  • Word stemming
  • Word replacement

With these transformations, you are creating a more compact dataset and simplifying its structure in order to facilitate relationships between the words, thereby leading to increased understanding. However, keep in mind that not all of these transformations are necessary all the time; judgment must be applied, or you can iterate to find the transformations that make the most sense.
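Several of these transformations can be chained in one tidytext pipeline. Below is a sketch continuing with the hypothetical text_df tibble from above; note that unnest_tokens() lowercases and strips punctuation as it tokenizes, and excess whitespace disappears in the process:

    library(dplyr)
    library(tidytext)

    tidy_words <- text_df %>%
      unnest_tokens(word, text) %>%       # lowercases and drops punctuation
      filter(!grepl("[0-9]", word)) %>%   # remove tokens containing numbers
      anti_join(stop_words, by = "word")  # remove stop words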

By changing words to lowercase, you can prevent the improper counting of words. Say that you have hockey appearing three times and Hockey once, where it is the first word in a sentence. R will not give you a count of hockey=4, but rather hockey=3 and Hockey=1.
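A quick base R illustration of the counting problem and the fix:

    words <- c("hockey", "hockey", "hockey", "Hockey")
    table(words)           # Hockey = 1, hockey = 3
    table(tolower(words))  # hockey = 4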

Removing punctuation also achieves the same purpose, but in some cases, punctuation is important, especially if you want to tokenize your documents by sentences.
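If sentence boundaries matter to your analysis, tokenize by sentence before stripping punctuation. A sketch with tidytext:

    library(dplyr)
    library(tibble)
    library(tidytext)

    # token = "sentences" relies on punctuation to find boundaries
    sentences_df <- tibble(text = "Punctuation matters here. It marks sentence boundaries.") %>%
      unnest_tokens(sentence, text, token = "sentences")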

In removing stop words, you are getting rid of the common words that have no value; in fact, they are detrimental to the analysis, as their frequency masks important words. Examples of stop words are and, is, the, not, and to.
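tidytext ships a stop_words tibble built from several lexicons (SMART, snowball, and onix), and it is worth inspecting how aggressive each one is before applying it:

    library(dplyr)
    library(tidytext)

    count(stop_words, lexicon)  # size of each lexicon
    # Keep only one lexicon for a lighter touch
    smaller_list <- filter(stop_words, lexicon == "snowball")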

Removing whitespace makes data more compact by getting rid of things such as tabs, paragraph breaks, double-spacing, and so on.
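The str_squish() function from the stringr package handles this in one call, collapsing tabs, line breaks, and repeated spaces while trimming the ends:

    library(stringr)

    messy <- "too   much\twhitespace\n\nhere "
    str_squish(messy)  # "too much whitespace here"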

The stemming of words can get tricky and might add to your confusion, because it deletes word suffixes, creating the base word, or what is known as the radical. Consider, for example, family and families; R would count these as two separate words. By running a stemming algorithm, the stemmed word for both instances becomes famili. This prevents the incorrect count, but it can be odd to interpret and is not very visually appealing in a word cloud for presentation purposes. I personally am not a big fan of stemming, and the analysts I've worked with agree with that sentiment. In some cases, it may make sense to run your analysis with both stemmed and unstemmed words in order to see which one facilitates understanding.
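The family example, using the Porter stemmer from the SnowballC package (one common choice in R):

    library(SnowballC)

    wordStem(c("family", "families"))
    # [1] "famili" "famili"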

Probably the most optional of the transformations is word replacement. The goal of replacement is to combine words with a similar meaning, for example, management and leadership. You can also use it in lieu of stemming. I once examined the outcome of stemmed and unstemmed words and concluded that I could achieve a more meaningful result by replacing about a dozen words instead of stemming. Replacement can also be important when you have manual data entry and different operators input data differently. For example, tech support person one types in the system turbocharger, while tech support person two types in turbo charger half the time, and turbo-charger the other half. All three versions are the same, so applying a replacement function such as gsub() will solve the problem; grepl() can help you locate the variants first.
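A sketch of the turbocharger cleanup in base R; the single pattern matches all three variants:

    notes <- c("turbocharger", "turbo charger", "turbo-charger")

    grepl("turbo[ -]?charger", notes)                 # locate the variants
    gsub("turbo[ -]?charger", "turbocharger", notes)  # standardize them
    # [1] "turbocharger" "turbocharger" "turbocharger"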

With the transformations completed, one structure to create for topic modeling or classification is either a document-term matrix (DTM) or a term-document matrix (TDM). Either of these structures holds the word counts for each individual document. A DTM has the documents as rows and the words as columns, while in a TDM, the reverse is true. We will be using a DTM for our example.
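With tidy counts in hand, the cast_dtm() function from tidytext builds the DTM directly (it returns a tm DocumentTermMatrix, so the tm package must be installed). A sketch continuing from the hypothetical tidy_words object above:

    library(dplyr)
    library(tidytext)

    dtm <- tidy_words %>%
      count(document, word) %>%     # word counts per document
      cast_dtm(document, word, n)   # documents as rows, words as columns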
