Chapter 12. Text Mining

 

"I think it's much more interesting to live not knowing than to have answers which might be wrong."

 
 --Richard Feynman

The world is awash in textual data. If you search Google, Bing, or Yahoo for how much of the world's data is unstructured, that is, in a textual format, the estimates range from 80 to 90 percent. The real number doesn't matter. What does matter is that a large proportion of the data is in a text format. The implication is that anyone seeking to find insights in the data must develop the capability to process and analyze text.

When I first started out as a market researcher, I used to manually pore through page after page of moderator-led focus groups and interviews in the hope of capturing some qualitative insight, an Aha! moment if you will, and then haggle with fellow team members over whether they had the same insight or not. Then, you would always have that one individual on a project who would swoop in, listen to two interviews out of the 30 or 40 on the schedule, and alas, have their mind made up on what was really happening in the world. Contrast that with the techniques being used now, where an analyst can quickly distill the data into meaningful quantitative results, support the qualitative understanding, and maybe even sway the swooper.

Over the last several years, I've applied the techniques discussed here to mine physician-patient interactions, understand FDA fears on prescription drug advertising, and capture patient concerns in a rare cancer, to name just a few. Using R and the methods in this chapter, you too can extract the powerful information in the textual data.

Text mining framework and methods

There are many different methods to use in text mining. The goal here is to provide a basic framework to apply to such an endeavor. This framework is not all-inclusive of the possible methods, but it covers those that are probably the most important for the vast majority of projects you will work on. Additionally, I will discuss the modeling methods as succinctly and clearly as possible because they can become quite complicated. Gathering and compiling the text data is a topic that could take up several chapters. Therefore, let's begin with the assumption that the data is already available, whether from Twitter, a customer call center, or scraped off the web, and is contained in some sort of text file or files.

The first task is to compile the text files into one structured collection referred to as a corpus. The number of documents could be just one, dozens, hundreds, or even thousands. R can ingest a number of raw text formats, including RSS feeds, PDF files, and MS Word documents. With the corpus created, the data preparation can begin with the text transformations.
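As a minimal sketch of this step, assuming the tm package and a folder of plain-text files (the folder path here is a hypothetical placeholder), a corpus can be built as follows:

    library(tm)

    # read every plain-text file in the (hypothetical) folder into a corpus
    docs <- Corpus(DirSource("path/to/text/files"))

    length(docs)  # number of documents in the corpus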

The following list comprises some of the most common and useful transformations for text files:

  • Change capital letters to lowercase
  • Remove numbers
  • Remove punctuation
  • Remove stop words
  • Remove excess whitespace
  • Word stemming
  • Word replacement

In transforming the corpus, you are not only creating a more compact dataset but also simplifying the structure in order to facilitate relationships among the words, thereby leading to an increased understanding. However, keep in mind that not all of these transformations are necessary all the time; judgment must be applied, or you can iterate to find the transformations that make the most sense, as the sketch below illustrates.
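Here is a minimal sketch of these transformations with the tm package, assuming the corpus docs created earlier; each tm_map() call applies one transformation to every document in the corpus:

    library(tm)

    docs <- tm_map(docs, content_transformer(tolower))       # capital letters to lowercase
    docs <- tm_map(docs, removeNumbers)                      # remove numbers
    docs <- tm_map(docs, removePunctuation)                  # remove punctuation
    docs <- tm_map(docs, removeWords, stopwords("english"))  # remove stop words
    docs <- tm_map(docs, stripWhitespace)                    # remove excess whitespace
    docs <- tm_map(docs, stemDocument)                       # word stemming (Porter)

Note that tolower() is wrapped in content_transformer() because it is a base R function rather than one of tm's own transformations.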

By changing words to lowercase, you can prevent the improper counting of words. Say that you have a count of hockey three times and Hockey once, where it is the first word in a sentence. R will not give you a count of hockey = 4, but rather hockey = 3 and Hockey = 1.

Removing punctuation also achieves the same purpose, but as we will see in the business case, punctuation is important if you want to split your documents by sentences.

In removing stop words, you are getting rid of the common words that have no value; in fact, they are detrimental to the analysis as their frequency masks the important words. Examples of stop words are and, is, the, not, and to. Removing whitespace makes a more compact corpus by getting rid of things such as tabs, paragraph breaks, double-spacing, and so on.

The stemming of words can get a bit tricky and might add to your confusion because it deletes word suffixes, creating the base word, or what is known as the radical. We will use the stemming algorithm included in the R package tm, which calls the Porter stemming algorithm. An example of stemming would be where your corpus has family and families. Recall that R would count these as two separate words. By running the stemming algorithm, the stemmed word for both instances would become famili. This prevents the incorrect count, but in some cases it can be odd to interpret and is not very visually appealing in a wordcloud for presentation purposes. It may therefore be worthwhile to run your analysis with both stemmed and unstemmed words in order to see which produces the more sensible result.
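You can verify this behavior directly, since tm's stemDocument() also accepts a plain character vector (this assumes the SnowballC package, which tm uses for stemming, is installed):

    library(tm)

    stemDocument(c("family", "families"))
    # [1] "famili" "famili"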

Probably the most optional of the transformations is word replacement. The goal of replacement is to combine words with a similar meaning, for example, management and leadership. You can also use it in lieu of stemming. I once examined the outcomes of stemmed and unstemmed words and concluded that I could achieve a more meaningful result by replacing about a dozen words instead of stemming. We will see in the business case that you can also use replacement to delete unnecessary text and characters.
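tm has no dedicated replacement transformation, but a minimal sketch using content_transformer() with base R's gsub() does the job; the replace_word helper and the two words merged here are illustrative assumptions:

    library(tm)

    # a reusable transformer that swaps one pattern for another
    replace_word <- content_transformer(function(x, pattern, replacement) {
      gsub(pattern, replacement, x)
    })

    # combine words with a similar meaning
    docs <- tm_map(docs, replace_word, "management", "leadership")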

With the transformation of the corpus completed, the next step is to create either a Document-Term Matrix (DTM) or a Term-Document Matrix (TDM). Either matrix holds the counts of each word for each individual document. A DTM has the documents as rows and the words as columns, while in a TDM, the reverse is true. The text mining can be performed on either matrix.
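Assuming the transformed corpus docs from above, each matrix is a single function call in tm:

    library(tm)

    dtm <- DocumentTermMatrix(docs)  # documents as rows, words as columns
    tdm <- TermDocumentMatrix(docs)  # words as rows, documents as columns

    dim(dtm)  # number of documents by number of terms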

With a matrix built, you can begin to analyze the text by examining word counts and producing visualizations such as wordclouds. You can also find word associations by producing correlation lists for specific words. The matrix also serves as the necessary data structure for building topic models.
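As a minimal sketch of these first analyses, assuming the dtm built above (the term hockey and the frequency cutoffs are hypothetical choices):

    library(tm)
    library(wordcloud)

    findFreqTerms(dtm, lowfreq = 10)           # terms appearing at least 10 times
    findAssocs(dtm, "hockey", corlimit = 0.5)  # words correlated with "hockey"

    # frequency of each term across the corpus, then a wordcloud
    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    wordcloud(names(freq), freq, min.freq = 10)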
