Topic models

Topic models are a powerful method of grouping documents by their main topics. They allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords, using an additional layer of latent variables, which are referred to as topics (Grün and Hornik, 2011). In essence, a document is assigned to a topic based on the distribution of the words in that document, and the other documents in that topic will have roughly the same frequency of words.

The algorithm that we will focus on is Latent Dirichlet Allocation (LDA) with Gibbs sampling, which is probably the most commonly used sampling algorithm. In building topic models, the number of topics (k) must be determined before running the algorithm. If no a priori reason for the number of topics exists, then you can build several models and apply judgment and knowledge to the final selection. LDA with Gibbs sampling is quite complicated mathematically, but my intent is to provide an introduction so that you are at least able to describe how the algorithm learns to assign a document to a topic in layman's terms. If you are interested in mastering the math, block out a couple of hours on your calendar and have a go at it. Excellent background material is available at https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf.

LDA is a generative process and so the following will iterate to a steady state:

  1. For each document (j), with j = 1 to J documents, randomly assign it a multinomial distribution over the topics (k), with k = 1 to K topics, drawn from a Dirichlet distribution; for example, document A is 25 percent topic one, 25 percent topic two, and 50 percent topic three.
  2. Probabilistically assign each word (i), with i = 1 to I words, to a topic (k); for example, the word mean has a probability of 0.25 for the topic statistics.
  3. For each word (i) in document (j) and topic (k), calculate the proportion of words in that document assigned to that topic, noted as the probability of topic (k) given document (j), p(k|j), and the proportion of word (i) in topic (k) across all the documents containing that word, noted as the probability of word (i) given topic (k), p(i|k).
  4. Resample, that is, assign word (i) a new topic (k) based on the probability that topic (k) generated word (i), which is proportional to p(k|j) times p(i|k).
  5. Rinse and repeat; over numerous iterations, the algorithm converges and a document is assigned a topic based on the proportion of words assigned to that topic in that document.

The LDA that we will be doing assumes that the order of words and documents does not matter. There has been work done to relax these assumptions in order to build models of language generation and sequence models over time (known as dynamic topic modeling).
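To make this a bit more concrete, the following is a minimal sketch of fitting an LDA model with Gibbs sampling in R. It assumes the topicmodels package and a document-term matrix like the dtm object we will create later in this chapter; the object name lda_fit, the choice of k = 3, and the control settings are illustrative only, not a prescription:

> library(topicmodels)

> set.seed(123)

> # dtm is a DocumentTermMatrix; burnin/iter/thin control the Gibbs sampler
> lda_fit = LDA(dtm, k = 3, method = "Gibbs", control = list(burnin = 1000, iter = 2000, thin = 100))

> topics(lda_fit)    # the most likely topic for each document

> terms(lda_fit, 10) # the ten terms most associated with each topic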

Other quantitative analyses

We will now shift gears to analyze text semantically based on sentences and the tagging of words based on the parts of speech, such as noun, verb, pronoun, adjective, adverb, preposition, singular, plural, and so on. Often, just examining the frequency and latent topics in the text will suffice for your analysis. However, you may find occasions where a deeper understanding of the style is required in order to compare the speakers or writers.

There are many methods to accomplish this task, but we will focus on the following five:

  • Polarity (sentiment analysis)
  • Automated readability index (complexity)
  • Formality
  • Diversity
  • Dispersion

Polarity is often referred to as sentiment analysis, which tells you how positive or negative the text is. By analyzing polarity in R with the qdap package, a score will be assigned to each sentence, and you can analyze the average and standard deviation of polarity by groups such as different authors, texts, or topics. Different polarity dictionaries are available and qdap defaults to the one created by Hu and Liu, 2004. You can alter or change this dictionary according to your requirements.

The algorithm works by first tagging the words with a positive, negative, or neutral sentiment based on the dictionary. The tagged words are then clustered with the four words prior to and the two words after each tagged word, and these clusters are tagged with what are known as valence shifters (neutral, negator, amplifier, and de-amplifier). A series of weights based on their number and position are applied to both the words and the clusters. This is then summed and divided by the square root of the number of words in that sentence.
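As a rough sketch of what this looks like in practice, the following qdap call scores polarity by a grouping variable. The data frame speeches.df and its text and year columns are hypothetical placeholders, not objects created in this chapter:

> library(qdap)

> pol = polarity(speeches.df$text, speeches.df$year)

> pol        # average polarity and standard deviation by group

> plot(pol)  # plot the polarity across the sentences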

Automated readability index is a measure of text complexity and a reader's ability to understand it. A specific formula is used to calculate this index: 4.71(# of characters / # of words) + 0.5(# of words / # of sentences) - 21.43.

The index produces a number, which is a rough estimate of the grade level a student needs to fully comprehend the text. If the number is 9, then a high school freshman, aged 13 to 15, should be able to grasp the meaning of the text.
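The formula is simple enough to check by hand. Here is a quick illustration with made-up counts, followed by qdap's built-in function; as before, speeches.df is a hypothetical data frame used only for illustration:

> # hypothetical counts for a short passage
> chars = 2000; words = 450; sentences = 25

> 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
[1] 8.503333

> automated_readability_index(speeches.df$text, speeches.df$year)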

The formality measure provides an understanding of how a text relates to the reader or how speech relates to a listener. I like to think of it as a way to understand how comfortable the person producing the text is with the audience or an understanding of the setting where this communication takes place. If you want to experience formal text, attend a medical conference or read a legal document. Informal text is said to be contextual in nature.

The formality measure is called F-Measure. This measure is calculated as follows:

  • Formal words (f) are nouns, adjectives, prepositions, and articles
  • Contextual words (c) are pronouns, verbs, adverbs, and interjections
  • N = sum of (f + c + conjunctions)
  • Formality Index = 50 × ((sum of f - sum of c) / N + 1)
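A hedged sketch of how this could be computed with qdap follows. Note that formality() must tag parts of speech first, so it can be slow on long texts; speeches.df remains a hypothetical placeholder:

> form = formality(speeches.df$text, speeches.df$year)

> form        # F-Measure by group

> plot(form)  # formal versus contextual parts of speech by group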

This is totally irrelevant, but when I was in Iraq, one of the Army Generals—who shall remain nameless—I had to brief and write situation reports for was absolutely adamant that adverbs were not to be used, ever, or there would be wrath. The idea was that you can't quantify words such as highly or mostly because they mean different things to different people. Five years later, I still scour my business e-mails and PowerPoint presentations for unnecessary adverbs. Formality writ large!

Diversity, as it relates to text mining, refers to the number of different words used in relation to the total number of words used. This can also be thought of as the expanse of the text producer's vocabulary or lexical richness. The qdap package provides five—that's right, five—different measures of diversity: Simpson, Shannon, Collision, Berger-Parker, and Brillouin. I won't cover these five in detail but will only say that the algorithms are used not only in communication and information retrieval science, but also to measure biodiversity in nature.
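If you want to see all five measures at once, qdap's diversity() function returns them by group in a single data frame; once again, speeches.df is a hypothetical placeholder:

> div = diversity(speeches.df$text, speeches.df$year)

> div   # Simpson, Shannon, Collision, Berger-Parker, and Brillouin by group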

Finally, dispersion, or lexical dispersion, is a useful tool to understand how words are spread throughout a document and serves as an excellent way to explore the text and identify patterns. The analysis is conducted by calling the specific word or words of interest, which are then produced in a plot showing when the word or words occurred in the text over time. As we will see, the qdap package has a built-in plotting function to analyze the text dispersion.
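As a preview, a dispersion plot can be produced with a call similar to the one below; the terms chosen and the speeches.df data frame are illustrative only:

> dispersion_plot(speeches.df$text, c("economy", "jobs", "war"), grouping.var = speeches.df$year)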

We covered a framework for text mining: how to prepare the text, count words, and create topic models, and finally, we dived deeper into other lexical measures. Now, let's apply all of this and do some real-world text mining.

Business understanding

For this case study, we will take a look at President Obama's State of the Union speeches. I have no agenda here; I'm just curious as to what can be uncovered and, in particular, if and how his message has changed over time. Perhaps this will serve as a blueprint to analyze any politician's speech in order to prepare an opposing candidate for a debate or a speech of their own. If not, so be it.

The two main analytical goals are to build topic models on the six State of the Union speeches and then compare the first speech in 2010 with the most recent speech in January 2015 on sentence-based textual measures, such as sentiment and dispersion.

Data understanding and preparation

The primary package that we will use is tm, the text mining package. We will also need SnowballC for the stemming of the words, RColorBrewer for the color palettes in wordclouds, and the wordcloud package. Please ensure that you have these packages installed before attempting to load them:

> library(tm)

> library(SnowballC)

> library(wordcloud)

> library(RColorBrewer)

To bring the data into R, we could scrape the www.whitehouse.gov website. I dismissed this idea as this chapter would turn into a web scraping exposition and not one on text mining. So, I've pasted and stored the necessary data on the free website, www.textuploader.com. Each year's speech has a separate URL and we will only need to reference them to acquire the data. Two functions will accomplish this for us: scan(), which reads in the data, and paste(), which concatenates it properly:

> sou2010 = paste(scan(url("http://textuploader.com/a5vq4/raw"), what="character"),collapse=" ")
Read 7415 items

This is the 2010 speech. Now, one issue that you need to deal with when using text data in R is that it should be in the ASCII format. If not (the 2010 speech is not), then you must convert it to ASCII. The text that we pulled in previously is filled with numerous non-ASCII characters that would take many lines of code to try and delete or replace with the gsub() function. However, let's deal with this problem in one line of code, putting the iconv() function to good use. Remember that if you pull text into R and see a number of funky characters, check whether you need to convert it:

> sou2010=iconv(sou2010, "latin1", "ASCII", "")

We can pull up the entire speech by making a call to sou2010, but I'll just present the first few and last few sentences:

> sou2010
[1] "THE PRESIDENT: Madam Speaker, Vice President Biden,  members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to time, the President shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They've done so during periods of prosperity and tranquility. And they've done so in the midst of war and depression; at moments of great strife and great struggle.………………………………………………………………………………………………………………
Let's seize this moment -- to start anew, to carry the dream forward, and to strengthen our union once more. (Applause.) Thank you. God bless you. And God bless the United States of America. (Applause.)

Let's bring in the other five speeches:

> sou2011 = paste(scan(url("http://textuploader.com/a5vm0/raw"), what="character"),collapse=" ")
Read 7017 items

> sou2011=iconv(sou2011, "latin1", "ASCII", "")

> sou2012 = paste(scan(url("http://textuploader.com/a5vmp/raw"), what="character"),collapse=" ")
Read 7132 items

> sou2012=iconv(sou2012, "latin1", "ASCII", "")

> sou2013 = paste(scan(url("http://textuploader.com/a5vh0/raw"), what="character"),collapse=" ")
Read 6908 items

> sou2013=iconv(sou2013, "latin1", "ASCII", "")

> sou2014 = paste(scan(url("http://textuploader.com/a5vhp/raw"), what="character"),collapse=" ")
Read 6829 items

> sou2014=iconv(sou2014, "latin1", "ASCII", "")

> sou2015 = paste(scan(url("http://textuploader.com/a5vhb/raw"), what="character"),collapse=" ")
Read 6849 items

> sou2015=iconv(sou2015, "latin1", "ASCII", "")

We should put these in a folder that will hold the documents that will form the corpus. If you don't know the current working directory, you can pull it up with getwd() and change it with setwd(). Do not put your text files with any other files; create a new folder just for the speeches, otherwise your corpus will contain the R code or some other file, or it will blow up when you try to create it:

> getwd()
[1] "C:/Users/clesmeister/chap12/textmine"

> write.table(sou2010, "c:/Users/clesmeister/chap12/text/sou2010.txt")

> write.table(sou2011, "c:/Users/clesmeister/chap12/text/sou2011.txt")

> write.table(sou2012, "c:/Users/clesmeister/chap12/text/sou2012.txt")

> write.table(sou2013, "c:/Users/clesmeister/chap12/text/sou2013.txt")

> write.table(sou2014, "c:/Users/clesmeister/chap12/text/sou2014.txt")

> write.table(sou2015, "c:/Users/clesmeister/chap12/text/sou2015.txt")

We can now begin to create the corpus by first creating an object with the path to the speeches and then seeing how many files are in this directory and what they are named:

> name = file.path("C:/Users/clesmeister/chap12/text")

> length(dir(name))
[1] 6

> dir(name)
[1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt"
[5] "sou2014.txt" "sou2015.txt"

We will call the corpus docs and it is created with the Corpus() function, wrapped around the DirSource() function, which is also part of the tm package:

> docs = Corpus(DirSource(name))
> docs
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

Note that there is no corpus-level or document-level metadata in this data. There are functions in the tm package to apply attributes such as authors' names and timestamp information at both the document and corpus levels. We will not utilize this for our purposes.

We can now begin the text transformations using the tm_map() function from the tm package. These will be the transformations that we discussed previously—lowercase letters, remove numbers, remove punctuation, remove stop words, strip out the whitespace, and stem the words:

> docs = tm_map(docs, tolower)

> docs = tm_map(docs, removeNumbers)

> docs = tm_map(docs, removePunctuation)

> docs = tm_map(docs, removeWords, stopwords("english"))

> docs = tm_map(docs, stripWhitespace)

> docs = tm_map(docs, stemDocument)

At this point, it is a good idea to eliminate the unnecessary words. For example, during the speeches, when Congress applauds a statement, you will find (Applause) in the text. This must go away. Keep in mind that we stemmed the documents and so we need to get rid of applaus:

> docs = tm_map(docs, removeWords, c("applaus", "can", "cant", "will", "that", "weve", "dont", "wont"))

After completing the transformations and the removal of the other words, make sure that your documents are plain text, put them in a document-term matrix, and check the dimensions:

> docs = tm_map(docs, PlainTextDocument)

> dtm = DocumentTermMatrix(docs)

> dim(dtm)
[1]    6 3080

The six speeches contain 3,080 distinct terms. It is optional, but one can remove the sparse terms with the removeSparseTerms() function. You will need to specify a number between zero and one, where the higher the number, the higher the percentage of sparsity allowed in the matrix. So, with six documents, by specifying 0.51 as the sparsity number, the resulting matrix will only have terms that occurred in at least three documents, as follows:

> dtm = removeSparseTerms(dtm, 0.51)

> dim(dtm)
[1]    6 1132

As we don't have the metadata on the documents, it is important to name the rows of the matrix so that we know which document is which:

> rownames(dtm) = c("2010","2011","2012","2013","2014","2015")

Using the inspect() function, you can examine the matrix. Here, we will look at all six rows and the first five columns:

> inspect(dtm[1:6, 1:5])
      Terms
Docs   abl abroad absolut abus accept
  2010   1      2       2    1      1
  2011   4      3       0    0      0
  2012   3      1       1    1      0
  2013   3      2       1    0      1
  2014   1      4       0    0      0
  2015   1      1       0    2      1

It appears that our data is ready for analysis, starting with looking at the word frequency counts.
