Modeling and evaluation

Modeling will be broken into two distinct parts. The first will focus on word frequency and correlation and culminate in the building of a topic model. In the second, we will examine a number of quantitative techniques offered by the qdap package in order to compare two different speeches.

Word frequency and topic models

As we have everything set up in the document-term matrix, we can move on to exploring word frequencies by creating an object with the column sums, sorted in descending order. Note that as.matrix() is needed in the code to sum the columns. The default sort order is ascending, so putting - in front of freq will change it to descending:

> freq = colSums(as.matrix(dtm))

> ord = order(-freq)

We will examine the head and tail of the object with the following code:

> freq[head(ord)]
american     year      job     work  america      new 
     243      241      212      195      187      177 

> freq[tail(ord)]
      voic     welcom worldclass    yearold      yemen 
         3          3          3          3          3 
     youll 
         3

The most frequent word is american, as you might expect from the President, but notice how prominent employment is, with job and work near the top. You can also see how stemming changed voice to voic and welcome/welcoming/welcomed to welcom.
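If the truncated stems bother you for reporting purposes, tm's stemCompletion() function can map stems back to their most prevalent completion in a dictionary. A minimal sketch, assuming the unstemmed corpus built earlier in the chapter is still available (called corpus_raw here purely for illustration):

> # map a couple of stems back to full words, using the unstemmed corpus as the dictionary
> stemCompletion(c("voic", "welcom"), dictionary = corpus_raw)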

To look at the distribution of the word frequencies, you can create tables, as follows:

> head(table(freq))
freq
  3   4   5   6   7   8 
127 118 112  75  65  50 

> tail(table(freq))
freq
177 187 195 212 241 243 
  1   1   1   1   1   1

What these tables show is the number of words with that specific frequency, so 127 words occurred three times and one word, american in our case, occurred 243 times.

Using findFreqTerms(), we can see which words occurred at least 100 times. It looks like he talked quite a bit about business, and it is clear that the government, including the IRS, is here to "help", perhaps even help "now". That is a relief!

> findFreqTerms(dtm, 100)
 [1] "america"  "american" "busi"     "countri"  "everi"   
 [6] "get"      "help"     "job"      "let"      "like"    
[11] "make"     "need"     "new"      "now"      "one"     
[16] "peopl"    "right"    "time"     "work"     "year"

You can find associations with words by correlation using the findAssocs() function. Let's look at busi and job as two examples, using 0.9 as the correlation cutoff:

> findAssocs(dtm, "busi", corlimit=0.9)
$busi
 drop eager  hear  fund   add  main track 
 0.98  0.98  0.92  0.91  0.90  0.90  0.90 

> findAssocs(dtm, "job", corlimit=0.9)
$job
    hightech          lay      announc        natur 
        0.94         0.94         0.93         0.93 
         aid alloftheabov        burma      cleaner 
        0.92         0.92         0.92         0.92 
        ford       gather       involv         poor 
        0.92         0.92         0.92         0.92 
    redesign        skill        yemen        sourc 
        0.92         0.92         0.92         0.91

Busi needs further exploration, but job is interesting for its focus on high-tech jobs. It is curious that burma and yemen show up; I guess we still have a job to do in these countries, certainly in yemen.
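One way to explore busi further is simply to relax the correlation cutoff and see what other terms surface:

> # loosen the cutoff from 0.9 to 0.8
> findAssocs(dtm, "busi", corlimit = 0.8)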

For a visual portrayal, we can produce wordclouds and a bar chart. We will do two wordclouds to show the different ways of producing them: one with a minimum frequency and the other by specifying the maximum number of words to include. The first one, with a minimum frequency, also includes code to specify the colors. The scale syntax determines the minimum and maximum word size by frequency; in this case, the minimum frequency is 50:

> wordcloud(names(freq), freq, min.freq=50, scale=c(3, .5), colors=brewer.pal(6, "Dark2"))

The output of the preceding command is as follows:

[Image: wordcloud of the words appearing at least 50 times]

One can forgo all the fancy graphics, as we will in the following image, which captures the 30 most frequent words:

> wordcloud(names(freq), freq, max.words=30)

The output of the preceding command is as follows:

[Image: wordcloud limited to the 30 most frequent words]

To produce a bar chart, the code can get a bit complicated, whether you use base R, ggplot2, or lattice. The following code shows how to produce a bar chart for the 10 most frequent words in base R (a ggplot2 alternative is sketched after the image):

> freq = sort(colSums(as.matrix(dtm)), decreasing=TRUE)

> wf = data.frame(word=names(freq), freq=freq)

> wf = wf[1:10,]

> barplot(wf$freq, names=wf$word, main="Word Frequency", xlab="Words", ylab="Counts", ylim=c(0,250))

The output of the preceding command is as follows:

[Image: bar chart of the 10 most frequent words]
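As mentioned, ggplot2 is an alternative. Here is a minimal sketch of the same top-ten bar chart, reusing the wf data frame built above; reorder() keeps the bars in descending frequency order:

> library(ggplot2)

> ggplot(wf, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Word Frequency", x = "Words", y = "Counts")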

We will now move on to building topic models with the topicmodels package, which offers the LDA() function. The question now is how many topics to create. It seems logical to solve for three or four, so we will try both, starting with three topics (k=3):

> library(topicmodels)

> set.seed(123)

> lda3 = LDA(dtm, k=3, method="Gibbs")

> topics(lda3)
2010 2011 2012 2013 2014 2015 
   3    3    1    1    2    2

We can see that the topics group the speeches into consecutive two-year pairs.
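If you would like something a little more principled than eyeballing three versus four topics, one rough heuristic is to compare perplexity across candidate values of k. The sketch below refits the models with the default VEM estimation, since perplexity() can be called directly on a VEM fit; with only six documents, treat this as a sanity check rather than a decision rule:

> # fit models for k = 2 through 6 and compare in-sample perplexity (lower is better)
> sapply(2:6, function(k) perplexity(LDA(dtm, k = k, control = list(seed = 123))))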

Now we will try four topics (k=4):

> set.seed(456)

> lda4 = LDA(dtm, k=4, method="Gibbs")

> topics(lda4)
2010 2011 2012 2013 2014 2015 
   4    4    3    2    1    1

Here, the topic groupings are similar to the preceding ones, except that the 2012 and 2013 speeches now have their own topics. For simplicity, let's have a look at three topics for the speeches. The terms() function produces an ordered list of the most frequent words for each topic; the number of words to return is specified in the function call, so let's look at the top 20 per topic:

> terms(lda3,20)
      Topic 1    Topic 2    Topic 3  
 [1,] "american" "new"      "year"   
 [2,] "job"      "america"  "peopl"  
 [3,] "now"      "work"     "know"   
 [4,] "right"    "help"     "nation" 
 [5,] "get"      "one"      "last"   
 [6,] "tax"      "everi"    "take"   
 [7,] "busi"     "need"     "invest" 
 [8,] "energi"   "make"     "govern" 
 [9,] "home"     "world"    "school" 
[10,] "time"     "countri"  "also"   
[11,] "like"     "let"      "cut"    
[12,] "million"  "congress" "two"    
[13,] "give"     "state"    "next"   
[14,] "well"     "want"     "come"   
[15,] "compani"  "tonight"  "deficit"
[16,] "reform"   "first"    "chang"  
[17,] "back"     "futur"    "famili" 
[18,] "educ"     "keep"     "care"   
[19,] "put"      "today"    "economi"
[20,] "unit"     "worker"   "work"

Topic 3 covers the first two speeches. Some key words stand out, such as "invest", "school", "economi", and "deficit". During this time, Congress passed and implemented the $787 billion American Recovery and Reinvestment Act with the goal of stimulating the economy.

Topic 1 covers the next two speeches. Here, the message transitions to "job", "tax", "busi", and what appear to be comments on "energi" policy: a supposedly comprehensive policy put forward under the rhetorical "all of the above" in the 2012 speech. Note the association between that rhetorical phrase (alloftheabov) and job that we saw earlier with findAssocs().

Topic 2 brings us to the last two speeches. There doesn't appear to be a clear theme that rises to the surface as it does for the others. It appears that these speeches were less about specific calls to action and more about what was done and the future vision of the country and the world. In the next section, we will dig into the exact speech content further, along with comparing and contrasting his first State of the Union speech with the most recent one.
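Before moving on, note that topics() only reports the single most likely topic for each speech. To see how strongly each speech loads on all three topics, you can pull the estimated document-topic probabilities with posterior(); the same object also holds the per-topic word distributions in $terms:

> # per-speech topic probabilities from the three-topic model; each row sums to 1
> round(posterior(lda3)$topics, 3)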

Additional quantitative analysis

This portion of the analysis will focus on the power of the qdap package, which allows you to compare multiple documents over a wide array of measures. Our effort will be on comparing the 2010 and 2015 speeches. For starters, we will need to turn the text into data frames, perform sentence splitting, and then combine them into one data frame with a variable that specifies the year of the speech. We will use this as our grouping variable in the analyses (you can include multiple variables in your groups). We will not need any of the other transformations, such as stemming or lowercasing.

Before creating a data frame, we will need to get rid of that pesky (Applause.) text with the gsub() function. Because the parentheses and period are regular expression metacharacters, we pass fixed=TRUE so that the pattern is matched literally. We will also need to load the library:

> library(qdap)

> state15 = gsub("(Applause.)", "", sou2015, fixed=TRUE)

Now, put this in a data frame and split it into sentences, which will put one sentence per row. As proper punctuation is present in the text, we can use the sentSplit() function. If punctuation were not there, other qdap functions are available to detect the sentences:

> speech15 = data.frame(speech=state15)

> sent15 = sentSplit(speech15, "speech")

The last thing is to create the year variable:

> sent15$year = "2015"

Repeat the steps for the 2010 speech:

> state10 = gsub("(Applause.)", "", sou2010, fixed=TRUE)

> speech10 = data.frame(speech=state10)

> sent10 = sentSplit(speech10, "speech")

> sent10$year = "2010"

Now, concatenate the two datasets:

> sentences = rbind(sent10, sent15)
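As a quick sanity check, you can confirm that the combined data frame contains the expected number of sentence rows for each year:

> # sentence rows per speech year
> table(sentences$year)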

To compare the polarity (sentiment scores), use the polarity() function, specifying the text and grouping variables:

> pol = polarity(sentences$speech, sentences$year)

> pol
  year total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 2010             443        7233        0.040       0.319              0.124
2 2015             378        6712        0.098       0.274              0.356

The stan.mean.polarity value represents the standardized mean polarity, which is the average polarity divided by the standard deviation. We see that 2015 was slightly more positive (0.356) than 2010 (0.124), which is in line with what we would expect. You can also plot the data. The plot produces two charts: the first shows the polarity by sentence over time and the second shows the distribution of the polarity:

> plot(pol)

The output of the preceding command is as follows:

[Image: polarity by sentence over time and polarity distribution, grouped by year]

This plot may be a challenge to read in this format, but let me do my best to interpret it. The 2010 speech starts out with a strongly negative sentiment and is, overall, more negative than 2015. We can identify the most negative sentence by creating a data frame from the pol object, finding the sentence number, and calling up that sentence:

> pol.df = pol$all

> which.min(pol.df$polarity)
[1] 12

> pol.df$text.var[12]
[1] "One year ago, I took office amid two wars, an economy rocked by a severe recession, a financial system on the verge of collapse, and a government deeply in debt."

Now that is negative sentiment! We will look at the readability index next:

> ari = automated_readability_index(sentences$speech, sentences$year) 

> ari$Readability
  year word.count sentence.count character.count
1 2010       7207            443           33623
2 2015       6671            378           30469
  Automated_Readability_Index
1                    8.677994
2                    8.906440
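For reference, the Automated Readability Index is the standard formula 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43. Plugging the 2010 counts from the table into that formula reproduces the reported value, which suggests qdap is using the same calculation:

> # 2010 speech: 33623 characters, 7207 words, 443 sentences
> 4.71 * (33623 / 7207) + 0.5 * (7207 / 443) - 21.43
[1] 8.677994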

I think it is no surprise that they are basically the same. Formality analysis is next. This takes a couple of minutes to run in R:

> form = formality(sentences$speech, sentences$year)

> form
  year word.count formality
1 2015       6676     62.49
2 2010       7412     58.88

The formality scores look very similar. We can examine the proportions of the parts of speech and also produce a plot that confirms this, as follows:

> form$form.prop.by
  year word.count  noun   adj  prep articles pronoun
1 2010       7412 24.22 11.39 14.64     6.46   10.75
2 2015       6676 24.94 12.46 16.37     6.34   10.23
   verb adverb interj other
1 21.57   6.58   0.03  4.36
2 19.19   5.69   0.01  4.76

> plot(form)

The following is the output of the preceding command:

[Image: plot of formality and parts-of-speech proportions by year]
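As an aside, the formality score is Heylighen and Dewaele's F-measure, computed from these parts-of-speech percentages as 50 * ((noun + adjective + preposition + article - pronoun - verb - adverb - interjection) / 100 + 1). Assuming that is what qdap implements, the 2015 proportions above reproduce the reported score:

> # 2015 proportions from form$form.prop.by; the result rounds to the reported 62.49
> 50 * ((24.94 + 12.46 + 16.37 + 6.34 - 10.23 - 19.19 - 5.69 - 0.01) / 100 + 1)
[1] 62.495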

Next, we will produce the diversity measures. Again, they are nearly identical. A plot is also available (plot(div)), but being so similar, it adds no value. It is important to note that Obama's speechwriter for 2010 was Jon Favreau, and in 2015, it was Cody Keenan:

> div = diversity(sentences$speech, sentences$year)

> div
  year   wc simpson shannon collision berger_parker brillouin
1 2010 7207   0.992   6.163     4.799         0.047     5.860
2 2015 6671   0.992   6.159     4.791         0.039     5.841

One of my favorite plots is the dispersion plot, which shows the dispersion of a word throughout the text. Let's examine the dispersion of "economy", "jobs", and "families":

> dispersion_plot(sentences$speech, grouping.var=sentences$year, c("economy","jobs","families"), color="black", bg.color="white")
[Image: dispersion plot of economy, jobs, and families in the 2010 and 2015 speeches]

This is quite interesting, as these topics were discussed early on in the 2010 speech but toward the end of the 2015 speech.

Many of the tasks that we performed earlier with the tm package can also be done in qdap. So, the last thing I want to do is show you how to run the word frequency count in qdap and pull the top ten words for each speech. This is easy with the freq_terms() function. In addition to specifying the top ten words, we will also specify one of the stopword lists available in qdap; in this case, Top200Words rather than the smaller Top100Words option:

> freq2010 = freq_terms(sent10$speech, top=10, stopwords=Top200Words)

> freq2010
   WORD       FREQ
1  americans    28
2  that's       26
3  jobs         23
4  it's         20
5  years        19
6  american     18
7  businesses   18
8  those        18
9  families     17
10 last         16

> freq2015 = freq_terms(sent15$speech, top=10, stopwords=Top200Words)

> freq2015
   WORD      FREQ
1  that's      28
2  years       25
3  every       24
4  american    19
5  country     19
6  economy     18
7  jobs        18
8  americans   17
9  lets        17
10 families    16
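As a final quick comparison, you can see which words made the top ten in both years by intersecting the WORD columns of the two objects:

> # words appearing in both top-ten lists
> intersect(freq2010$WORD, freq2015$WORD)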

This completes our analysis of the two speeches. I must confess that I did not listen to any of these speeches. In fact, I haven't watched a State of the Union address since Reagan was President with the exception of the 2002 address. This provided some insight for me on how the topics and speech formats have changed over time to accommodate political necessity, while the overall style of formality and sentence structure has remained consistent. Keep in mind that this code can be adapted to text for dozens, if not hundreds, of documents and with multiple speakers, for example, screenplays, legal proceedings, interviews, social media, and on and on. Indeed, text mining can bring quantitative order to what has been qualitative chaos.
