Additional quantitative analysis

This portion of the analysis will focus on the power of the qdap package, which allows you to compare multiple documents over a wide array of measures. Our effort will be on comparing the 2010 and 2016 speeches. For starters, we need to turn the text into data frames, perform sentence splitting, and then combine them into one data frame with a variable that specifies the year of the speech. We will use this as our grouping variable in the analyses. Dealing with text data, even in R, can be tricky. The code that follows seemed to work best in this case to get the data loaded and ready for analysis. We first load the qdap package. Then, to bring in the data from a text file, we use the readLines() function from base R, collapsing the results into a single string to eliminate unnecessary line breaks. I also recommend converting your text encoding to ASCII, otherwise you may run into some bizarre characters that will mess up your analysis. That is done with the iconv() function:

    > library(qdap)

    > speech16 <- paste(readLines("sou2016.txt"), collapse=" ")
    Warning message:
    In readLines("sou2016.txt") : incomplete final line found on 'sou2016.txt'

    > speech16 <- iconv(speech16, "latin1", "ASCII", "")

The warning message is not an issue; it is just telling us that the final line of the .txt file does not end with an end-of-line character. We now apply the qprep() function from qdap.

This function is a wrapper for a number of other replacement functions and using it will speed up pre-processing, but it should be used with caution if more detailed analysis is required. The functions it passes through are as follows:

  • bracketX(): removes brackets and the text within them
  • replace_abbreviation(): replaces abbreviations with their long form
  • replace_number(): numbers to words, for example '100' becomes 'one hundred'
  • replace_symbol(): symbols become words, for example @ becomes 'at'

    > prep16 <- qprep(speech16)
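If you need finer control than the wrapper provides, a rough sketch of calling the wrapped functions one at a time might look like the following; prep16_manual is purely an illustrative name and the order of the calls is something you would want to verify against your own text:

    > prep16_manual <- bracketX(speech16)
    > prep16_manual <- replace_abbreviation(prep16_manual)
    > prep16_manual <- replace_number(prep16_manual)
    > prep16_manual <- replace_symbol(prep16_manual)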

The other pre-processing we should do is to replace contractions (can't becomes cannot), remove stopwords (in our case, the top 100), and strip out unwanted characters, with the exception of periods and question marks; they will come in handy shortly:

    > prep16 <- replace_contraction(prep16)

    > prep16 <- rm_stopwords(prep16, Top100Words, separate = F)

    > prep16 <- strip(prep16, char.keep = c("?", "."))

Critical to this analysis is to now split the text into sentences and add what will be the grouping variable, the year of the speech. This also creates the tot variable, which stands for Turn of Talk and serves as an indicator of sentence order; it is especially helpful when you are analyzing dialogue, say in a debate or a question and answer session:

    > sent16 <- data.frame(speech = prep16)

    > sent16 <- sentSplit(sent16, "speech")

    > sent16$year <- "2016"
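As a quick sanity check, assuming the split worked as intended, you can peek at the first few rows; you should see the tot indicator alongside the split speech text and the year we just added:

    > head(sent16, n = 3)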

Repeat the steps for the 2010 speech. This transcript also contains "(Applause.)" annotations, which we strip out with gsub(); adding fixed = TRUE treats the pattern as a literal string rather than a regular expression:

    > speech10 <- paste(readLines("sou2010.txt"), collapse=" ")

    > speech10 <- iconv(speech10, "latin1", "ASCII", "")

    > speech10 <- gsub("(Applause.)", "", speech10, fixed = TRUE)

    > prep10 <- qprep(speech10)

    > prep10 <- replace_contraction(prep10)

    > prep10 <- rm_stopwords(prep10, Top100Words, separate = F)

    > prep10 <- strip(prep10, char.keep = c("?", "."))

    > sent10 <- data.frame(speech = prep10)

    > sent10 <- sentSplit(sent10, "speech")

    > sent10$year <- "2010"

Concatenate the separate years into one data frame:

    > sentences <- data.frame(rbind(sent10, sent16))
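Before moving on, it does not hurt to verify the grouping variable, assuming both splits ran cleanly; the two counts should match the sentence totals reported later by word_stats() and polarity():

    > table(sentences$year)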

One of the great things about the qdap package is that it facilitates basic text exploration, as we did before. Let's see a plot of frequent terms:

    > plot(freq_terms(sentences$speech))

The output of the preceding command is as follows:
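By default, freq_terms() returns the most frequent terms across the whole text (see ?freq_terms for the defaults). If you want to tune the view, a sketch along these lines would restrict the plot to the top ten terms of at least three letters:

    > plot(freq_terms(sentences$speech, top = 10, at.least = 3))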

You can create a word frequency matrix that provides the counts for each word by speech:

    > wordMat <- wfm(sentences$speech, sentences$year)

    > head(wordMat[order(wordMat[, 1], wordMat[, 2], decreasing = TRUE), ])
               2010 2016
    our         120   85
    us           33   33
    year         29   17
    americans    28   15
    why          27   10
    jobs         23    8

This can also be converted into a document-term matrix with the as.dtm() function, should you so desire.
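A minimal sketch of that conversion, with speechDTM as a purely illustrative name, would be as follows; the resulting object can then be handed to functions that expect a document-term matrix:

    > speechDTM <- as.dtm(wordMat)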

Let's next build word clouds by year with qdap functionality:

    > trans_cloud(sentences$speech, sentences$year, min.freq = 10)

The preceding command produces the following two images:

Comprehensive word statistics are available. Here is a plot of the stats available in the package. The plot loses some of its visual appeal with just two speeches, but is revealing nonetheless. A complete explanation of the stats is available under ?word_stats:

    > ws <- word_stats(sentences$speech, sentences$year, rm.incomplete = T)

    > plot(ws, label = T, lab.digits = 2)

The output of the preceding command is as follows:

Notice that the 2016 speech was much shorter, with over a hundred fewer sentences and almost a thousand fewer words. Also, questions appear to have been used as a rhetorical device more heavily in 2016 than in 2010 (n.quest 10 versus n.quest 4).

To compare the polarity (sentiment scores), use the polarity() function, specifying the text and grouping variables:

    > pol = polarity(sentences$speech, sentences$year)

    > pol
      year total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
    1 2010             435        3900        0.052       0.432              0.121
    2 2016             299        2982        0.105       0.395              0.267

The stan.mean.polarity value represents the standardized mean polarity, which is the average polarity divided by the standard deviation. We see that 2016 was slightly higher (0.267) than 2010 (0.121). This is in line with what we would expect from a final address wanting to end on a more positive note.
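As a quick check of that definition, you can reproduce the 2016 figure from the rounded values in the preceding table; the small difference from the reported 0.267 is just rounding of the printed inputs:

    > 0.105 / 0.395
    [1] 0.2658228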

You can also plot the data. The plot produces two charts: the first shows the polarity by sentences over time and the second shows the distribution of the polarity:

    > plot(pol)

The output of the preceding command is as follows:

This plot may be a challenge to read in this text, but let me do my best to interpret it. The 2010 speech starts out with a strongly negative sentiment and is slightly more negative overall than 2016. We can identify the most negative sentence by creating a data frame from the pol object, finding the sentence number, and printing that sentence:

    > pol.df <- pol$all

    > which.min(pol.df$polarity)
    [1] 12

    > pol.df$text.var[12]
    [1] "One year ago, I took office amid two wars, an economy rocked by a
    severe recession, a financial system on the verge of collapse, and a
    government deeply in debt."

Now that is negative sentiment! Ironically, the government is even more in debt today.
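By the same token, a minimal sketch for pulling out the most positive sentence, which we did not examine above, is simply the mirror image of that code:

    > pol.df$text.var[which.max(pol.df$polarity)]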

We will look at the readability index next:

    > ari <- automated_readability_index(sentences$speech, sentences$year)

    > ari$Readability
      year word.count sentence.count character.count Automated_Readability_Index
    1 2010       3900            435           23859                    11.86709
    2 2016       2982            299           17957                    11.91929

I think it is no surprise that they are basically the same. Formality analysis is next. This takes a couple of minutes to run in R:

    > form <- formality(sentences$speech, sentences$year)

    > form
      year word.count formality
    1 2016       2983     65.61
    2 2010       3900     63.88

This looks to be very similar. We can also examine the proportions of the parts of speech. A plot is available, but it adds nothing to the analysis in this instance:

    > form$form.prop.by
      year word.count  noun   adj prep articles pronoun  verb adverb interj other
    1 2010       3900 44.18 15.95 3.67        0    4.51 23.49   7.77   0.05  0.38
    2 2016       2982 43.46 17.37 4.49        0    4.96 21.73   7.41   0.00  0.57

Now, the diversity measures are produced. Again, they are nearly identical. A plot is also available (plot(div)), but with the measures being so similar, it once again adds no value. It is worth noting that Obama's speechwriter for 2010 was Jon Favreau, while in 2016 it was Cody Keenan:

    > div <- diversity(sentences$speech, sentences$year)

    > div
      year   wc simpson shannon collision berger_parker brillouin
    1 2010 3900   0.998   6.825     5.970         0.031     6.326
    2 2016 2982   0.998   6.824     6.008         0.029     6.248

One of my favorite plots is the dispersion plot, which shows the dispersion of a word throughout the text. Let's examine the dispersion of "security", "jobs", and "economy":

    > dispersion_plot(sentences$speech,
          rm.vars = sentences$year,
          c("security", "jobs", "economy"),
          color = "black", bg.color = "white")

The output of the preceding command is as follows:

This is quite interesting, as you can visualize how much longer the 2010 speech is. In 2010, the first half of the speech focused heavily on jobs, while in 2016 the focus appears to have been more on the state of the overall economy, and no doubt on how much of a hand he played in saving it from the brink of disaster. In 2010, security was not brought up until later in the speech, whereas it was sprinkled throughout the final address. You can see and understand how text analysis can provide insight into what someone is thinking, what their priorities are, and how they go about communicating them.

This completes our analysis of the two speeches. I must confess that I did not listen to either of these speeches. In fact, I haven't watched a State of the Union address since Reagan was president, with the probable exception of the 2002 address. This exercise provided some insight for me into how the topics and speech formats have changed over time to accommodate political necessity, while the overall style of formality and sentence structure has remained consistent. Keep in mind that this code can be adapted to dozens, if not hundreds, of documents and to text with multiple speakers, for example, screenplays, legal proceedings, interviews, social media, and so on. Indeed, text mining can bring quantitative order to what has been qualitative chaos.
