Additional quantitative analysis

This portion of the analysis focuses on the power of the qdap package, which lets you compare multiple documents over a wide array of measures. We will compare Teddy Roosevelt's 1908 written address with Ronald Reagan's 1982 speech. For starters, we need to turn the text into data frames, perform sentence splitting, and then combine them into one data frame with a variable that identifies the President. We will use this as our grouping variable in the analysis. Dealing with text data, even in R, can be tricky. The code that follows seemed to work best, in this case, to get the data loaded and ready for analysis. I've created two text files of the addresses that I scraped off the internet. Help yourself to the files on GitHub at https://github.com/datameister66/MMLR3rd.

The files are called tr.txt and reagan.txt.

We will use the readLines() function from base R and collapse the resulting lines into a single string. I also recommend converting your text encoding to ASCII; otherwise, you may run into some bizarre characters that will mess up your analysis. That is done with the iconv() function, where the final argument ("") tells it to drop any character that cannot be converted:

> tr <- paste(readLines("~/corpus/tr.txt"), collapse=" ")

> tr <- iconv(tr, "latin1", "ASCII", "")
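To see the same load-and-clean pattern in isolation, here is a minimal sketch that uses a temporary file instead of tr.txt. The file contents and the names tmp, toy_lines, and toy are made up for illustration; only writeBin(), readLines(), paste(), and iconv() from base R are used.

```r
# Build a tiny two-line file that lacks a trailing newline and contains one
# non-ASCII (latin1) byte, 0xE9, which is "e" with an acute accent.
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("First line.\nSecond line caf"), as.raw(0xE9), charToRaw(".")),
  tmp
)

# readLines() returns one element per line; warn = FALSE silences the
# "incomplete final line" warning caused by the missing trailing newline.
toy_lines <- readLines(tmp, warn = FALSE)

# Collapse the lines into a single string separated by spaces.
toy <- paste(toy_lines, collapse = " ")

# Convert to ASCII; characters that cannot be represented are replaced by
# the empty string given in sub, that is, they are dropped.
toy <- iconv(toy, from = "latin1", to = "ASCII", sub = "")

print(toy)
unlink(tmp)
```

Dropping characters is a blunt instrument, so it is worth checking a sample of the cleaned text before moving on.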

When loading the address, readLines() may issue a warning (incomplete final line found). It is not an issue, as it is just telling us that the .txt file does not end with a newline character. We now apply the qprep() function from qdap.

This function is a wrapper for a number of other replacement functions and using it will speed up preprocessing, but it should be used with caution if more detailed analysis is required. The functions it passes through are as follows:

bracketX(): Applies bracket removal
replace_abbreviation(): Replaces abbreviations
replace_number(): Converts numbers to words, for example, 100 becomes one hundred
replace_symbol(): Symbols become words, for example, @ becomes at

> prep_tr <- qdap::qprep(tr)

The other preprocessing we should do is to replace contractions (can't to cannot), remove stop words (in our case, the top 100), and remove unwanted characters, with the exception of periods, question marks, and exclamation points. They will come in handy shortly:

> prep_tr <- qdap::replace_contraction(prep_tr)

> prep_tr <- qdap::rm_stopwords(prep_tr, Top100Words, separate = FALSE)

> prep_tr <- qdap::strip(prep_tr, char.keep = c("?", ".", "!"))

Critical to this analysis is to now split the text into sentences and add what will be the grouping variable, the President who delivered the address. Sentence splitting also creates the tot variable, which stands for turn of talk and serves as an indicator of sentence order. This is especially helpful in a situation where you are analyzing dialogue, say in a debate or question and answer session:

> address_tr <- data.frame(speech = prep_tr)

> address_tr <- qdap::sentSplit(address_tr, "speech")

> address_tr$pres <- "TR"
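For intuition about the shape of the resulting data frame, the following base R sketch splits a made-up string on sentence-ending punctuation and adds an order index and a grouping column. It is only an approximation: the regular expression is an assumption of mine, not the rule sentSplit() applies, and the tot column here is a plain counter standing in for the one qdap creates.

```r
# Toy text that has already been lower-cased and stripped, as above.
toy_text <- "peace matters. what shall congress do? act now!"

# Split on whitespace that follows ., ?, or !. The lookbehind keeps the
# punctuation attached to the sentence it ends.
toy_parts <- strsplit(toy_text, "(?<=[.?!])\\s+", perl = TRUE)[[1]]

# One row per sentence, with a sentence-order index and the grouping variable.
toy_address <- data.frame(
  tot = seq_along(toy_parts),
  speech = toy_parts,
  pres = "TR",
  stringsAsFactors = FALSE
)

print(toy_address)
```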

Repeat the steps for the Ronald Reagan speech:

> reagan <- paste(readLines("~/corpus/reagan.txt"), collapse=" ")

> reagan <- iconv(reagan, "latin1", "ASCII", "")

> prep_reagan <- qdap::qprep(reagan)

> prep_reagan <- qdap::replace_contraction(prep_reagan)

> prep_reagan <- qdap::rm_stopwords(prep_reagan, Top100Words, separate = FALSE)

> prep_reagan <- qdap::strip(prep_reagan, char.keep = c("?", ".", "!"))

> address_reagan <- data.frame(speech = prep_reagan)

> address_reagan <- qdap::sentSplit(address_reagan, "speech")

> address_reagan$pres <- "reagan"
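Because the same steps run once per address, they could be wrapped in a small function. The sketch below is one way to do that; prep_address() is a hypothetical helper of my own, not part of qdap, and it assumes that Top100Words can be reached as qdapDictionaries::Top100Words (the dictionary package that qdap loads).

```r
# Hypothetical helper (not a qdap function): load one address from `path`,
# preprocess it with the steps shown above, split it into sentences, and
# label every sentence with the grouping value `pres`.
prep_address <- function(path, pres) {
  raw <- paste(readLines(path), collapse = " ")
  raw <- iconv(raw, "latin1", "ASCII", "")

  txt <- qdap::qprep(raw)
  txt <- qdap::replace_contraction(txt)
  # Assumption: the stop word list is available from qdapDictionaries.
  txt <- qdap::rm_stopwords(txt, qdapDictionaries::Top100Words, separate = FALSE)
  txt <- qdap::strip(txt, char.keep = c("?", ".", "!"))

  out <- qdap::sentSplit(data.frame(speech = txt), "speech")
  out$pres <- pres
  out
}

# Usage, assuming the files sit in ~/corpus as in the earlier commands:
# address_tr     <- prep_address("~/corpus/tr.txt", "TR")
# address_reagan <- prep_address("~/corpus/reagan.txt", "reagan")
```

With more than a handful of documents, a wrapper like this avoids copy-and-paste mistakes such as forgetting to rename a variable.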

Concatenate the two addresses into one data frame:

> sentences <- dplyr::bind_rows(address_tr, address_reagan)

One of the great things about the qdap package is that it facilitates basic text exploration, as we did before. Let's see a plot of frequent terms:

> plot(qdap::freq_terms(sentences$speech))

The output of the preceding command is as follows:

You can create a word frequency matrix that provides the counts for each word by speech:

> wordMat <- qdap::wfm(sentences$speech, sentences$pres)

> head(wordMat[order(wordMat[, 1], wordMat[, 2], decreasing = TRUE), ])
           reagan  TR
our            69 107
us             44  17
let            33  12
government     18  77
years          17  20
america        17   7

This can also be converted into a DTM with the as.dtm() function, should you so desire.
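To make the structure of a word frequency matrix concrete, here is a small base R sketch that builds one from a made-up data frame. It uses table() rather than wfm(), so it only illustrates the idea of counting words by group; the toy sentences and the names toy, long, and toy_mat are assumptions for illustration.

```r
# Toy data in the same layout as `sentences`: one row per sentence plus a
# grouping variable.
toy <- data.frame(
  speech = c("our government our people", "let us begin", "our years"),
  pres = c("TR", "reagan", "reagan"),
  stringsAsFactors = FALSE
)

# Tokenize on whitespace and repeat each sentence's group label once per word.
tokens <- strsplit(toy$speech, "\\s+")
long <- data.frame(
  word = unlist(tokens),
  pres = rep(toy$pres, lengths(tokens)),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are words, columns are groups, cells are counts.
toy_mat <- unclass(table(long$word, long$pres))

print(toy_mat)
```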

Comprehensive word statistics are available. Here are tables of the statistics that the package produces. A complete explanation of each statistic is available in the help page for word_stats (?word_stats):

> ws <- qdap::word_stats(sentences$speech, sentences$pres, rm.incomplete = TRUE)

> ws$word.elem
    pres    n.sent  n.words n.char n.syl n.poly     wps     cps 
1     TR       667    12071  80780 25862   3786  18.097 121.109
2 reagan       222     2732  16935  5421    704  12.306  76.284
               sps     psps    cpw   spw   pspw n.hapax   n.dis
1     TR    38.774    5.676   6.692 2.142 0.314    1829     639
2 reagan    24.419    3.171   6.199 1.984 0.258     815     191
         grow.rate prop.dis
1     TR     0.152    0.053
2 reagan     0.298    0.070

> ws$sent.elem
  n.state n.quest p.state p.quest
1     667       0   1.000   0.000
2     217       5   0.977   0.023

Notice that Reagan's speech was much shorter than Roosevelt's written address, with a third as many sentences (222 versus 667). Also, Reagan asked questions five times as a rhetorical device, while TR asked none (n.quest 5 versus n.quest 0).

To compare the polarity (sentiment scores), use the polarity() function, specifying the text and grouping variables:

> pol <- qdap::polarity(sentences$speech, sentences$pres)

> pol
    pres total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 reagan             222        2732        0.185       0.407              0.456
2 TR                 667       12071        0.028       0.501              0.056
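As a quick sanity check on this table, the stan.mean.polarity column can be reproduced by dividing ave.polarity by sd.polarity. The sketch below uses the rounded values printed above, so small differences in the third decimal place are expected; pol_summary is just an illustrative name.

```r
# Rounded values copied from the printed polarity table.
pol_summary <- data.frame(
  pres = c("reagan", "TR"),
  ave.polarity = c(0.185, 0.028),
  sd.polarity = c(0.407, 0.501)
)

# Standardized mean polarity: average polarity divided by its standard deviation.
pol_summary$stan.mean <- pol_summary$ave.polarity / pol_summary$sd.polarity

print(round(pol_summary$stan.mean, 3))
```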

The stan.mean.polarity value represents the standardized mean polarity, which is the average polarity divided by the standard deviation. We see that Reagan has a higher average sentiment than TR (0.185 versus 0.028). This seems expected, as the address has evolved from a written document sent to Congress into a televised speech. You can also plot the data. The plot produces two charts. The first shows the polarity by sentences over time and the second shows the distribution of the polarity:

> plot(pol)

The output of the preceding command is as follows:

We can identify the most negative sentiment sentence by creating a data frame of the pol object, finding the sentence number, and producing it:

> pol.df <- pol$all

> which.min(pol.df$polarity)
[1] 86

> pol.df$text.var[86]
[1] "mobs frequently avenge commission crime themselves torturing death man committing thus avenging bestial fashion bestial deed reducing themselves level criminal."

Now that is negative sentiment! TR was actually quoting the Governor of Alabama about the horror of lynching. We will look at the readability index next:

> ari <- qdap::automated_readability_index(sentences$speech, sentences$pres)

> ari$Readability
    pres word.count sentence.count character.count
1 reagan       2732            222           16935
2     TR      12071            667           80780
  Automated_Readability_Index
1                    13.91929
2                    19.13838
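The index can be checked by hand from the counts in this output. The sketch below assumes the standard Automated Readability Index formula, 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43; ari_by_hand() is a hypothetical helper, not a qdap function. With the counts above, it reproduces both table values to within rounding.

```r
# Standard ARI formula (assumed here; see the lead-in).
ari_by_hand <- function(chars, words, sentences) {
  4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
}

# Counts copied from the readability output above.
ari_tr <- ari_by_hand(chars = 80780, words = 12071, sentences = 667)
ari_reagan <- ari_by_hand(chars = 16935, words = 2732, sentences = 222)

print(round(c(TR = ari_tr, reagan = ari_reagan), 3))
```

The formula makes clear why TR scores higher: both his characters per word and his words per sentence exceed Reagan's.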

Roosevelt's Automated Readability Index (ARI) is much higher than Reagan's, a vestige of the language of his era. TR's sentences average 18 words, versus about 12 for Reagan. Formality analysis is next. It takes a couple of minutes to run in R and can exhaust the memory of a laptop or desktop computer. Therefore, we'll take a portion of TR's address (the first 300 sentences), run it separately, and then run the analysis for Reagan:

> tr_sentences <- dplyr::filter(sentences, pres == "TR")

> tr_sentences <- tr_sentences[1:300, ]

> qdap::formality(tr_sentences$speech)
  all word.count formality
1 all       5726     72.08

> reagan_sentences <- dplyr::filter(sentences, pres == "reagan")

> qdap::formality(reagan_sentences$speech)
  all word.count formality
1 all       2732     67.15

TR is slightly more formal than Reagan.

Now, we will look at diversity measures. On most of the measures, TR uses a more diverse and richer lexicon than Reagan (for berger_parker, a lower value means the most common word is less dominant):

> qdap::diversity(sentences$speech, sentences$pres)
    pres    wc simpson shannon collision berger_parker brillouin
1 reagan  2732   0.998   6.653     5.896         0.025     6.104
2     TR 12071   0.999   7.491     6.659         0.011     7.101
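To show what a few of these columns measure, here is a base R sketch on a toy vector of words. It assumes the textbook definitions (Shannon entropy with natural logs, Simpson's index without replacement, and Berger-Parker as the share of the most frequent word); I have not verified that qdap computes them in exactly this way, and toy_tokens is made up.

```r
# Toy token vector: "our" x3, "peace" x2, "government" x1.
toy_tokens <- c("our", "our", "our", "government", "peace", "peace")

n <- as.vector(table(toy_tokens))  # count per distinct word
N <- sum(n)                        # total number of words
p <- n / N                         # relative frequency per distinct word

# Shannon entropy: higher means word use is spread more evenly.
shannon <- -sum(p * log(p))

# Simpson's index: chance that two words drawn without replacement differ.
simpson <- 1 - sum(n * (n - 1)) / (N * (N - 1))

# Berger-Parker: share of the single most frequent word (lower = more diverse).
berger_parker <- max(n) / N

print(round(c(shannon = shannon, simpson = simpson, berger_parker = berger_parker), 3))
```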

One of my favorite plots is the dispersion plot. This shows the dispersion of a word throughout the text. Let's examine the dispersion of "peace", "government", and "marksmanship":

> qdap::dispersion_plot(
    sentences$speech,
    match.terms = c("peace", "government", "marksmanship"),
    rm.vars = sentences$pres,
    color = "black",
    bg.color = "white"
  )

The output of the preceding command is as follows:

This is quite interesting as you can visualize how much longer TR's address is, as well as how he structured it to discuss foreign affairs later in the text. We can gain some insight into TR's mind with his discussion on marksmanship as he was looking at Switzerland as a shining example of how a populace could be armed and trained. You can see and understand how text analysis can provide insight into what someone is thinking, what their priorities are, and how they go about communicating them.
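Conceptually, a dispersion plot marks the word positions at which each term occurs. The following base R sketch finds those positions for a made-up word sequence; it illustrates the idea only and is not how dispersion_plot() is implemented.

```r
# Toy word sequence standing in for one address, in reading order.
toy_words <- c("peace", "government", "trade", "peace",
               "navy", "marksmanship", "peace")

terms <- c("peace", "government", "marksmanship")

# For each term, the indices in toy_words where it appears.
positions <- lapply(setNames(terms, terms), function(term) {
  which(toy_words == term)
})

print(positions)
```

Plotting these indices on a horizontal axis, one row per term, gives the kind of picture discussed above.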

This completes our analysis of the two speeches. It provided some insight into how the topics and speech formats have changed over time to accommodate political necessity. Keep in mind that this code can be adapted to dozens, if not hundreds, of documents with multiple speakers, for example, screenplays, legal proceedings, interviews, social media, and so on. Indeed, text mining can bring quantitative order to what has been qualitative chaos.
