Additional quantitative analysis

This portion of the analysis focuses on the power of the qdap package, which lets you compare multiple documents across a wide array of measures. We will compare Teddy Roosevelt's 1908 written address with Ronald Reagan's 1982 speech. To start, we need to turn the text into data frames, split it into sentences, and then combine the results into one data frame with a variable that identifies the President. We will use this as our grouping variable in the analysis. Dealing with text data, even in R, can be tricky. The code that follows seemed to work best, in this case, to get the data loaded and ready for analysis. I've created two text files of the addresses that I scraped off the internet. Help yourself to the files on GitHub at https://github.com/datameister66/MMLR3rd.

The files are called tr.txt and reagan.txt.

We will use the readLines() function from base R and collapse the lines into a single string with paste(). I also recommend converting your text encoding to ASCII; otherwise, you may run into some bizarre characters that will mess up your analysis. That is done with the iconv() function:

> tr <- paste(readLines("~/corpus/tr.txt"), collapse=" ")

> tr <- iconv(tr, "latin1", "ASCII", "")
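To see what these two calls do, here is a minimal base R sketch. It assumes a throwaway file created with tempfile() and an invented sample string, neither of which comes from the addresses. It captures the warning that readLines() raises when a file has no newline after its last line, and it shows iconv() dropping a character that has no ASCII equivalent when the substitution string is empty:

```r
# Hypothetical two-line file with no newline after the last line
tmp <- tempfile(fileext = ".txt")
cat("First line of the address.\nSecond line of the address.", file = tmp)

# Capture the warning text instead of letting it print
warn_msg <- NULL
lines <- withCallingHandlers(
  readLines(tmp),
  warning = function(w) {
    warn_msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

# Collapse the lines into one string, as in the chapter code
text <- paste(lines, collapse = " ")

# Invented latin1 string: "caf" followed by byte 0xE9 (e with acute accent)
latin1_text <- paste0(rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))), " society")

# sub = "" removes anything that cannot be represented in ASCII
ascii_text <- iconv(latin1_text, from = "latin1", to = "ASCII", sub = "")
```

Dropping characters is lossy, so it is worth checking a few lines of the converted text before moving on.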

The warning message from readLines() is not an issue; it is just telling us that the final line of the .txt file does not end with a newline character. We now apply the qprep() function from qdap.

This function is a wrapper for a number of other replacement functions and using it will speed up preprocessing, but it should be used with caution if more detailed analysis is required. The functions it passes through are as follows:

  • bracketX(): Applies bracket removal
  • replace_abbreviation(): Replaces abbreviations
  • replace_number(): Converts numbers to words, for example, 100 becomes one hundred
  • replace_symbol(): Symbols become words, for example, @ becomes at

> prep_tr <- qdap::qprep(tr)

The other preprocessing we should do is to replace contractions (can't to cannot), remove stop words (in our case, the top 100), and remove unwanted characters, with the exception of periods, question marks, and exclamation points. They will come in handy shortly:

> prep_tr <- qdap::replace_contraction(prep_tr)

> prep_tr <- qdap::rm_stopwords(prep_tr, Top100Words, separate = FALSE)

> prep_tr <- qdap::strip(prep_tr, char.keep = c("?", ".", "!"))
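The effect of these three steps can be approximated in base R on a single invented sentence. This is only a sketch of the idea, not how qdap implements it: the four-word stop list is a hypothetical stand-in for Top100Words, and only one contraction is handled.

```r
# Invented sample text, not taken from either address
text <- "We can't ignore the facts, friends! Is the economy growing? It is strong."

# Replace one contraction by hand (qdap works from a lookup table)
text <- gsub("can't", "cannot", text, fixed = TRUE)

# Hypothetical four-word stop list standing in for Top100Words
stop_words <- c("we", "the", "is", "it")
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
text <- gsub(pattern, "", tolower(text), perl = TRUE)

# Keep only letters, spaces, and the three sentence end marks
text <- gsub("[^a-z .?!]", "", text)

# Collapse the runs of spaces left behind by the removals
text <- trimws(gsub(" +", " ", text))
```

The end marks survive the cleanup, which is what makes sentence splitting possible in the next step.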

Critical to this analysis is to now split the text into sentences and add what will be the grouping variable, the President who delivered the address. Sentence splitting also creates the tot variable, which stands for turn of talk and serves as an indicator of sentence order. This is especially helpful in a situation where you are analyzing dialogue, say in a debate or question and answer session:

> address_tr <- data.frame(speech = prep_tr)

> address_tr <- qdap::sentSplit(address_tr, "speech")

> address_tr$pres <- "TR"
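As a rough illustration of what this step produces, the following base R sketch splits an invented string on its end marks and builds a small data frame. The tot column here is a plain sentence counter, a simplification assumed for the example; the values that sentSplit() generates may be formatted differently.

```r
# Invented, already-cleaned text
speech <- "cannot ignore facts friends! economy growing? strong."

# Pull out each run of text that ends in ., ?, or !
pieces <- regmatches(speech, gregexpr("[^.?!]+[.?!]", speech))[[1]]

address <- data.frame(
  tot = seq_along(pieces),   # simplified sentence order index
  speech = trimws(pieces),   # one sentence per row
  pres = "TR",               # grouping variable
  stringsAsFactors = FALSE
)
```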

Repeat the steps for the Ronald Reagan speech:

> reagan <- paste(readLines("~/corpus/reagan.txt"), collapse=" ")

> reagan <- iconv(reagan, "latin1", "ASCII", "")

> prep_reagan <- qdap::qprep(reagan)

> prep_reagan <- qdap::replace_contraction(prep_reagan)

> prep_reagan <- qdap::rm_stopwords(prep_reagan, Top100Words, separate = FALSE)

> prep_reagan <- qdap::strip(prep_reagan, char.keep = c("?", ".", "!"))

> address_reagan <- data.frame(speech = prep_reagan)

> address_reagan <- qdap::sentSplit(address_reagan, "speech")

> address_reagan$pres <- "reagan"

Concatenate the two addresses into one data frame:

> sentences <- dplyr::bind_rows(address_tr, address_reagan)

One of the great things about the qdap package is that it facilitates basic text exploration, as we did before. Let's see a plot of frequent terms:

> plot(qdap::freq_terms(sentences$speech))

The output of the preceding command is as follows:

You can create a word frequency matrix that provides the counts for each word by speech:

> wordMat <- qdap::wfm(sentences$speech, sentences$pres)

> head(wordMat[order(wordMat[, 1], wordMat[, 2], decreasing = TRUE), ])
           reagan  TR
our            69 107
us             44  17
let            33  12
government     18  77
years          17  20
america        17   7

This can also be converted into a DTM with the as.dtm() function, should you so desire.
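The idea behind a word frequency matrix can be sketched in base R with table(). The three sentences below are invented for the example, and the object names are hypothetical; the result has one row per word and one column per speaker, which is the same shape as the wfm() output:

```r
# Invented sentences with a grouping variable
toy <- data.frame(
  speech = c("our government serves our people",
             "let us begin",
             "our years ahead let us go"),
  pres = c("TR", "reagan", "reagan"),
  stringsAsFactors = FALSE
)

# One row per word, carrying the speaker along
words <- strsplit(toy$speech, " ")
long <- data.frame(
  word = unlist(words),
  pres = rep(toy$pres, lengths(words)),
  stringsAsFactors = FALSE
)

# Cross-tabulate words by speaker
word_mat <- table(long$word, long$pres)

# Sort by the reagan column, then the TR column, as in the chapter
word_mat <- word_mat[order(word_mat[, "reagan"], word_mat[, "TR"],
                           decreasing = TRUE), ]
```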

Comprehensive word statistics are available. Here are tables of the statistics available in the package. A complete explanation of the statistics is available in the help file for word_stats():

> ws <- qdap::word_stats(sentences$speech, sentences$pres, rm.incomplete = TRUE)

> ws$word.elem
    pres n.sent n.words n.char n.syl n.poly    wps     cps
1     TR    667   12071  80780 25862   3786 18.097 121.109
2 reagan    222    2732  16935  5421    704 12.306  76.284
            sps  psps   cpw   spw  pspw n.hapax n.dis
1     TR 38.774 5.676 6.692 2.142 0.314    1829   639
2 reagan 24.419 3.171 6.199 1.984 0.258     815   191
         grow.rate prop.dis
1     TR     0.152    0.053
2 reagan     0.298    0.070

> ws$sent.elem
  n.state n.quest p.state p.quest
1     667       0   1.000   0.000
2     217       5   0.977   0.023

Notice that Reagan's speech was much shorter than Roosevelt's written address, with a third of the total sentences (222 versus 667). Also, Reagan asked five questions as a rhetorical device, while TR asked none (n.quest 5 versus n.quest 0).
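Several of the per-sentence and per-word columns appear to be simple ratios of the raw counts. The sketch below assumes that wps, cps, cpw, and spw stand for words per sentence, characters per sentence, characters per word, and syllables per word, and recomputes them from the counts printed above:

```r
# Counts copied from the ws$word.elem output
counts <- data.frame(
  pres = c("TR", "reagan"),
  n.sent = c(667, 222),
  n.words = c(12071, 2732),
  n.char = c(80780, 16935),
  n.syl = c(25862, 5421),
  stringsAsFactors = FALSE
)

# Ratios, rounded to three decimals like the printed table
derived <- data.frame(
  pres = counts$pres,
  wps = round(counts$n.words / counts$n.sent, 3),
  cps = round(counts$n.char / counts$n.sent, 3),
  cpw = round(counts$n.char / counts$n.words, 3),
  spw = round(counts$n.syl / counts$n.words, 3),
  stringsAsFactors = FALSE
)

# Share of sentences: Reagan relative to TR
sentence_ratio <- counts$n.sent[2] / counts$n.sent[1]
```

The recomputed values agree with the table, and sentence_ratio comes out at roughly one third.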

To compare the polarity (sentiment scores), use the polarity() function, specifying the text and grouping variables:

> pol <- qdap::polarity(sentences$speech, sentences$pres)

> pol
    pres total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 reagan             222        2732        0.185       0.407              0.456
2     TR             667       12071        0.028       0.501              0.056
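As a quick sanity check on the last column, the sketch below divides the printed average polarity by the printed standard deviation. Because the inputs are already rounded to three decimals, the ratios only approximate the reported stan.mean.polarity values:

```r
# Values copied from the printed polarity summary
pol_summary <- data.frame(
  pres = c("reagan", "TR"),
  ave.polarity = c(0.185, 0.028),
  sd.polarity = c(0.407, 0.501),
  stringsAsFactors = FALSE
)

# Mean divided by standard deviation
pol_summary$ratio <- pol_summary$ave.polarity / pol_summary$sd.polarity

# Difference from the reported column (0.456 and 0.056)
pol_summary$diff <- abs(pol_summary$ratio - c(0.456, 0.056))
```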

The stan.mean.polarity value represents the standardized mean polarity, which is the average polarity divided by the standard deviation. We see that Reagan has higher sentiment than TR (0.456 versus 0.056). This seems expected, as the address has evolved from a written document to Congress into a televised speech. You can also plot the data. The plot produces two charts. The first shows the polarity by sentences over time and the second shows the distribution of the polarity:

> plot(pol)

The output of the preceding command is as follows:

We can identify the most negative sentiment sentence by creating a data frame of the pol object, finding the sentence number, and producing it:

> pol.df <- pol$all

> which.min(pol.df$polarity)
[1] 86

> pol.df$text.var[86]
[1] "mobs frequently avenge commission crime themselves torturing death man committing thus avenging bestial fashion bestial deed reducing themselves level criminal."

Now that is negative sentiment! TR was actually quoting the Governor of Alabama about the horror of lynching. We will look at the readability index next:

> ari <- qdap::automated_readability_index(sentences$speech, sentences$pres)

> ari$Readability
    pres word.count sentence.count character.count
1 reagan       2732            222           16935
2     TR      12071            667           80780
  Automated_Readability_Index
1                    13.91929
2                    19.13838
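The index itself can be reproduced from the three counts in the table. The sketch below uses the standard ARI formula, 4.71 times characters per word plus 0.5 times words per sentence minus 21.43; the helper name ari_score is made up for this example and is not part of qdap:

```r
# Hypothetical helper implementing the standard ARI formula
ari_score <- function(characters, words, sentences) {
  4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
}

# Counts copied from the ari$Readability output
ari_reagan <- ari_score(characters = 16935, words = 2732, sentences = 222)
ari_tr <- ari_score(characters = 80780, words = 12071, sentences = 667)
```

Both results agree with the printed index to several decimal places, so the gap between the two addresses comes from TR's longer words (about 6.7 versus 6.2 characters) and longer sentences (about 18 versus 12 words).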

Roosevelt's Automated Readability Index (ARI) is much higher than Reagan's, a vestige of the language of his era. TR's sentences average 18 words, compared with 12 for Reagan. Formality analysis is next. This takes a couple of minutes to run in R and can exhaust the memory of a laptop or desktop computer. Therefore, we'll take a portion of TR's address (the first 300 sentences), run it separately, and then run it for Reagan:

> tr_sentences <- dplyr::filter(sentences, pres == "TR")

> tr_sentences <- tr_sentences[1:300, ]

> qdap::formality(tr_sentences$speech)
  all word.count formality
1 all       5726     72.08

> reagan_sentences <- dplyr::filter(sentences, pres == "reagan")

> qdap::formality(reagan_sentences$speech)
  all word.count formality
1 all       2732     67.15

TR's address, at least in its first 300 sentences, is slightly more formal than Reagan's speech (72.08 versus 67.15).

Now, we will look at diversity measures. For most of the measures, TR is using a more diverse and richer lexicon than Reagan:

> qdap::diversity(sentences$speech, sentences$pres)
    pres    wc simpson shannon collision berger_parker brillouin
1 reagan  2732   0.998   6.653     5.896         0.025     6.104
2     TR 12071   0.999   7.491     6.659         0.011     7.101
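For intuition about what these columns measure, here is a base R sketch of three of the indices on a four-word toy vector. It uses common textbook definitions, which are an assumption on my part; the exact formulas that qdap applies are described in the help file for diversity() and may differ in detail.

```r
# Toy vocabulary: one word used twice, two words used once
words <- c("peace", "peace", "government", "marksmanship")

n <- as.vector(table(words))  # count per distinct word
N <- sum(n)                   # total number of words
p <- n / N                    # relative frequencies

# Shannon entropy: higher means usage is spread over more words
shannon <- -sum(p * log(p))

# Simpson index: chance that two words drawn without replacement differ
simpson <- 1 - sum(n * (n - 1)) / (N * (N - 1))

# Berger-Parker: share of the single most frequent word (lower is more diverse)
berger_parker <- max(n) / N
```

Read this way, TR's higher shannon value and lower berger_parker value both point to a more varied vocabulary.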

One of my favorite plots is the dispersion plot. This shows the dispersion of a word throughout the text. Let's examine the dispersion of "peace", "government", and "marksmanship":

> qdap::dispersion_plot(
    sentences$speech,
    rm.vars = sentences$pres,
    c("peace", "government", "marksmanship"),
    color = "black",
    bg.color = "white"
  )

The output of the preceding command is as follows:

This is quite interesting as you can visualize how much longer TR's address is, as well as how he structured it to discuss foreign affairs later in the text. We can gain some insight into TR's mind with his discussion on marksmanship as he was looking at Switzerland as a shining example of how a populace could be armed and trained. You can see and understand how text analysis can provide insight into what someone is thinking, what their priorities are, and how they go about communicating them.
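The data behind a dispersion plot is simply the position of each occurrence of a term in the word sequence. A minimal base R sketch on an invented sentence, with hypothetical object names:

```r
# Invented text split into words
words <- strsplit("peace through strength brings peace and good government",
                  " ")[[1]]

# Word positions at which the term occurs
term <- "peace"
positions <- which(words == term)

# Positions as a fraction of the text length, for comparing texts of
# different sizes
relative <- positions / length(words)
```

Plotting positions for several terms on a shared axis gives the kind of picture described above, where a cluster of marks late in the sequence signals a topic raised toward the end of the text.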

This completes our analysis of the two speeches. It provided some insight into how the topics and speech formats have changed over time to accommodate political necessity. Keep in mind that this code can be adapted to dozens, if not hundreds, of documents with multiple speakers, for example, screenplays, legal proceedings, interviews, social media, and so on. Indeed, text mining can bring quantitative order to what has been qualitative chaos.
