Data preparation

We'll start by filtering to the pertinent time period (addresses from 1900 onward). Then, we'll take a look at a table of the labels:

> sotu_party <- sotu_meta %>%
    dplyr::filter(year > 1899)

> table(sotu_party$party)

Democratic Republican
        61         64

The classes are well balanced.
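If you prefer to see the balance as proportions rather than raw counts, a quick check with base R's prop.table() (not part of the original workflow, just a convenience) does the job:

> prop.table(table(sotu_party$party))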

A few things can help in the modeling process. It is a good idea here to remove capitalization, remove stop words, remove numbers and punctuation, and stem the words. The built-in functions from the tm package are handy for this, and we can apply them directly to the text column of the data frame:

> sotu_party$text <- tolower(sotu_party$text)

> sotu_party$text <- tm::removeWords(sotu_party$text, tm::stopwords("en"))

> sotu_party$text <- tm::removeNumbers(sotu_party$text)

> sotu_party$text <- tm::removePunctuation(sotu_party$text)

> sotu_party$text <- tm::stemDocument(sotu_party$text)

Now we can go ahead and create train and test datasets using caret as before:

> set.seed(222)

> index <- caret::createDataPartition(sotu_party$party, p = 0.8, list = F)

> train <- sotu_party[index, ]

> test <- sotu_party[-index, ]

The objective now is to create a word-based tokenizer function for the training data. It is also important to specify a document ID, which here will be the values of the year column. We will apply this function to our test data as well:

> tok_fun = text2vec::word_tokenizer

> it_train = text2vec::itoken(
    train$text,
    tokenizer = tok_fun,
    ids = train$year,
    progressbar = FALSE
  )

Now the create_vocabulary() function will create a data frame of the word, its total count, and the number of documents in which it appears:

> vocab = text2vec::create_vocabulary(it_train)

This produces a vocabulary of 13,541 words. One consideration is the extent to which you want to remove sparse words, even before doing anything else. In this example, removing any word that occurs fewer than four times reduces the vocabulary to 5,321 words:

> pruned <- text2vec::prune_vocabulary(vocab, term_count_min = 4)
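Because the vocabulary is an ordinary data frame, standard inspection works if you want a quick look at what survived pruning. Here is a minimal sketch, assuming the term_count column that create_vocabulary() returns:

> head(pruned[order(-pruned$term_count), ])

You can also confirm the term counts quoted above with nrow(vocab) and nrow(pruned).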

Before creating the DTM, you must create an object that maps the terms to column indices. This is done with the vocab_vectorizer() function:

> vectorizer = text2vec::vocab_vectorizer(pruned)

We now create the DTM with the structure of a sparse matrix:

> dtm_train = text2vec::create_dtm(it_train, vectorizer)

> dim(dtm_train)
[1] 101 5321

You can see that the matrix has 101 observations, one for each year in the training data, and a column for each word. The final transformation prior to modeling is to turn the raw counts in the matrix into tf-idf values. This acts as a type of data normalization by identifying how important a word is in a specific document relative to its overall frequency in all documents. The calculation is to divide the frequency of a word in a document by the total number of words in that document (tf). This is then multiplied by log(number of documents / number of documents containing the word), which is the idf. Said another way, it downweights terms that appear in most documents and upweights terms that are distinctive to only a few.
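To make the arithmetic concrete, here is a minimal sketch that computes the tf-idf weight of a single term by hand on a tiny two-document example; the documents and the term "tax" are purely illustrative, and text2vec's TfIdf class applies its own normalization options, so its exact values may differ:

> docs <- list(d1 = c("tax", "tax", "war"), d2 = c("war", "peace"))

> tf <- sapply(docs, function(d) sum(d == "tax") / length(d))

> idf <- log(length(docs) / sum(sapply(docs, function(d) "tax" %in% d)))

> tf * idf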

We do this by defining the tf-idf model and applying it to the training data:

> tfidf = text2vec::TfIdf$new()

> dtm_train_tfidf = text2vec::fit_transform(dtm_train, tfidf)

You can apply this process to the test data in a similar fashion. Note that we reuse the same vectorizer and the fitted tfidf model, so the test data is mapped into exactly the same feature space as the training data:

> it_test = text2vec::itoken(
    test$text,
    tokenizer = tok_fun,
    ids = test$year,
    progressbar = FALSE
  )

> dtm_test_tfidf = text2vec::create_dtm(it_test, vectorizer)

> dtm_test_tfidf = transform(dtm_test_tfidf, tfidf)

We now have our feature space created to begin classification modeling.
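As a final sanity check (optional, and not part of the original workflow), you can confirm that the test matrix has the same columns, in the same order, as the training matrix:

> identical(colnames(dtm_train_tfidf), colnames(dtm_test_tfidf))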
