Building a text sentiment classifier with the BoW approach

The BoW approach converts the review text into a matrix. It represents each document as a bag of words, recording which words occur and how often while ignoring word order and meaning. Each row of the matrix represents one review (called a document in NLP), and the columns represent the universal set of words present across all the reviews. For each document and each word, either the presence of the word or its frequency in that document is recorded. The resulting matrix of word frequency vectors represents the document set. This methodology is used both to create the input datasets required to train the models and to prepare the test dataset on which the trained models perform text classification. Now that we understand the BoW motivation, let's jump into implementing the steps to build a sentiment analysis classifier based on this approach, as shown in the following code block:

# including the required libraries
library(SnowballC)
library(tm)
# setting the working directory where the text reviews dataset is located
# recollect that we pre-processed and transformed the raw dataset format
setwd('/home/sunil/Desktop/sentiment_analysis/')
# reading the transformed file as a dataframe
text <- read.table(file='Sentiment Analysis Dataset.csv', sep=',',header = TRUE)
# checking the dataframe to confirm everything is intact
print(dim(text))
View(text)

This will result in the following output:

> print(dim(text))
[1] 1000 2
> View(text)
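
Before moving on to the corpus, the matrix form described at the start of this section can be made concrete with a toy example. The following is a minimal base R sketch, not the tm implementation; the three reviews and the names docs, vocab, and bow are made up for illustration.

```r
# Toy reviews (made up for illustration; not taken from the dataset)
docs <- c("great book great read",
          "boring book",
          "great time")

# Split each document on spaces to obtain its words
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is the set of distinct words across all documents
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)

# One row per document, one column per vocabulary word
bow <- t(counts)
dimnames(bow) <- list(paste0("doc", seq_along(docs)), vocab)
print(bow)
```

Word order is lost in bow: the first review would produce the same row if its words were shuffled.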

The first step in processing text data involves creating a corpus, which is a collection of text documents. The VCorpus function in the tm package converts the review comments column of the data frame into a volatile corpus. This can be achieved through the following code:

# transforming the text into volatile corpus
train_corp <- VCorpus(VectorSource(text$SentimentText))
print(train_corp)

This will result in the following output:

> print(train_corp)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1000

From the volatile corpus, we create a Document Term Matrix (DTM). A DTM is a sparse matrix that is created using the tm library's DocumentTermMatrix function. The rows of the matrix indicate documents and the columns indicate features, that is, words. The matrix is sparse because every unique unigram in the dataset becomes a column in the DTM and, as each review comment contains only a small fraction of those unigrams, most cells hold a 0, indicating the absence of the unigram.
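
To illustrate what sparse means here, the following base R sketch computes the share of zero cells in a small, hypothetical count matrix and then lists only its non-zero cells as (row, column, value) triplets. Storing triplets instead of the full grid is the general idea behind sparse matrix formats; the values and variable names are assumptions for illustration.

```r
# Hypothetical document-term counts: 4 documents (rows), 6 terms (columns)
m <- matrix(c(2, 0, 0, 1, 0, 0,
              0, 1, 0, 0, 0, 0,
              0, 0, 0, 0, 3, 0,
              1, 0, 0, 0, 0, 1),
            nrow = 4, byrow = TRUE)

# Sparsity is the fraction of cells that are zero
nonzero <- sum(m != 0)
sparsity <- 1 - nonzero / length(m)
print(sparsity)

# Keep only the non-zero cells as (row, col, value) triplets
triplets <- cbind(which(m != 0, arr.ind = TRUE), value = m[m != 0])
print(triplets)
```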

It is also possible to extract n-grams (unigrams, bigrams, trigrams, and so on) as part of the BoW approach: a custom tokenizer can be passed through the tokenize parameter of the control list in the DocumentTermMatrix function to produce n-grams in the DTM. It must be noted that using n-grams creates a very high number of columns in the DTM. This is one of the drawbacks of the BoW approach, and, in some cases, it could stall the execution of the project due to limited memory. As our specific case is also limited by hardware infrastructure, we restrict ourselves to unigrams in the DTM in this project. Apart from generating unigrams, we also perform some additional processing on the review text by passing parameters to the control list in the tm library's DocumentTermMatrix function. The processing we do on the review text documents during the creation of the DTM is given here:

  1. Change the case of the text to lowercase.
  2. Remove any numbers.
  3. Remove stop words using the English language stop word list from the Snowball stemmer project. Stop words are common words, such as a, an, in, and the, that do not add value in deciding sentiment based on review comments.
  4. Remove punctuation.
  5. Perform stemming, which reduces a word to its base form by stripping affixes such as the plural s from nouns or the ing from verbs. A stem stands for a group of words with equal or very similar meaning. After the stemming process, every word is represented by its stem. The SnowballC library provides the capability to obtain the stem for each of the words in the review comments.
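
The steps above can be sketched on a single string in base R. This is an illustration only and not how tm implements them: the stop word list is a tiny hand-picked assumption, the function name preprocess and its arguments are hypothetical, punctuation is stripped before stop words so that a word followed by a comma still matches, and stemming runs only if the SnowballC package is installed.

```r
# A tiny hand-picked stop word list (assumption; real lists are much longer)
stop_words <- c("a", "an", "in", "the", "is", "and")

# Hypothetical helper that applies the five steps to one review
preprocess <- function(review, stops = stop_words, stem = TRUE) {
  x <- tolower(review)                         # lowercase
  x <- gsub("[[:digit:]]+", "", x)             # remove numbers
  x <- gsub("[[:punct:]]+", "", x)             # remove punctuation
  words <- strsplit(x, "[[:space:]]+")[[1]]    # split into words
  words <- words[nzchar(words)]                # drop empty strings
  words <- words[!words %in% stops]            # remove stop words
  # Stem only when requested and when SnowballC is available
  if (stem && requireNamespace("SnowballC", quietly = TRUE)) {
    words <- SnowballC::wordStem(words, language = "english")
  }
  words
}

print(preprocess("The 2 books, in a BOX!", stem = FALSE))
```

With stem = FALSE, only the words books and box survive the cleaning steps.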

Let's now create a DTM from the volatile corpus and do the text preprocessing with the following code block:

# creating document term matrix
dtm_train <- DocumentTermMatrix(train_corp, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))
# Basic EDA on dtm
inspect(dtm_train)

This will result in the following output:

> inspect(dtm_train)
<<DocumentTermMatrix (documents: 1000, terms: 5794)>>
Non-/sparse entries: 34494/5759506
Sparsity : 99%
Maximal term length: 21
Weighting : term frequency (tf)
Sample :
     Terms
Docs  book can get great just like love one read time
  111    0   3   2     0    0    0    2   1    0    2
  162    4   1   0     0    0    1    0   0    1    0
  190    0   0   0     0    0    0    0   0    0    0
  230    0   1   1     0    0    0    1   0    0    0
  304    0   0   0     0    0    3    0   2    0    0
  399    0   0   0     0    0    0    0   0    0    0
  431    9   1   0     0    0    1    2   0    0    1
  456    1   0   0     0    0    0    0   1    2    0
  618    0   2   3     1    4    1    3   1    0    1
  72     0   0   1     0    2    0    0   1    0    1

We see from the output that 1,000 documents were processed and form the rows of the matrix. There are 5,794 columns representing the unique unigrams that remain in the reviews after the additional text processing. We also see that the DTM is 99% sparse, with non-zero entries in only 34,494 of its cells. Each non-zero cell holds the frequency of the column's word in the document represented by the row. The weighting is the default term frequency weighting, as we did not specify any weighting parameter in the control list supplied to the DocumentTermMatrix function. Other forms of weighting, such as term frequency-inverse document frequency (TF-IDF), are also possible by passing the appropriate weighting parameter (for example, weighting = weightTfIdf) in the control list to the DocumentTermMatrix function. For now, we will stick to the default term frequency weighting. The inspect function also prints a sample of documents along with their term frequencies.
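
To show what TF-IDF weighting changes relative to plain term frequency, here is a hand-rolled base R sketch on hypothetical counts. It uses the common form tf * log2(N / df); tm's own weightTfIdf may differ in details such as normalization by document length, so treat this as an illustration of the idea rather than a reproduction of the library's numbers.

```r
# Hypothetical term-frequency matrix: 3 documents, 3 terms
tf <- matrix(c(2, 1, 0,
               0, 1, 0,
               1, 1, 3),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:3), c("book", "read", "slow")))

n_docs <- nrow(tf)
doc_freq <- colSums(tf > 0)      # number of documents containing each term
idf <- log2(n_docs / doc_freq)   # rarer terms receive a larger weight

# Scale every column of counts by that term's inverse document frequency
tfidf <- sweep(tf, MARGIN = 2, STATS = idf, FUN = "*")
print(round(tfidf, 3))
```

The term read occurs in every document, so its weight drops to zero, while slow, which occurs in a single document, is weighted most heavily.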

The DTM tends to get very big, even for normal-sized datasets. Removing sparse terms, that is, terms occurring in only very few documents, is a technique that can reduce the size of the matrix without losing significant relations inherent to the matrix. Let's remove sparse columns from the matrix. We will remove those terms whose columns are at least 99% empty, that is, terms absent from at least 99% of the documents, with the following line of code:

# Removing sparse terms
dtm_train <- removeSparseTerms(dtm_train, 0.99)
inspect(dtm_train)

This will result in the following output:

> inspect(dtm_train)
<<DocumentTermMatrix (documents: 1000, terms: 686)>>
Non-/sparse entries: 23204/662796
Sparsity : 97%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
     Terms
Docs  book can get great just like love one read time
  174    0   0   1     1    1    2    0   2    0    1
  304    0   0   0     0    0    3    0   2    0    0
  355    3   0   0     0    1    1    2   3    1    0
  380    4   1   0     0    1    0    0   1    0    2
  465    5   0   1     1    0    0    0   2    6    0
  618    0   2   3     1    4    1    3   1    0    1
  72     0   0   1     0    2    0    0   1    0    1
  836    1   0   0     0    0    3    0   0    5    1
  866    8   0   1     0    0    1    0   0    4    0
  959    0   0   2     1    1    0    0   2    0    1
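
The filtering rule behind a sparsity threshold can be sketched in base R on a hypothetical presence matrix. A term is kept only if the share of documents that lack it is below the threshold; with 1,000 reviews and a threshold of 0.99, that corresponds roughly to terms appearing in more than 10 reviews. The matrix, the 0.8 threshold, and the variable names below are assumptions chosen to keep the example small.

```r
# Hypothetical presence matrix: 10 documents (rows), 3 terms (columns)
presence <- cbind(common = c(1, 1, 1, 1, 1, 0, 1, 1, 0, 1),
                  medium = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
                  rare   = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0))

max_sparsity <- 0.8  # illustrative threshold, lower than the 0.99 used above

# Share of documents in which each term is absent
term_sparsity <- colMeans(presence == 0)
print(term_sparsity)

# Keep only the terms whose sparsity is below the threshold
kept <- presence[, term_sparsity < max_sparsity, drop = FALSE]
print(colnames(kept))
```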

We now see from the output of the inspect function that the sparsity of the matrix is reduced to 97%, and the number of unigrams (columns of the matrix) is reduced to 686. The DTM is now ready to be used for training with any machine learning classification algorithm. In the next few lines of code, let's divide our DTM into training and test datasets:

# splitting the train and test DTM
dtm_train_train <- dtm_train[1:800, ]
dtm_train_test <- dtm_train[801:1000, ]
dtm_train_train_labels <- as.factor(as.character(text[1:800, ]$Sentiment))
dtm_train_test_labels <- as.factor(as.character(text[801:1000, ]$Sentiment))
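
The split above takes the first 800 rows for training and the last 200 for testing. If a file happens to be ordered, for example by sentiment, a sequential split can leave the two sets with different class balances; whether that applies to this dataset is not verified here. A randomized alternative could look like the following sketch, in which the seed and the names train_idx and test_idx are assumptions.

```r
set.seed(42)  # arbitrary seed, used only to make the split reproducible

n <- 1000                                    # number of reviews in the DTM
train_idx <- sample(n, size = round(0.8 * n))  # 80% of row numbers, at random
test_idx <- setdiff(seq_len(n), train_idx)     # the remaining 20%

# The indices would then replace the fixed ranges, for example:
# dtm_train_train <- dtm_train[train_idx, ]
# dtm_train_test  <- dtm_train[test_idx, ]
# dtm_train_train_labels <- as.factor(as.character(text$Sentiment[train_idx]))
# dtm_train_test_labels  <- as.factor(as.character(text$Sentiment[test_idx]))
```

The results reported in the rest of this section come from the sequential split, so a random split would give somewhat different numbers.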

We will be using a machine learning algorithm called Naive Bayes to create a model. Naive Bayes is generally trained on data with nominal features. The cells in our DTM are numeric and therefore need to be converted to nominal values before the dataset is fed to Naive Bayes. Each cell holds the frequency of a word in a review, and we assume that whether a word occurs matters more for sentiment than how many times it is used. So let's write a function that converts each non-zero cell value to Y and each zero to N, with the following code:

# returns "Y" for counts above zero and "N" otherwise
cellconvert <- function(x) {
  ifelse(x > 0, "Y", "N")
}
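
A quick check on a small, hypothetical count matrix shows what the conversion produces. The sketch redefines the function so that it runs on its own; the matrix counts and the name flags are made up. Applying the function column by column (MARGIN = 2) keeps documents in rows and terms in columns, whereas MARGIN = 1 would return the result transposed.

```r
# Same conversion as in the text: "Y" for counts above zero, "N" otherwise
cellconvert <- function(x) {
  ifelse(x > 0, "Y", "N")
}

# Hypothetical counts: 2 documents (rows), 2 terms (columns)
counts <- matrix(c(0, 3, 1, 0), nrow = 2,
                 dimnames = list(c("doc1", "doc2"), c("book", "read")))

# Convert column by column so the matrix keeps its orientation
flags <- apply(counts, MARGIN = 2, cellconvert)
print(flags)
```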

Now, let's apply the function to every column of the training and test datasets we previously created in this project, with the following code:

# applying the function to every column (MARGIN = 2) of the training and test datasets
dtm_train_train <- apply(dtm_train_train, MARGIN = 2, cellconvert)
dtm_train_test <- apply(dtm_train_test, MARGIN = 2, cellconvert)
# inspecting the train dtm to confirm all is intact
View(dtm_train_train)

This opens the converted training DTM in the data viewer.

We can see from the output that all the cells in the training and test DTMs are now converted to nominal values. Thus, let's proceed to build a text sentiment analysis classifier using the Naive Bayes algorithm from the e1071 library, as follows:

# training the naive bayes classifier on the training dtm
library(e1071)
nb_senti_classifier <- naiveBayes(dtm_train_train, dtm_train_train_labels)
# printing the summary of the model created
summary(nb_senti_classifier)

This will result in the following output:

> summary(nb_senti_classifier)
Length Class Mode
apriori 2 table numeric
tables 686 -none- list
levels 2 -none- character
call 3 -none- call
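
The tables element in the summary holds one entry per term, 686 in total. To give an intuition for how such per-term tables drive a prediction, here is a hand-computed Naive Bayes sketch in base R on a hypothetical six-review training set with two terms. It is not the e1071 implementation; all data and names are assumptions, and no smoothing is applied.

```r
# Hypothetical training data: Y/N presence of two terms, plus the label
train <- data.frame(
  great  = c("Y", "Y", "N", "Y", "N", "N"),
  boring = c("N", "N", "Y", "Y", "Y", "N"),
  label  = c("pos", "pos", "pos", "neg", "neg", "neg")
)

# Class priors: share of each label in the training data
tab <- table(train$label)
prior <- setNames(as.numeric(tab) / sum(tab), names(tab))

# One conditional table per term: P(term value | class)
cond <- lapply(train[c("great", "boring")], function(col) {
  prop.table(table(train$label, col), margin = 1)
})

# Score a new review that contains "great" but not "boring"
new_doc <- c(great = "Y", boring = "N")
score <- sapply(names(prior), function(cls) {
  likelihood <- sapply(names(new_doc), function(term) {
    cond[[term]][cls, new_doc[[term]]]
  })
  prior[[cls]] * prod(likelihood)
})

# Normalize the scores so that they sum to 1
posterior <- score / sum(score)
print(round(posterior, 3))
```

The naive part is the product over terms, which treats word occurrences as independent given the class.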

The preceding summary output shows that the nb_senti_classifier object is successfully created from the training DTM. Let's now use the model object to predict sentiment on the test data DTM. In the following code block, we are instructing that the predictions should be classes and not prediction probabilities:

# making predictions on the test data dtm
nb_predicts <- predict(nb_senti_classifier, dtm_train_test, type = "class")
# printing the predictions from the model
print(nb_predicts)

This will result in the following output:

[1] 1 1 2 1 1 1 1 1 1 2 2 1 2 2 2 2 1 2 1 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 1 1 1 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2 2 1 2 2 2 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 2 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 2 1 2 2 1 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 1 2 2 1 1 1 1 2
Levels: 1 2
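
A long vector of predicted classes is hard to judge by eye. Cross-tabulating predictions against the true labels gives a confusion matrix, from which accuracy follows directly. The sketch below uses short, made-up vectors rather than the actual results; with the objects from this section, the equivalent call would be table(nb_predicts, dtm_train_test_labels).

```r
# Hypothetical predictions and true labels (not the actual model output)
predicted <- factor(c(1, 1, 2, 2, 1, 2, 1, 2), levels = c(1, 2))
actual    <- factor(c(1, 2, 2, 2, 1, 1, 1, 2), levels = c(1, 2))

# Rows are predicted classes, columns are true classes
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```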

With the following code, let us now compute the accuracy of the model using the mmetric function in the rminer library:

# computing accuracy of the model
library(rminer)
# mmetric expects the target values first and the predictions second
print(mmetric(dtm_train_test_labels, nb_predicts, c("ACC")))

This will result in the following output:

[1] 79

We achieved 79% accuracy on the 200 test reviews with a very quick and basic BoW model. The model can be further improved by means of techniques such as parameter tuning, lemmatization, creation of new features, and so on.
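
As one example of creating new features, bigrams can capture short phrases that unigrams miss, such as a negation followed by a positive word. The following base R sketch shows only the extraction step on an already tokenized review; the function name bigrams is hypothetical, and wiring such a tokenizer into the DTM through the tokenize control option is left out here.

```r
# Hypothetical helper: pair every word with the word that follows it
bigrams <- function(words) {
  if (length(words) < 2) {
    return(character(0))
  }
  paste(head(words, -1), tail(words, -1))
}

words <- c("not", "a", "great", "book")
pairs <- bigrams(words)
print(pairs)
```

As noted earlier, adding bigrams increases the number of DTM columns considerably, so sparse term removal becomes even more important.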
