Building a sentiment classifier

At the beginning of this chapter, we devoted a section to understanding kernel density estimation and how it can be leveraged to approximate the probability density function of samples drawn from a random variable. We are going to put it to use in this section.

We have one set of tweets labeled positive and another set labeled negative. The idea is to learn the PDFs of these two datasets independently using kernel density estimation.

From Bayes' rule, we know that

P(label | x) = P(x | label) * P(label) / P(x)

Here, P(x | label) is the likelihood, P(label) is the prior, and P(x) is the evidence. The label can be either positive sentiment or negative sentiment.

Using the PDF learned through kernel density estimation, we can easily calculate the likelihood, P(x | label).

From our class distribution, we know the prior, P(label).

For any new tweet, we can now use Bayes' rule to calculate:

P(label = Positive | words and their Delta TFIDF weights)
P(label = Negative | words and their Delta TFIDF weights)

Voilà, we have assembled our sentiment classifier based on kernel density estimation.
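Before turning to a package, it may help to see the mechanics on a toy example. The following is only a minimal sketch of the idea, not the method we build later: pos.scores, neg.scores, and x.new are made-up values for a single feature, and we use base R's density and approx functions to read the likelihood of a new point off each class-conditional KDE.

# Hypothetical one-dimensional feature values for each class
set.seed(42)
pos.scores <- rnorm(100, mean = 1)
neg.scores <- rnorm(100, mean = -1)

# Class-conditional PDFs estimated by kernel density estimation
pos.kde <- density(pos.scores)
neg.kde <- density(neg.scores)

# Likelihood of a new observation under each class, interpolated from the KDE
x.new <- 0.8
lik.pos <- approx(pos.kde$x, pos.kde$y, xout = x.new)$y
lik.neg <- approx(neg.kde$x, neg.kde$y, xout = x.new)$y

# Priors from the class distribution; uniform in this toy setup
prior.pos <- prior.neg <- 0.5

# Unnormalized posteriors; the evidence P(x) cancels when comparing classes
post.pos <- lik.pos * prior.pos
post.neg <- lik.neg * prior.neg
ifelse(post.pos > post.neg, "Positive", "Negative")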

The naivebayes package provides a function, naive_bayes, in which we can use a kernel to build the classifier we have proposed.

Before we jump into our use case, let us quickly see how KDE can be leveraged to classify the Iris dataset. More information about this dataset is available at https://archive.ics.uci.edu/ml/datasets/iris.

Iris dataset and KDE classifier:

library(naivebayes)
data(iris)

# Fit a naive Bayes model, using KDE for the class-conditional likelihoods
iris.model <- naive_bayes(x = iris[,1:4], y = iris$Species, usekernel = TRUE)

# Plot the estimated class-conditional densities for each feature
plot(iris.model)

We pass the first four columns as our features, x, and the Species column as our y variable. There are three classes in this dataset and a total of 150 records, 50 per class.

Let us look at the plot of our model:

We get a plot for every column in the Iris dataset. Let us look at the Petal.Width column. There are three density plots, one per class, representing P(Petal.Width | setosa), P(Petal.Width | versicolor), and P(Petal.Width | virginica). Each of them represents the underlying distribution of Petal.Width for one of the classes.

Other plots can be interpreted similarly. Using these PDFs, we can now classify the data into one of the three classes. Hopefully, this gives an idea of how the underlying distributions discovered by KDE can be used to separate records into their respective classes.
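To see the classifier in action, we can call the standard predict function on the fitted model. A quick sketch, predicting back on the training data purely as a sanity check rather than a proper evaluation:

# Predict species for the same records and tabulate against the truth
iris.preds <- predict(iris.model, newdata = iris[,1:4], type = "class")
table(Predicted = iris.preds, Actual = iris$Species)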

We are going to apply the same principle to our tweets.

Let us proceed to build our classifier:

# Naive Bayes on the Delta TFIDF document-term matrix, with KDE likelihoods
model <- naive_bayes(x = dtm.mat, y = tweet.final$sentiment, usekernel = TRUE)

We use the naivebayes R package; its naive_bayes function builds the model. You can see that we have set the usekernel parameter to TRUE. This tells the function to use KDE for calculating the likelihoods.
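One knob worth knowing about: when usekernel = TRUE, naive_bayes forwards additional arguments to R's density function, so the kernel and its bandwidth can be tuned. A hedged sketch; the adjust value below is arbitrary and purely illustrative:

# Same model, but with a wider bandwidth; adjust is passed through
# to density() and scales its default bandwidth
model.smooth <- naive_bayes(x = dtm.mat, y = tweet.final$sentiment,
                            usekernel = TRUE, adjust = 2)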

To understand the model constructed, you can run the following:

str(model)

This will help you view the various properties of this model.
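For instance, the fitted class-conditional densities live in the tables component of the model object, and the estimated priors in the prior component. A quick way to inspect them, assuming the model was built as above:

# Class-conditional KDEs for the first feature, one density per label
model$tables[[1]]

# Prior probabilities estimated from the class distribution
model$prior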

Having built the model, we can use the standard predict function to predict the label for unknown tweets.

The prediction using our model is as follows:

# Predict a sentiment label for each tweet (here, back on the training matrix)
preds <- predict(model, newdata = dtm.mat, type = "class")

# Evaluate the predictions with a confusion matrix
library(caret)
confusionMatrix(preds, tweet.final$sentiment)

Let us look at the confusion matrix output:

Confusion Matrix and Statistics

          Reference
Prediction Negative Positive
  Negative      128       58
  Positive       72      110

               Accuracy : 0.6467
                 95% CI : (0.5955, 0.6956)
    No Information Rate : 0.5435
    P-Value [Acc > NIR] : 3.744e-05

                  Kappa : 0.2928
 Mcnemar's Test P-Value : 0.2542

            Sensitivity : 0.6400
            Specificity : 0.6548
         Pos Pred Value : 0.6882
         Neg Pred Value : 0.6044
             Prevalence : 0.5435
         Detection Rate : 0.3478
   Detection Prevalence : 0.5054
      Balanced Accuracy : 0.6474

       'Positive' Class : Negative

We have an accuracy of about 65%. Can we understand why we ended up at this number? A good way to investigate is to look at the features fed to the model, in this case, the words. Looking at the PDFs of the words under both the positive and the negative sentiment classes should throw some light on the model's performance.

Let us look at some of the variables and their PDFs under the positive and negative sentiment classes:
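These per-word density plots can be produced with the same plot method we used for the Iris model. The indices below are arbitrary, standing in for whichever words you want to inspect:

# Overlay the positive and negative class densities for a few words
plot(model, which = 1:4)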

The graph shows a subset of the words used for classification. It is evident that the underlying PDFs estimated by kernel density estimation are not very different between the positive and negative classes. The PDFs for positive and negative are represented by the red and green lines, respectively. In most cases, the two PDFs overlap each other almost completely. Compare these plots with the plot we generated for the Iris dataset, where the PDFs were distinctly separated. Hence the model does not have much classification power.

What can we do to improve the model?

  1. Collect more data.
  2. To create our training set, we took an unsupervised approach, using a dictionary to assign the sentiment labels. You can try other lexicons.
  3. Once the dictionary-based approach returns its results, we can manually curate them to make the training labels more robust.

That brings us to the end of this section. We have shown you a simple Naive Bayes classifier that uses KDE estimates to calculate the likelihoods. You can go ahead and apply other classification methods and compare the results. Another good exercise would be to compare plain TFIDF features with the Delta TFIDF features.

 
