Dictionary-based scoring

As described in the steps we outlined for our approach, let us use a sentiment dictionary to score the tweets we extracted earlier. We will leverage the sentimentr R package to assign sentiment scores to the tweets we have collected.

Let us see how to score using the sentiment function from the sentimentr package: 

> library(sentimentr, quietly = TRUE)
> sentiment.score <- sentiment(tweet.df$text)
> head(sentiment.score)
   element_id sentence_id word_count  sentiment
1:          1           1          8  0.0000000
2:          2           1          8  0.3535534
3:          3           1          3  0.0000000
4:          3           2          4  0.0000000
5:          3           3          7  0.0000000
6:          4           1         14 -0.8418729

The sentiment function in sentimentr calculates a polarity score for each sentence, typically falling between -1 and 1. In fact, if a tweet has multiple sentences, it will calculate a score for each sentence individually. A score close to -1 indicates that the sentence has very negative polarity, a score close to 1 means that the sentence is very positive, and a score of 0 means the sentence is neutral.
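
To see this in action, here is a small illustrative call (the exact values depend on the lexicon version bundled with your installation, so they are not shown here):

> # Illustrative only; the second string is split into two sentences,
> # each of which is scored separately
> sentiment(c("I love this product.", "This is terrible. The delivery was fine."))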

Refer to https://cran.r-project.org/web/packages/sentimentr/index.html for more details about the sentimentr package.

However, we need the score to be at the tweet level and not at the sentence level. We can take the average of the scores across all the sentences in a tweet.

Calculate the average value of sentiment.scores for each tweet:

> library(dplyr, quietly = TRUE)
> sentiment.score <- sentiment.score %>% group_by(element_id) %>% summarise(sentiment = mean(sentiment))
> head(sentiment.score)
# A tibble: 6 x 2
  element_id  sentiment
       <int>      <dbl>
1          1  0.0000000
2          2  0.3535534
3          3  0.0000000
4          4 -0.4209365
5          5  0.0000000
6          6  0.0000000

Here, the element_id refers to the individual tweet. By grouping by element_id and calculating the average, we can get the sentiment score at the tweet level.
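
As an aside, sentimentr also provides a convenience function, sentiment_by, that performs this grouping and averaging in a single call. Note that its default aggregation down-weights zero-valued sentences, so its numbers can differ slightly from the plain mean we computed above:

> # One-step, per-tweet aggregation (default averaging down-weights zeros)
> head(sentiment_by(tweet.df$text))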

We now have the scores for each tweet.

Let us now add the sentiment to our original tweet.df data frame. Going forward, we only need the text and its sentiment, so let us subset those columns from tweet.df:

> tweet.df$polarity <- sentiment.score$sentiment
> tweet.final <- tweet.df[,c('text','polarity')]

We have our dataset prepared now, with the tweet text and the sentiment score.

Our sentiment score is still a real value. Let us convert it to a categorical variable.

We first remove all records with a polarity value of 0; these are the records with neutral sentiment. If the polarity is less than zero, we mark the tweet as Negative; otherwise, we mark it as Positive:

> tweet.final <- tweet.final[tweet.final$polarity != 0, ]
> tweet.final$sentiment <- ifelse(tweet.final$polarity < 0, "Negative","Positive")
> tweet.final$sentiment <- as.factor(tweet.final$sentiment)
> table(tweet.final$sentiment)

Negative Positive
     200      168

Using ifelse, we have discretized the real-valued sentiment score into a binary variable and stored it as a factor. With that, we have the training data ready. Finally, the table command shows us the class distribution. Our class distribution is imbalanced.

Let us say we are building a discriminative model such as logistic regression or an SVM, where the model tries to learn a boundary between the classes. Such a model works best with a roughly equal number of positive and negative examples. If one class substantially outnumbers the other in the dataset, we have a class imbalance problem.

There are several techniques to balance a dataset, including downsampling, upsampling, and SMOTE. We will leverage the upSample function from the caret package to create more records in the minority class. This should produce better results in the classification models.

Manage the class distribution:

> library(caret, quietly = TRUE)
> tweet.balanced <- upSample(x = tweet.final$text, y = tweet.final$sentiment)
> names(tweet.balanced) <- c('text', 'sentiment')
> table(tweet.balanced$sentiment)

Negative Positive
     200      200

The upSample function looks at the class distribution and samples with replacement from the minority class (here, Positive) until the classes are balanced. The final table command shows the new class distribution.
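
If we preferred to shrink the majority class instead, caret also provides a downSample function with the same interface. A minimal sketch (tweet.down is just an illustrative name):

> # Alternative: sample the majority (Negative) class down to the minority size
> tweet.down <- downSample(x = tweet.final$text, y = tweet.final$sentiment)
> names(tweet.down) <- c('text', 'sentiment')
> table(tweet.down$sentiment)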

Refer to the caret R package for class imbalance problems: https://topepo.github.io/caret/subsampling-for-class-imbalances.html

In this chapter, we are going to learn the distributions of the positive and negative classes independently and use them for our predictions. Hence, we don't need to balance the classes.

Finally, we add an ID column to our dataset:

> tweet.final$id <- seq(1, nrow(tweet.final))

Before we move on to the next section, let us spend some time understanding the inner workings of our dictionary-based sentiment function. The sentiment function utilizes a sentiment lexicon (Jockers, 2017) from the lexicon package. It preprocesses the given text as follows:

  • Paragraphs are split into sentences
  • Sentences are split into words
  • All punctuation is removed except commas, semicolons, and colons
  • Finally, words are stored as tuples, for example, w_{5,2,3} means the third word in the second sentence of the fifth paragraph

Each word is looked up in the lexicon, and positive and negative words are tagged with +1 and -1, respectively. Let us call the words that receive a score polarized words. Not all words receive a score; only those found in the lexicon do. We can pass a custom lexicon to the sentiment function through the polarity_dt parameter.
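
For instance, a small custom lexicon can be built with sentimentr's as_key helper and supplied via polarity_dt; the words and scores below are made up purely for illustration:

> # Build a tiny custom polarity key (illustrative words and scores)
> my.key <- as_key(data.frame(words = c("awesome", "awful"),
+                             polarity = c(1, -1),
+                             stringsAsFactors = FALSE))
> sentiment("The movie was awesome.", polarity_dt = my.key)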

For each of the polarized words, the n words before it and the n words after it are considered; together, these are called the polarized context cluster. The window size can be set by the user through the n.before and n.after parameters. The words in the polarized context cluster can be tagged as any of the following:

  • Neutral
  • Negator
  • Amplifier
  • De-amplifier
  • Adversative conjunctions

A dictionary of these words can be passed through the valence_shifters_dt parameter. The neighboring words are tagged by looking them up in this dictionary. The weights for amplifiers and adversative conjunctions are passed through the amplifier.weight and adversative.weight parameters.

Each polarized word is then weighted based on its score in polarity_dt, and further weighted based on the surrounding valence shifters tagged as negators, amplifiers, de-amplifiers, or adversative conjunctions. Words tagged as neutral carry no weight.
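
The effect of these valence shifters is easy to observe. Compare a plain sentence against negated and amplified variants (again, the exact values depend on the bundled lexicon):

> # The negator flips, and the amplifier boosts, the polarity of "happy"
> sentiment(c("I am happy.", "I am not happy.", "I am very happy."))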

For more details about the weight and scoring, refer to R help for the sentiment function.
