Now that we are equipped with the key terms and concepts from the world of Sentiment Analysis, let us put our theory to the test. We have seen some major application areas for Sentiment Analysis and the general challenges in performing such analytics. In this section we will perform Sentiment Analysis in two parts: polarity analysis and classification-based algorithms.
R has a very robust library for the extraction and manipulation of information from Twitter called twitteR. As we saw in the previous chapter, we first need to create an application using Twitter's application management console before we can use twitteR or any other library for sentiment analysis. For this chapter, we will be reusing the application from the previous chapter (keep your application keys and secrets handy). Also, in the coming sections, we will be utilizing our code from previous chapters in a more structured format to enable reuse and to follow #bestCodingPractices.
Before we begin our analysis, let us first restructure our existing code and write some helper functions, which will come in handy later on. As we know, data from Twitter can be extracted using search terms or from a user's timeline. The following two helper functions help us to do exactly the same tasks in a reusable fashion:
#extract search tweets
extractTweets <- function(searchTerm,tweetCount){
  # search term tweets
  tweets = searchTwitter(searchTerm,n=tweetCount)
  tweets.df = twListToDF(tweets)
  tweets.df$text <- sapply(tweets.df$text,function(x) iconv(x,to='UTF-8'))
  return(tweets.df)
}

#extract timeline tweets
extractTimelineTweets <- function(username,tweetCount){
  # timeline tweets
  twitterUser <- getUser(username)
  tweets = userTimeline(twitterUser,n=tweetCount)
  tweets.df = twListToDF(tweets)
  tweets.df$text <- sapply(tweets.df$text,function(x) iconv(x,to='UTF-8'))
  return(tweets.df)
}
The function extractTweets takes the search term and number of tweets to be extracted as inputs and returns the data in a data frame which contains text converted to UTF-8 encoding. Similarly, the function extractTimelineTweets takes the username and number of tweets as inputs and returns data in a data frame with the text converted to UTF-8 encoding. Therefore, the preceding two functions will help us to extract tweets multiple times (based on different search terms or users) without rewriting the same lines of code again and again.
Continuing with the same theme, we will write another helper function to clean and transform our data set. As we saw in the previous chapter, R's tm library provides us with various utility functions to quickly clean and transform a text corpus. In this function, we will make use of tm_map to transform our tweets:
# clean and transform tweets
transformTweets <- function(tweetDF){
  tweetCorpus <- Corpus(VectorSource(tweetDF$text))
  tweetCorpus <- tm_map(tweetCorpus, tolower)
  # remove URLs first, while their punctuation is still there to match on
  removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
  tweetCorpus <- tm_map(tweetCorpus, removeURL)
  tweetCorpus <- tm_map(tweetCorpus, removePunctuation)
  tweetCorpus <- tm_map(tweetCorpus, removeNumbers)
  # remove stop words
  twtrStopWords <- c(stopwords("english"),'rt','http','https')
  tweetCorpus <- tm_map(tweetCorpus, removeWords, twtrStopWords)
  # re-wrap the transformed text as plain text documents
  tweetCorpus <- tm_map(tweetCorpus, PlainTextDocument)
  #convert back to dataframe
  tweetDataframe <- data.frame(text=unlist(sapply(tweetCorpus, `[`, "content")),
                               stringsAsFactors=F)
  #split each doc into words (note the escaped backslash in the pattern)
  splitText <- function(x) {
    word.list = str_split(x, '\\s+')
    words = unlist(word.list)
    return(words)
  }
  # attach list of words to the data frame
  tweetDataframe$wordList = sapply(
    tweetDataframe$text,
    function(text) splitText(text))
  return (tweetDataframe)
}
In addition to the usual transformations, such as stop word removal, change to lower case, punctuation removal, and so on, the function transformTweets tokenizes each tweet at word level and attaches the list of words in each tweet to the object. Also, the function returns the transformed tweets in a data frame for further manipulation.
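To make the effect of these transformations concrete without needing a Twitter connection or the tm package, here is a minimal base-R sketch of the same sequence of steps applied to a single made-up tweet. The function name cleanTweet and its tiny stop-word list are hypothetical stand-ins for illustration only; they are not part of tm and do not reproduce its behavior exactly (for instance, punctuation is replaced by a space here rather than simply deleted).

```r
# Base-R sketch of the cleaning steps; cleanTweet and its stop-word list are
# hypothetical stand-ins for tm's transformations, not the tm implementation.
cleanTweet <- function(text, stopWords = c("the", "is", "a", "rt")) {
  text <- tolower(text)
  text <- gsub("http[^[:space:]]*", " ", text)    # drop URLs before punctuation
  text <- gsub("[[:punct:]]+", " ", text)          # drop punctuation
  text <- gsub("[[:digit:]]+", " ", text)          # drop numbers
  words <- unlist(strsplit(trimws(text), "\\s+"))  # tokenize on whitespace
  words[!(words %in% stopWords) & nzchar(words)]   # drop stop words and empties
}

sampleWords <- cleanTweet("RT The launch is AMAZING!!! 100 http://t.co/abc123")
print(sampleWords)
```

Only the words launch and amazing survive, which is the kind of word list the scoring step works on.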
Polarity, as discussed in the section Key Concepts, is the positive, negative or neutral classification of the piece of text in consideration. The class labels may change depending upon the context (liked versus disliked or favorable versus unfavorable). Polarity may also have a degree attached to it which places the analyzed text on a continuous (or discrete) scale of polarities (say from -5 to 5). This degree of polarity helps us analyze the extent (or degree) of positivity (or negativity) in the text. This is particularly useful in comparative studies as we have the opportunity to view analyzed text with reference to certain benchmarks.
In this section, we will analyze tweets and score each of them based on the polar words identified in each of the tweets. The simple and easy-to-code algorithm is outlined in the following steps:
The preceding steps are represented diagrammatically as follows:
Once each tweet in the dataset has been scored, we may aggregate the scores to understand the overall sentiment distribution related to the search terms or Twitter handle. Positive values denote a positive sentiment, with larger numbers denoting a greater degree of positivity; the same holds for negative values and negative sentiment. A neutral stance is represented by a score of 0. For example, "This car is amazingly fast and beautiful" has a greater degree of positivity than "This is a nice car", though both are positive sentences.
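The scoring function scoreTweet is called by analyzeTrendSentiments later in this section, but its definition does not appear in the text here. A minimal sketch, assuming the score is simply the number of positive words minus the number of negative words in a tweet, could look like the following. The short word vectors are toy stand-ins for the opinion lexicon files, and the default arguments are an assumption made so that the function can be called with a single word list.

```r
# Hypothetical scoreTweet: (# positive words) - (# negative words).
# By default it looks up the global pos.words / neg.words vectors.
scoreTweet <- function(words, positive = pos.words, negative = neg.words) {
  sum(words %in% positive) - sum(words %in% negative)
}

# Toy lexicon standing in for positive-words.txt / negative-words.txt
pos.words <- c("amazingly", "fast", "beautiful", "nice")
neg.words <- c("slow", "ugly", "terrible")

scoreStrong  <- scoreTweet(c("this", "car", "is", "amazingly", "fast", "and", "beautiful"))
scoreMild    <- scoreTweet(c("this", "is", "a", "nice", "car"))
scoreNeutral <- scoreTweet(c("this", "is", "a", "car"))
print(c(strong = scoreStrong, mild = scoreMild, neutral = scoreNeutral))
```

With this toy lexicon the two example sentences score 3 and 1, matching the intuition that the first is the more strongly positive of the two.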
Let us use this algorithm to analyze sentiments using search terms and Twitter handles. As discussed previously, opinion mining has become essential, not just for brands but for governments as well. Every entity out there wants to gauge how its target audience feels about it and its initiatives, and governments are no exception. Of late, the Indian Government has been utilizing Twitter and other social media platforms effectively to reach its audience and make them aware about its initiatives and policies. One such initiative is the recently launched Make in India initiative. Consider a scenario where one is tasked with analyzing the effectiveness of and public opinion related to such an initiative. To analyze public opinion, which changes dynamically over time, Twitter would be a good choice. So, to analyze sentiments for the Make in India initiative, let us analyze some tweets.
As previously outlined, we start by connecting to Twitter and extracting tweets related to the search term Make In India. This is followed by the preprocessing step, where we remove stop words, URLs, and so on to transform the tweets into a usable format. We also tokenize each tweet into a list of constituent words for use in the coming steps. Once our dataset is ready and in a consumable format, we load the precompiled list of positive and negative words. The list is available from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html.
We first write a reusable analyzeTrendSentiments function which takes the search term and number of tweets to be extracted as inputs. It makes use of the functions extractTweets and transformTweets to get the job done:
analyzeTrendSentiments <- function(search,tweetCount){
  #extract tweets
  tweetsDF <- extractTweets(search,tweetCount)
  # transformations
  transformedTweetsDF <- transformTweets(tweetsDF)
  #score the words
  transformedTweetsDF$sentiScore = sapply(transformedTweetsDF$wordList,
                                          function(wordList) scoreTweet(wordList))
  transformedTweetsDF$search <- search
  return(transformedTweetsDF)
}
We then use the function analyzeTrendSentiments to get a data frame consisting of tweets scored using a precompiled list of polar words. We use the twitteR, ggplot2, stringr and tm libraries as well:
library(twitteR)
library(stringr)
library(tm)
library(ggplot2)

consumerSecret = "XXXXXXXXXX"
consumerKey = "XXXXXXXXXXXXXXXXXXXXXXXXX"
setup_twitter_oauth(consumer_key = consumerKey,consumer_secret = consumerSecret)

# list of positive/negative words from opinion lexicon
pos.words = scan(file= 'positive-words.txt', what='character', comment.char=';')
neg.words = scan(file= 'negative-words.txt', what='character', comment.char=';')

#extract 1500 tweets on the given topic
makeInIndiaSentiments <- analyzeTrendSentiments("makeinindia",1500)

#plot the aggregated scores on a histogram
qplot(makeInIndiaSentiments$sentiScore)
In the last chapter, we learned and used different visualizations to grasp the insights hidden in our analysis. Continuing with the same thought process, we generate a histogram of aggregated scores. The visualization looks like this:
The histogram is easy to interpret. It shows the tweets distributed across a polarity scale on the x-axis and frequency of tweets on the y-axis. The results show a roughly bell-shaped distribution with a general tilt towards the positive side. It seems the initiative is getting a positive response from its audience.
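Alongside the histogram, a numeric summary can make the tilt easier to state. The sketch below is one possible way to do this in base R; the scores are made-up toy values standing in for makeInIndiaSentiments$sentiScore, not actual results.

```r
# Toy scores standing in for makeInIndiaSentiments$sentiScore
sentiScore <- c(-2, -1, 0, 0, 0, 1, 1, 2, 3, 1)

# Collapse each score to its sign and label the three polarity classes
polarity <- factor(sign(sentiScore),
                   levels = c(-1, 0, 1),
                   labels = c("negative", "neutral", "positive"))

polarityShare <- prop.table(table(polarity))  # share of tweets per class
meanScore <- mean(sentiScore)                 # overall tilt of the scores
print(round(polarityShare, 2))
print(meanScore)
```

A mean score above zero, together with a larger share of positive than negative tweets, is what a tilt towards the positive side looks like in numbers.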
Going a bit deeper into the analysis itself, let us analyze the sentiments for the same search term and see how the opinions change over time.
The tweets for this analysis were extracted on the day the initiative was launched as well as a day later. Your results may vary due to the dynamic nature of Twitter. You may observe a difference in outcomes across other examples in this chapter as well. We urge you to be creative and try out other trending topics while working through examples from this chapter.
The output looks like this:
The preceding two histograms show a shift in opinions over the course of two days. If you were following the news at the time, in one of the events for this initiative a sudden fire broke out and burnt the whole stage. The graph on top is based upon tweets after the fire broke out while the graph labeled makeinindia_yday refers to the tweets from the day before. Though the shift in sentiments isn't drastic, it is clearly visible that the shift has been more towards the positive side (some tweets are even hitting a score of 6+). Could this be because tweeple started praising the emergency teams and police for their quick action in preventing casualties? Well, it looks like Twitter isn't just about people ranting on random stuff!
World leaders
Twitter has caught the frenzy of celebrities and politicians alike. As a quick exercise, try analyzing tweets from the Twitter handles of world leaders such as @potus, @pmoindia, and @number10gov to see what kind of opinions our leaders project through Twitter. Don't be surprised if their timelines are neutral...oops, diplomatic!
A classification problem requires the labeling of input data into required classes based on some defined characteristics of each class (see Chapter 2, Let's Help Machines Learn, for details). In the case of sentiment analysis, the classes are positive and negative (or neutral in certain cases). We have learned about different classification algorithms and seen how they are used across domains to solve categorization and classification problems in the previous chapters, and sentiment analysis is yet another domain where these algorithms are highly useful.
In this section, we will perform opinion mining using classification algorithms such as SVM and boosting. We will also touch upon ensemble methods and see how they help to improve performance. Note that, for this section, we will concentrate only on the positive and negative polarities, but the approach is generic enough to be easily extended to include the neutral polarity as well.
Since this is a supervised learning approach, we require labeled data for training and testing the performance of our algorithms. For the purpose of this chapter, we will utilize a labeled dataset from http://www.sentiment140.com/. It contains tweets labeled as 0, 2, and 4 for negative, neutral and positive sentiments, respectively. There are various attributes such as tweet ID, date, search query, username, and the tweet text, apart from the sentiment label. For our case we will be considering only the tweet text and its corresponding label.
Another source of labeled tweets is available at https://github.com/guyz/twitter-sentiment-dataset. This source makes use of a python script to download around 5000 labeled tweets keeping Twitter API guidelines in mind.
Before we get into the algorithm-specific details, let us look into the labeled dataset and perform the initial steps of collecting and transforming our data into the required forms. We will make use of libraries such as caret and RTextTools for these steps.
As mentioned previously, the dataset contains polarities labeled as 0, 2, and 4 for negative, neutral, and positive. We will load the csv file in R and apply a quick transformation to change the labels to positive and negative. Once the polarities have been transformed into intelligible names, we will filter out the rows of data containing neutral sentiments. Also, we will keep only the columns for polarity and tweet text, and remove the rest.
# load labeled dataset
labeledDSFilePath = "labeled_tweets.csv"
labeledDataset = read.csv(labeledDSFilePath, header = FALSE)

# transform polarity labels
labeledDataset$V1 = sapply(labeledDataset$V1,
                           function(x)
                             if(x==4) x <- "positive"
                             else if(x==0) x<-"negative"
                             else x<- "none")

#select required columns only
requiredColumns <- c("V1","V6")

# extract only positive/negative labeled tweets
tweets<-as.matrix(labeledDataset[labeledDataset$V1 %in% c("positive","negative"),
                                 requiredColumns])
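As an aside, the same label transformation can be written as a vectorized lookup instead of an element-wise if/else. The sketch below uses a small toy vector in place of the V1 column, and the names rawLabels and labelMap are hypothetical. One difference from the listing above is worth noting: any code other than 0, 2, or 4 would map to NA here rather than to "none".

```r
# Toy stand-in for the first column (V1) of the labeled dataset
rawLabels <- c(0, 4, 2, 4, 0)

# Named lookup vector: names are the numeric codes, values the readable labels
labelMap <- c("0" = "negative", "2" = "none", "4" = "positive")
mappedLabels <- unname(labelMap[as.character(rawLabels)])

# Keep only the rows with a positive or negative label
keep <- mappedLabels %in% c("positive", "negative")
print(mappedLabels[keep])
```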
The tweets object is now available as a matrix with each row representing a tweet, and with columns referring to polarity and tweet text. Before we transform this matrix into the formats required by the classification algorithms, we need to split our data into training and testing datasets (see Chapter 2, Let's Help Machines Learn, for more on this). Since both the training and testing datasets should contain a good enough distribution of samples of all classes for the purposes of training and testing, we use the createDataPartition function available from the caret package. For our use case, we split our data into 70/30 training and testing datasets:
indexes <- createDataPartition(tweets[,1], p=0.7, list = FALSE)
train.data <- tweets[indexes,]
test.data <- tweets[-indexes,]
We perform a quick check to see how our data is split across the positive and negative classes in our original dataset, and the training and testing datasets. You will see the result in the following screenshot:
As we can see, createDataPartition has done a nice job of maintaining a similar sentiment distribution across the training and testing datasets.
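The quick check itself is not listed in the text, so the following is only a sketch of how such a comparison could be done. To keep it runnable without caret, it uses toy labels and a hand-rolled stratified 70/30 sample; with caret loaded, the indexes returned by createDataPartition would play the role of idx. The helper classShare is a hypothetical name.

```r
set.seed(42)
# Toy labels: 60 positive and 40 negative tweets
labels <- rep(c("positive", "negative"), times = c(60, 40))

# Hand-rolled stratified split: sample 70% of the indices within each class
idx <- unlist(lapply(split(seq_along(labels), labels),
                     function(i) sample(i, size = round(0.7 * length(i)))))

# Share of each class in a vector of labels
classShare <- function(x) {
  sapply(c("negative", "positive"), function(l) mean(x == l))
}

splitSummary <- rbind(original = classShare(labels),
                      training = classShare(labels[idx]),
                      testing  = classShare(labels[-idx]))
print(splitSummary)
```

Because the sampling is done per class, the three rows show the same class proportions, which is the property the stratified split is meant to preserve.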
Next in the line of transformations is the Document Term Matrix transformation. As we have seen in Chapter 7, Social Media Analysis – Analyzing Twitter Data, a document term matrix transforms a given dataset into rows representing the documents and columns of terms (words/sentences). Unlike the previous chapter, where we used the tm library's DocumentTermMatrix function for transformation and applied various transformations using tm_map, for the current use case we will use the create_matrix function from the RTextTools library. This function is an abstraction over tm's corresponding functions. We will also assign a weight to each term using TF-IDF (term frequency-inverse document frequency) as our feature. The create_matrix method also takes care of splitting sentences into words, removing stop words and numbers, and stemming the remaining words. Here's how you do it:
train.dtMatrix <- create_matrix(train.data[,2],
                                language="english",
                                removeStopwords=TRUE,
                                removeNumbers=TRUE,
                                stemWords=TRUE,
                                weighting = tm::weightTfIdf)

test.dtMatrix <- create_matrix(test.data[,2],
                               language="english",
                               removeStopwords=TRUE,
                               removeNumbers=TRUE,
                               stemWords=TRUE,
                               weighting = tm::weightTfIdf,
                               originalMatrix=train.dtMatrix)

test.data.size <- nrow(test.data)
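To see what TF-IDF weighting does to a document term matrix, here is a small worked example in base R on three made-up documents. It is a sketch under a stated assumption: each weight is the term's frequency relative to the document length, multiplied by log2(N / df), where N is the number of documents and df the number of documents containing the term. This is, to our understanding, the normalized scheme that tm::weightTfIdf applies by default, but the code below does not call tm and should not be read as its implementation.

```r
# Three toy documents, already tokenized
docs <- list(c("good", "phone", "good"),
             c("bad", "phone"),
             c("good", "day"))
terms <- sort(unique(unlist(docs)))

# Term counts: one row per document, one column per term
counts <- t(sapply(docs, function(d) tabulate(match(d, terms), nbins = length(terms))))
colnames(counts) <- terms

tf    <- counts / rowSums(counts)                 # length-normalized term frequency
idf   <- log2(nrow(counts) / colSums(counts > 0)) # rarer terms get larger weights
tfidf <- sweep(tf, 2, idf, `*`)                   # scale each term column by its idf
print(round(tfidf, 3))
```

The term bad, which occurs in a single document, ends up with a higher weight than phone, which occurs in two of the three.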
The create_matrix method in RTextTools v1.4.2 has a small bug which prevents weight assignment when using the originalMatrix option. The following small hack can be used to fix the issue until the library gets updated:
> trace("create_matrix",edit=T)
Scroll to line 42 and update Acronym to acronym.
Check the following links for more details and alternate ways of handling this issue:
https://github.com/timjurka/RTextTools/issues/4
http://stackoverflow.com/questions/16630627/recreate-same-document-term-matrix-with-new-data
Now that we have both the training and testing datasets in the DocumentTermMatrix format, we can proceed towards the classification algorithms and let our machines learn and build sentiment classifiers!
Support Vector Machines, or SVM as they are commonly known, are one of the most versatile classes of supervised learning algorithms for classification. An SVM builds a model in such a way that the data points belonging to different classes are separated by a clear gap, which is optimized such that the distance of separation is the maximum possible. The samples on the margins are called the support vectors, which are separated by a hyperplane (see Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics for more details).
Since our current use case for sentiment analysis is also a binary (positive and negative) classification problem, SVM helps us build a model using the training dataset, which separates tweets into positive and negative sentiment classes, respectively.
We will use the svm function from the e1071 library to build a sentiment classifier. We start off with the default values for the SVM classifier available from the library and follow the same iterative procedure we did in Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, to finally arrive at the best classifier. The following snippet of code builds a sentiment classifier using the default values and then prints a confusion matrix, along with other statistics for evaluation:
library(e1071)  # svm, tune
library(caret)  # confusionMatrix

svm.model <- svm(train.dtMatrix, as.factor(train.data[,1]))

## view initial model details
summary(svm.model)

## predict and evaluate results
svm.predictions <- predict(svm.model, test.dtMatrix)
true.labels <- as.factor(test.data[,1])
confusionMatrix(data=svm.predictions,
                reference=true.labels,
                positive="positive")
The confusion matrix generated as follows shows that the classifier has just 50% accuracy, which is as bad as a coin toss, with no predictions for negative sentiments whatsoever! It seems like the classifier couldn't infer or learn much from the training dataset.
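To see why a classifier that never predicts the negative class lands at exactly 50% accuracy on a balanced test set, the headline statistics can be computed by hand from a confusion matrix. The sketch below uses toy labels (five tweets per class) rather than the actual predictions, and plain base R instead of caret's confusionMatrix.

```r
# Toy data mimicking a model that always answers "positive"
classLevels <- c("negative", "positive")
truth <- factor(rep(c("positive", "negative"), each = 5), levels = classLevels)
pred  <- factor(rep("positive", 10), levels = classLevels)

cm <- table(predicted = pred, actual = truth)
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["positive", "positive"] / sum(cm[, "positive"])  # true positive rate
specificity <- cm["negative", "negative"] / sum(cm[, "negative"])  # true negative rate
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Accuracy alone hides the problem: the sensitivity is perfect while the specificity is zero, which is why it pays to read the other statistics printed next to the confusion matrix.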
To build a better-performing model, we will now go under the hood and tweak some parameters. The svm implementation from e1071 provides us with a wonderful utility called tune to obtain the optimized values of hyperparameters using a grid search over the given parameter ranges:
## hyperparameter optimizations

# run grid search
cost.weights <- c(0.1, 10, 100)
gamma.weights <- c(0.01, 0.25, 0.5, 1)
tuning.results <- tune(svm, train.dtMatrix, as.factor(train.data[,1]),
                       kernel="radial",
                       ranges=list(cost=cost.weights, gamma=gamma.weights))

# view optimization results
print(tuning.results)

# plot results
plot(tuning.results, cex.main=0.6, cex.lab=0.8,xaxs="i", yaxs="i")
In the preceding code snippet, we used the radial basis function kernel (RBF for short) for hyperparameter optimization. We chose RBF because it gave better specificity and sensitivity, even though its overall accuracy was comparable to that of linear kernels. We urge our readers to try out linear kernels and observe the difference in the overall results. Please note that, for text classification, linear kernels usually perform better than other kernels, not only in terms of accuracy but in performance as well.
The parameter tuning yields optimized values of 10 and 0.01 for the hyperparameters cost and gamma, respectively; the following plot confirms the same (the darkest region corresponds to the best values).
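It helps to keep in mind how much work a grid search implies. The sketch below simply enumerates the parameter combinations for the ranges used above; the fit count assumes that tune runs 10-fold cross-validation for each combination, which we believe is its default sampling scheme but which is stated here as an assumption.

```r
cost.weights  <- c(0.1, 10, 100)
gamma.weights <- c(0.01, 0.25, 0.5, 1)

# Every (cost, gamma) pair the grid search has to evaluate
paramGrid <- expand.grid(cost = cost.weights, gamma = gamma.weights)
print(paramGrid)

nCombinations <- nrow(paramGrid)
assumedFolds  <- 10                        # assumption: 10-fold cross-validation
nModelFits    <- nCombinations * assumedFolds
print(nModelFits)
```

Each extra value added to either range multiplies the number of fits, so it is usually worth starting with a coarse grid and refining around the best region.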
The following snippet of code uses the best model to make predictions and prepare a confusion matrix:
# get best model and evaluate predictions
svm.model.best = tuning.results$best.model
svm.predictions.best <- predict(svm.model.best, test.dtMatrix)
confusionMatrix(data=svm.predictions.best,
                reference=true.labels,
                positive="positive")
The following confusion matrix shows the predictions from a much improved model. From a mere 50% accuracy to a comfortable 80% and above is a good leap. Let us check the ROC curves for this model to confirm that the accuracy is indeed good enough:
To prepare the ROC curves, we will reuse our utility script performance_plot_utils.R from Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, and pass the predictions from the optimized model to it:
library(ROCR)                        # prediction()
source("performance_plot_utils.R")   # plot.roc.curve, plot.pr.curve

# plot best model evaluation metric curves
svm.predictions.best <- predict(svm.model.best, test.dtMatrix,
                                decision.values = T)
svm.prediction.values <- attributes(svm.predictions.best)$decision.values
predictions <- prediction(svm.prediction.values, true.labels)

par(mfrow=c(1,2))
plot.roc.curve(predictions, title.text="SVM ROC Curve")
plot.pr.curve(predictions, title.text="SVM Precision/Recall Curve")
The ROC curves generated using the preceding code snippet are as follows:
The ROC curves also confirm a well-learned model with an AUC of 0.89. We can therefore use this model to classify tweets into positive or negative classes. We encourage readers to try out ROC-based optimizations and observe if there are any further improvements in the model.
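For intuition about what the AUC measures, it can be computed directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below implements this rank-based (Mann-Whitney) formulation in base R on made-up scores; aucFromScores is a hypothetical helper name, and the numbers are toy values unrelated to the model above.

```r
# Rank-based AUC: share of (positive, negative) pairs ranked in the right order
aucFromScores <- function(scores, labels, positive = "positive") {
  isPos <- labels == positive
  nPos  <- sum(isPos)
  nNeg  <- sum(!isPos)
  r <- rank(scores)  # average ranks, so tied scores count as half
  (sum(r[isPos]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

scores <- c(0.9, 0.4, 0.35, 0.6, 0.2, 0.1)
labels <- c("positive", "positive", "positive",
            "negative", "negative", "negative")
toyAuc <- aucFromScores(scores, labels)
print(toyAuc)
```

Here seven of the nine positive-negative pairs are ordered correctly, giving an AUC of 7/9. A value of 0.5 would correspond to random ordering and 1 to perfect separation.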
Supervised Machine Learning algorithms, in a nutshell, are about learning the underlying functions or patterns which help us predict accurately (within certain bounds) based on historic data. Over the course of this book, we have come across many such algorithms and, although R makes it easy to code and test these, it is worth mentioning that learning a highly accurate function or pattern is not an easy task. Model complexity has to be balanced: highly complex models tend to overfit, while overly simple ones underfit, to name just two issues. Amidst all this confusion, it is worth noting that simple rules and functions are always easier to learn.
For example, to classify an email as spam or not spam there are multiple rules which a machine learning algorithm would have to learn, rules such as:
And many more such rules. Given a training dataset, say T, of labeled emails, a machine learning algorithm (specifically a classification algorithm) will generate a classifier, C, which is a hypothesis of an underlying function or pattern. We then use this classifier C to predict the labels for new emails.
On the other hand, an ensemble of classifiers is defined as a set of classifiers whose outputs are combined in some way to classify new examples. The main discovery in the field of machine learning-based ensembles is that ensembles often perform much better than the individual classifiers they are made of.
A necessary and sufficient condition for an ensemble to be better than its constituents is that the constituent classifiers should be accurate and diverse. A classifier is termed accurate if its predictions are better than random guessing (see weak learners), while two classifiers are termed diverse if they make different errors on the same data points.
We can define a weak learner as a learner whose predictions and decisions are at least better than random guessing. Weak learners are also termed as base learners or meta learners.
The following block diagram visualizes the concept of ensemble classifiers:
As seen in the preceding block diagram, the training dataset is split into n datasets (the splitting or generation of such datasets is dependent upon the ensemble-ing methodology) upon which weak learners (the same or different weak learners, again, depends upon the ensemble methodology) build models. These models are then combined based on weighted or unweighted voting to prepare a final model, which is used for classification. The mathematical proofs of why ensembles work are fairly involved and beyond the scope of this book.
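Without going into the proofs, a small calculation shows why the accuracy and diversity conditions matter. The sketch below assumes an idealized ensemble of n classifiers that err independently of each other (the extreme case of diversity), each correct with probability p, combined by unweighted majority vote. The function name majorityVoteAccuracy is hypothetical, and real classifiers are rarely fully independent, so treat the numbers as an upper-bound illustration.

```r
# Probability that a majority of n independent classifiers is correct,
# when each one is correct with probability p (n assumed odd)
majorityVoteAccuracy <- function(n, p) {
  k <- floor(n / 2) + 1                 # votes needed for a majority
  sum(dbinom(k:n, size = n, prob = p))  # P(at least k correct votes)
}

acc3    <- majorityVoteAccuracy(3, 0.7)   # three learners at 70% accuracy
acc11   <- majorityVoteAccuracy(11, 0.7)  # eleven learners at 70% accuracy
accWeak <- majorityVoteAccuracy(11, 0.4)  # learners worse than random guessing
print(c(acc3 = acc3, acc11 = acc11, accWeak = accWeak))
```

With learners that beat random guessing, adding more of them pushes the ensemble accuracy up (0.784 for three learners at 0.7). With learners that are worse than random guessing, the vote amplifies their errors instead, which is why each constituent must be accurate in the sense defined above.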
There are various ways of constructing ensemble classifiers (or regressors) and boosting is one of them. Boosting emerged from Robert Schapire's pioneering 1990 paper, The Strength of Weak Learnability, in which he elegantly describes the boosting ensemble while answering a question posed by Kearns and Valiant in their 1989 paper: whether multiple weak learners can create a single strong learner.
The Strength of Weak Learnability: http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf
Kearns and Valiant, Cryptographic Limitations on Learning Boolean Formulae and Finite Automata: http://dl.acm.org/citation.cfm?id=73049
The original algorithm for boosting was revised by Freund and Schapire, and termed as AdaBoost or Adaptive Boosting. This algorithm was practically implementable and empirically improves generalization performance. The algorithm can be mathematically presented as follows:
Here:
In simple words, boosting, in general, begins by initially assigning equal weights to all training examples. It then iterates over the hypothesis space to learn a hypothesis ht on the weighted examples. After each such hypothesis is learned, the weights are adjusted in such a manner that the weights of the examples that are correctly classified are reduced. This update to weights helps weak learners, in coming iterations, to concentrate more on wrongly classified data points. Finally, each of the learned hypotheses is then passed through a weighted voting to come up with a final model, H.
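The weight update described above can be traced by hand for a single round. The sketch below follows the commonly cited AdaBoost formulation, with labels and predictions coded as -1/+1, weighted error eps, and vote weight alpha = 0.5 * log((1 - eps) / eps). The data are toy values and the variable names are chosen for illustration, so this is a sketch of the idea rather than the notation used in the original presentation.

```r
# One AdaBoost round on five toy examples (labels coded as -1 / +1)
y <- c( 1,  1, -1, -1,  1)            # true labels
h <- c( 1, -1, -1, -1,  1)            # weak hypothesis h_t: wrong on example 2
w <- rep(1 / length(y), length(y))    # equal initial weights

eps   <- sum(w[h != y])               # weighted error of h_t
alpha <- 0.5 * log((1 - eps) / eps)   # vote weight of h_t in the final model
w     <- w * exp(-alpha * y * h)      # shrink correct, grow misclassified
w     <- w / sum(w)                   # renormalize so the weights sum to 1
print(round(w, 3))
```

After one round, the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right. The final model H then takes the sign of the alpha-weighted sum of all the learned hypotheses.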
Now that we have an overview of ensemble methods and boosting in general, let us use the boosting implementation available from the RTextTools library in R to classify tweets as positive or negative.
We will reuse the training-testing document term matrices train.dtMatrix and test.dtMatrix, and container objects train.container and test.container, which we created for the SVM-based classification.
For building a classifier based on a boosting ensemble, RTextTools provides an easy-to-use utility function called train_model. It uses LogitBoost internally to build a classifier. We use 500 iterations for building our boosting ensemble.
boosting.model <- train_model(train.container, "BOOSTING", maxitboost=500)
boosting.classify <- classify_model(test.container, boosting.model)
We then prepare a confusion matrix to see how our classifier performs on the test dataset.
predicted.labels <- boosting.classify[,1]
true.labels <- as.factor(test.data[,1])
confusionMatrix(data = predicted.labels,
                reference = true.labels,
                positive = "positive")
The following confusion matrix shows that our boosting-based classifier works with an accuracy of 78.5%, which is fairly good given the fact that we did not perform any performance tuning. Compare this to the initial iteration of SVM where we got a dismal accuracy of just over 50%.
As mentioned earlier, ensemble methods (specifically boosting) have improved generalized performance, that is, they help achieve close to 0 training errors without overfitting on the training data. To understand and evaluate our Boosting classifier on these parameters, we will use a model-evaluation technique called Cross-validation.
Cross-validation is a model-evaluation technique which is used to evaluate the generalization performance of a model. It is also termed rotational estimation. Cross-validation is a better measure to validate a model for generalization compared to residual methods because, for conventional validation techniques, the error (such as Root Mean Square Error/RMSE) for the training set and testing set does not properly represent the model's performance. Cross-validation can be performed using:
We can easily perform K-fold cross validation on our boosting classifier using the cross_validate function. In general, 10-fold cross validation is used:
# Cross validation
N=10
set.seed(42)
cross_validate(train.container, N, "BOOSTING", maxitboost=500)
The preceding code snippet produces the following cross validation summary:
The results show that the classifier has generalized well enough, and has an overall mean accuracy of 97.8%.
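For readers curious about what happens behind a K-fold run, the sketch below shows the bookkeeping only: each observation is assigned to one of K folds, and each fold takes a turn as the held-out set. It uses toy sizes (25 observations, 5 folds) and hypothetical variable names, and it does not reproduce how cross_validate is implemented internally.

```r
set.seed(42)
nObs <- 25
K    <- 5

# Shuffled fold labels: every observation lands in exactly one fold
foldId <- sample(rep(seq_len(K), length.out = nObs))

heldOutSizes <- integer(K)
for (k in seq_len(K)) {
  testIdx  <- which(foldId == k)   # held-out fold for this round
  trainIdx <- which(foldId != k)   # remaining folds used for training
  # a model would be fit on trainIdx and scored on testIdx here
  heldOutSizes[k] <- length(testIdx)
}
print(heldOutSizes)
```

The accuracies from the K rounds are then averaged, which is the kind of overall mean reported in a cross-validation summary.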
Boosting is one of the methods for constructing ensemble classifiers based on weak learners. Other methods, such as bagging, the Bayes optimal classifier, bucketing, and stacking, each come with their own pros and cons.
Constructing ensembles
RTextTools is a robust library which provides functions such as train_models and classify_models to prepare ensembles by combining various base learners. It also provides tools for generating analysis for evaluating the performance of such ensembles in a very detailed manner. Check out the detailed explanation at https://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf.