Sentiment analysis on Tweets

Now that we are equipped with the key terms and concepts from the world of Sentiment Analysis, let us put our theory to the test. We have seen some major application areas for Sentiment Analysis and the general challenges involved in performing such analytics. In this section, we will perform Sentiment Analysis categorized into:

  • Polarity analysis: This will involve the scoring and aggregation of sentiment polarity using a labeled list of positive and negative words.
  • Classification-based analysis: In this approach we will make use of R's rich libraries to perform classification based on labeled tweets available for public usage. We will also discuss their performance and accuracy.

R has a very robust library for the extraction and manipulation of information from Twitter called twitteR. As we saw in the previous chapter, we first need to create an application using Twitter's application management console before we can use twitteR or any other library for sentiment analysis. For this chapter, we will be reusing the application from the previous chapter (keep your application keys and secrets handy). Also, in the coming sections, we will be utilizing our code from previous chapters in a more structured format to enable reuse and to follow #bestCodingPractices.

Before we begin our analysis, let us first restructure our existing code and write some helper functions, which will come in handy later on. As we know, data from Twitter can be extracted using search terms or from a user's timeline. The following two helper functions help us do exactly these tasks in a reusable fashion:

#extract search tweets
extractTweets <- function(searchTerm,tweetCount){
  # search term tweets
  tweets = searchTwitter(searchTerm,n=tweetCount)
  tweets.df = twListToDF(tweets)
  tweets.df$text <- sapply(tweets.df$text,function(x) iconv(x,to='UTF-8'))
  
  return(tweets.df)
}

#extract timeline tweets
extractTimelineTweets <- function(username,tweetCount){
  # timeline tweets
  twitterUser <- getUser(username)
  tweets = userTimeline(twitterUser,n=tweetCount)
  tweets.df = twListToDF(tweets)
  tweets.df$text <- sapply(tweets.df$text,function(x) iconv(x,to='UTF-8'))
  
  return(tweets.df)
}

The function extractTweets takes a search term and the number of tweets to be extracted as inputs, and returns the data in a data frame with the text converted to UTF-8 encoding. Similarly, the function extractTimelineTweets takes a username and the number of tweets as inputs and returns a data frame, again with the text converted to UTF-8 encoding. These two functions will let us extract tweets multiple times (based on different search terms or users) without rewriting the same lines of code again and again.

Continuing with the same theme, we will write another helper function to clean and transform our data set. As we saw in the previous chapter, R's tm library provides us with various utility functions to quickly clean and transform text corpus. In this function, we will make use of tm_map to transform our tweets:

# clean and transform tweets
transformTweets <- function(tweetDF){
  tweetCorpus <- Corpus(VectorSource(tweetDF$text))
  tweetCorpus <- tm_map(tweetCorpus, content_transformer(tolower))
  tweetCorpus <- tm_map(tweetCorpus, removePunctuation)
  tweetCorpus <- tm_map(tweetCorpus, removeNumbers)
  
  # remove URLs (the pattern also catches https links)
  removeURL <- function(x) gsub("https?://\\S+", "", x)
  tweetCorpus <- tm_map(tweetCorpus, content_transformer(removeURL))
  
  # remove stop words
  twtrStopWords <- c(stopwords("english"),'rt','http','https')
  tweetCorpus <- tm_map(tweetCorpus, removeWords, twtrStopWords)
  
  tweetCorpus <- tm_map(tweetCorpus, PlainTextDocument)
  
  #convert back to dataframe
  tweetDataframe <- data.frame(text=unlist(sapply(tweetCorpus, 
                    `[`, "content")), stringsAsFactors=F)
  
  #split each doc into words
  splitText <- function(x) {
    word.list = str_split(x, '\\s+')
    words = unlist(word.list)
  }
  
  # attach list of words to the data frame
  tweetDataframe$wordList = sapply(
                    tweetDataframe$text,
                    function(text) splitText(text))
  
  return (tweetDataframe)
}

In addition to the usual transformations, such as stop word removal, conversion to lower case, and punctuation removal, the function transformTweets tokenizes each tweet at the word level and attaches each tweet's list of words to the resulting object. The function returns the transformed tweets in a data frame for further manipulation.

Polarity analysis

Polarity, as discussed in the section Key Concepts, is the positive, negative, or neutral classification of the piece of text under consideration. The class labels may change depending upon the context (liked versus disliked, or favorable versus unfavorable). Polarity may also have a degree attached to it, which places the analyzed text on a continuous (or discrete) scale of polarities (say from -5 to 5). This degree of polarity helps us analyze the extent (or degree) of positivity (or negativity) in the text. This is particularly useful in comparative studies, as we have the opportunity to view analyzed text with reference to certain benchmarks.

In this section, we will analyze tweets and score each of them based on the polar words identified in each of the tweets. The simple and easy-to-code algorithm is outlined in the following steps:

  1. Extract tweets based on selected search terms or Twitter handles.
  2. Clean and transform tweets into a suitable format for ease of analysis. Tokenize tweets into a constituent list of words.
  3. Load the list of positive and negative words to be used for polar word identification.
  4. For each tweet, count the number of positive and negative words that match the list of positive and negative words obtained in the preceding step 3.
  5. Assign a polarity score to each tweet based on the difference between positive and negative matches in the preceding step.

The preceding steps are represented diagrammatically as follows:

[Figure: the polarity analysis workflow, from tweet extraction and tokenization to scoring against the polar word lists]

Once each tweet in the dataset has been scored, we may aggregate the scores to understand the overall sentiment distribution related to the search terms or Twitter handle. Positive values define a positive sentiment; larger numbers denote a greater degree of positivity, and similarly for negative sentiments. A neutral stance is represented by a score of 0. For example, This car is amazingly fast and beautiful has a greater degree of positivity than This is a nice car, though both are positive sentences.

Let us use this algorithm to analyze sentiments using search terms and Twitter handles. As discussed previously, opinion mining has become essential, not just for brands but for governments as well. Every entity out there wants to gauge how its target audience feels about it and its initiatives, and governments are no exception. Of late, the Indian Government has been utilizing Twitter and other social media platforms effectively to reach its audience and make them aware about its initiatives and policies. One such initiative is the recently launched Make in India initiative. Consider a scenario where one is tasked with analyzing the effectiveness of and public opinion related to such an initiative. To analyze public opinion, which changes dynamically over time, Twitter would be a good choice. So, to analyze sentiments for the Make in India initiative, let us analyze some tweets.

As previously outlined, we start by connecting to Twitter and extracting tweets related to the search term Make In India. This is followed by the preprocessing step, where we remove stop words, URLs, and so on to transform the tweets into a usable format. We also tokenize each tweet into a list of constituent words for use in the coming steps. Once our dataset is ready and in a consumable format, we load the precompiled list of positive and negative words. The list is available from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html.

We first write a reusable analyzeTrendSentiments function which takes the search term and number of tweets to be extracted as inputs. It makes use of the functions extractTweets and transformTweets to get the job done:

analyzeTrendSentiments <- function(search,tweetCount){ 
  
  #extract tweets
  tweetsDF <- extractTweets(search,tweetCount)
  
  # transformations
  transformedTweetsDF <- transformTweets(tweetsDF)
  
  #score the words  
  transformedTweetsDF$sentiScore = sapply(transformedTweetsDF$wordList,
                        function(wordList) scoreTweet(wordList))

  transformedTweetsDF$search <- search
  
  return(transformedTweetsDF) 
}
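The scoreTweet helper used above simply implements steps 4 and 5 of the algorithm; a minimal sketch, assuming the pos.words and neg.words vectors loaded in the next snippet are in scope, could look like this:

# minimal sketch of a polarity scorer: positive matches minus negative matches
# (assumes pos.words and neg.words, loaded in the next snippet, are available)
scoreTweet <- function(wordList){
  pos.matches <- sum(!is.na(match(wordList, pos.words)))
  neg.matches <- sum(!is.na(match(wordList, neg.words)))
  return(pos.matches - neg.matches)
}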

We then use the function analyzeTrendSentiments to get a data frame consisting of tweets scored using a precompiled list of polar words. We use the twitteR, stringr, tm, and ggplot2 libraries as well:

library(twitteR)
library(stringr)
library(tm)
library(ggplot2)

consumerSecret = "XXXXXXXXXX"
consumerKey = "XXXXXXXXXXXXXXXXXXXXXXXXX"

setup_twitter_oauth(consumer_key = consumerKey,consumer_secret = consumerSecret)

# list of positive/negative words from opinion lexicon
pos.words = scan(file= 'positive-words.txt', what='character', comment.char=';')

neg.words = scan(file= 'negative-words.txt', what='character', comment.char=';')

#extract 1500 tweets on the given topic
makeInIndiaSentiments <- analyzeTrendSentiments("makeinindia",1500)

#plot the aggregated scores on a histogram
qplot(makeInIndiaSentiments$sentiScore)

In the last chapter, we learned and used different visualizations to grasp the insights hidden in our analysis. Continuing with the same thought process, we generate a histogram of aggregated scores. The visualization looks like this:

[Figure: histogram of aggregated sentiment scores for the makeinindia tweets]

The histogram is easy to interpret. It shows the tweets distributed across a polarity scale on the x-axis and the frequency of tweets on the y-axis. The scores follow a roughly normal distribution with a general tilt towards the positive side. It seems the initiative is getting a positive response from its audience.

Going a bit deeper into the analysis itself, let us analyze the sentiments for the same search term and see how the opinions change over time.

Note

The tweets for this analysis were extracted on the day the initiative was launched as well as a day later. Your results may vary due to the dynamic nature of Twitter. You may observe a difference in outcomes across other examples in this chapter as well. We urge you to be creative and try out other trending topics while working through examples from this chapter.

The output looks like this:

[Figure: sentiment score histograms for makeinindia tweets from the two days]

The preceding two histograms show a shift in opinions over the course of two days. If you were following the news at the time, you may recall that a sudden fire broke out during one of the initiative's events and burnt down the whole stage. The graph on top is based on tweets after the fire broke out, while the graph labeled makeinindia_yday refers to the tweets from the day before. Though the shift in sentiments isn't drastic, the shift towards the positive side is clearly visible (some tweets are even hitting a score of 6+). Could this be because tweeple started praising the emergency teams and police for their quick action in preventing casualties? Well, it looks like Twitter isn't just about people ranting about random stuff!

Note

World leaders

Celebrities and politicians alike have caught the Twitter frenzy. As a quick exercise, try analyzing tweets from the Twitter handles of world leaders such as @potus, @pmoindia, and @number10gov to see what kind of opinions our leaders project through Twitter. Don't be surprised if their timelines are neutral...oops, diplomatic!
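One possible starting point for this exercise, reusing the helpers defined earlier (the function name analyzeUserSentiments is ours, by analogy with analyzeTrendSentiments):

# hypothetical helper mirroring analyzeTrendSentiments, but for a user timeline
analyzeUserSentiments <- function(username, tweetCount){
  tweetsDF <- extractTimelineTweets(username, tweetCount)
  transformedTweetsDF <- transformTweets(tweetsDF)
  transformedTweetsDF$sentiScore <- sapply(transformedTweetsDF$wordList,
                                           function(wordList) scoreTweet(wordList))
  transformedTweetsDF$search <- username
  return(transformedTweetsDF)
}

# example usage
# potusSentiments <- analyzeUserSentiments("potus", 1500)
# qplot(potusSentiments$sentiScore)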


Classification-based algorithms

A classification problem requires the labeling of input data into required classes based on some defined characteristics of each class (see Chapter 2, Let's Help Machines Learn, for details). In the case of sentiment analysis, the classes are positive and negative (or neutral in certain cases). In the previous chapters, we learned about different classification algorithms and saw how they are used across domains to solve categorization and classification problems; sentiment analysis is yet another domain where these algorithms are highly useful.

In this section, we will perform opinion mining using classification algorithms such as SVM and boosting. We will also touch upon ensemble methods and see how they help to improve performance. Note that, for this section, we will concentrate only on the positive and negative polarities, but the approach is generic enough to be easily extended to include the neutral polarity as well.

Labeled dataset

Since this is a supervised learning approach, we require labeled data for training and testing the performance of our algorithms. For the purpose of this chapter, we will utilize a labeled dataset from http://www.sentiment140.com/. It contains tweets labeled as 0, 2, and 4 for negative, neutral and positive sentiments, respectively. There are various attributes such as tweet ID, date, search query, username, and the tweet text, apart from the sentiment label. For our case we will be considering only the tweet text and its corresponding label.


Note

Another source of labeled tweets is available at https://github.com/guyz/twitter-sentiment-dataset. This source makes use of a Python script to download around 5000 labeled tweets while keeping Twitter API guidelines in mind.

Before we get into the algorithm-specific details, let us look into the labeled dataset and perform the initial steps of collecting and transforming our data into the required forms. We will make use of libraries such as caret and RTextTools for these steps.

As mentioned previously, the dataset contains polarities labeled as 0, 2, and 4 for negative, neutral, and positive. We will load the CSV file in R and apply a quick transformation to change the labels to positive and negative. Once the polarities have been transformed into intelligible names, we will filter out the rows containing neutral sentiments. Also, we will keep only the columns for polarity and tweet text, and remove the rest.

# load required libraries
library(caret)
library(RTextTools)

# load labeled dataset
labeledDSFilePath = "labeled_tweets.csv"
labeledDataset = read.csv(labeledDSFilePath, header = FALSE)

# transform polarity labels
labeledDataset$V1 = sapply(labeledDataset$V1, 
    function(x) 
      if(x==4) 
        x <- "positive" 
      else if(x==0) 
        x<-"negative" 
      else x<- "none")

#select required columns only
requiredColumns <- c("V1","V6")

# extract only positive/negative labeled tweets 
tweets <- as.matrix(labeledDataset[labeledDataset$V1 %in% c("positive","negative"),
                                   requiredColumns])

The tweets object is now available as a matrix with each row representing a tweet, and with columns referring to polarity and tweet text. Before we transform this matrix into the formats required by the classification algorithms, we need to split our data into training and testing datasets (see Chapter 2, Let's Help Machines Learn, for more on this). Since both the training and testing datasets should contain a good enough distribution of samples of all classes for the purposes of training and testing, we use the createDataPartition function available from the caret package. For our use case, we split our data into 70/30 training and testing datasets:

indexes <- createDataPartition(tweets[,1], p=0.7, list = FALSE)

train.data <- tweets[indexes,]
test.data <- tweets[-indexes,]

We perform a quick check to see how our data is split across the positive and negative classes in our original dataset, and the training and testing datasets. You will see the result in the following screenshot:

[Figure: sentiment class distribution in the original, training, and testing datasets]
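One simple way to perform such a check is with table and prop.table (a possible sketch; the exact code used for the screenshot may differ):

# quick look at the class balance in the full, training, and testing sets
prop.table(table(tweets[,1]))
prop.table(table(train.data[,1]))
prop.table(table(test.data[,1]))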

As we can see, createDataPartition has done a nice job of maintaining a similar sentiment distribution across the training and testing datasets.

Next in the line of transformations is the Document Term Matrix transformation. As we have seen in Chapter 7, Social Media Analysis – Analyzing Twitter Data, a document term matrix represents a dataset as rows of documents and columns of terms (words/phrases). Unlike the previous chapter, where we used the tm library's DocumentTermMatrix function for transformation and applied various transformations using tm_map, for the current use case we will use the create_matrix function from the RTextTools library. This function is an abstraction over tm's corresponding functions. We will also assign a weight to each term using tf-idf. The create_matrix function also takes care of splitting sentences into words, removing stop words and numbers, and stemming the words. Here's how you do it:

train.dtMatrix <- create_matrix(train.data[,2], 
                        language="english" , 
                        removeStopwords=TRUE, 
                        removeNumbers=TRUE,
                        stemWords=TRUE,
                        weighting = tm::weightTfIdf)

test.dtMatrix <- create_matrix(test.data[,2], 
                               language="english" , 
                               removeStopwords=TRUE, 
                               removeNumbers=TRUE,
                               stemWords=TRUE,
                               weighting = tm::weightTfIdf,
                               originalMatrix=train.dtMatrix)

test.data.size <- nrow(test.data)

Note

The create_matrix method in RTextTools v1.4.2 has a small bug which prevents weight assignment when using the originalMatrix option. The following small hack can be used to fix the issue until the library gets updated:

>  trace("create_matrix",edit=T) 

Scroll to line 42 and update Acronym to acronym.

Check the following links for more details and alternate ways of handling this issue:

https://github.com/timjurka/RTextTools/issues/4

http://stackoverflow.com/questions/16630627/recreate-same-document-term-matrix-with-new-data

Now that we have both the training and testing datasets in the DocumentTermMatrix format, we can proceed towards the classification algorithms and let our machines learn and build sentiment classifiers!

Support Vector Machines

Support Vector Machines, or SVM as they are commonly known, are one of the most versatile classes of supervised learning algorithms for classification. An SVM builds a model in such a way that the data points belonging to different classes are separated by a clear gap, which is optimized so that the distance of separation is as large as possible. The samples lying on the margins are called the support vectors, and the classes are separated by a hyperplane (see Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, for more details).

Since our current use case for sentiment analysis is also a binary (positive and negative) classification problem, SVM helps us build a model using the training dataset, which separates tweets into positive and negative sentiment classes, respectively.

We will use the e1071 library's svm function to build a sentiment classifier. We start off with the default values for the SVM classifier available from the library and follow the same iterative procedure we did in Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, to finally arrive at the best classifier. The following snippet of code builds a sentiment classifier using the default values and then prints a confusion matrix, along with other statistics, for evaluation:

library(e1071)

svm.model <- svm(train.dtMatrix, as.factor(train.data[,1]))

## view initial model details
summary(svm.model)

## predict and evaluate results
svm.predictions <- predict(svm.model, test.dtMatrix)

true.labels <- as.factor(test.data[,1])

confusionMatrix(data=svm.predictions, reference=true.labels, positive="positive")

The confusion matrix generated as follows shows that the classifier has just 50% accuracy, which is as bad as a coin toss, with no predictions for negative sentiments whatsoever! It seems like the classifier couldn't infer or learn much from the training dataset.

[Figure: confusion matrix and statistics for the default SVM model]

To build a better-performing model, we will now go under the hood and tweak some parameters. The svm implementation from e1071 provides us with a wonderful utility called tune to obtain the optimized values of hyperparameters using a grid search over the given parameter ranges:

## hyperparameter optimizations

# run grid search
cost.weights <- c(0.1, 10, 100)
gamma.weights <- c(0.01, 0.25, 0.5, 1)
tuning.results <- tune(svm, train.dtMatrix, as.factor(train.data[,1]), kernel="radial", 
                       ranges=list(cost=cost.weights, gamma=gamma.weights))

# view optimization results
print(tuning.results)

# plot results
plot(tuning.results, cex.main=0.6, cex.lab=0.8,xaxs="i", yaxs="i")

Note

In the preceding code snippet, we have utilized the radial basis function kernel (or rbf for short) for hyperparameter optimization. The motivation for using rbf was its better performance with respect to specificity and sensitivity, even though the overall accuracy was comparable to that of linear kernels. We urge our readers to try out linear kernels and observe the difference in the overall results; a small sketch follows this note. Please note that, for text classification, linear kernels usually perform better than other kernels, not only in terms of accuracy but in training performance as well.
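For readers who want to try the suggested linear-kernel comparison, one possible sketch, reusing the cost.weights vector defined earlier, is:

# grid search over cost only, since gamma does not apply to the linear kernel
tuning.results.linear <- tune(svm, train.dtMatrix, as.factor(train.data[,1]),
                              kernel="linear",
                              ranges=list(cost=cost.weights))
print(tuning.results.linear)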

Parameter tuning yields optimized values of 10 and 0.01 for the hyperparameters cost and gamma, respectively; the following plot confirms this (the darkest region corresponds to the best values).

[Figure: grid-search plot of the tuning results for cost and gamma]

The following snippet of code uses the best model to predict and prepare a confusion matrix, as follows:

# get best model and evaluate predictions
svm.model.best = tuning.results$best.model

svm.predictions.best <- predict(svm.model.best, test.dtMatrix)

confusionMatrix(data=svm.predictions.best, reference=true.labels, positive="positive")

The following confusion matrix shows the predictions from a much improved model. From a mere 50% accuracy to a comfortable 80% and above is a good leap. Let us check the ROC curves for this model to confirm that the accuracy is indeed good enough:

[Figure: confusion matrix and statistics for the tuned SVM model]

To prepare the ROC curves, we will reuse our utility script performance_plot_utils.R from Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, and pass the predictions from the optimized model to it:

# performance_plot_utils.R (from Chapter 6) defines plot.roc.curve and plot.pr.curve
library(ROCR)
source("performance_plot_utils.R")

# plot best model evaluation metric curves
svm.predictions.best <- predict(svm.model.best, test.dtMatrix, decision.values = T)

svm.prediction.values <- attributes(svm.predictions.best)$decision.values

predictions <- prediction(svm.prediction.values, true.labels)

par(mfrow=c(1,2))
plot.roc.curve(predictions, title.text="SVM ROC Curve")
plot.pr.curve(predictions, title.text="SVM Precision/Recall Curve")

The ROC curves generated using the preceding code snippet are as follows:

[Figure: ROC and precision/recall curves for the tuned SVM model]

The ROC curves also confirm a well-learned model with an AUC of 0.89. We can therefore use this model to classify tweets into positive or negative classes. We encourage readers to try out ROC-based optimizations and observe if there are any further improvements in the model.
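If you wish to read off the AUC value directly rather than from the plot, ROCR's performance function can be applied to the same prediction object (a small sketch):

# compute the area under the ROC curve from the ROCR prediction object
auc.value <- performance(predictions, measure = "auc")@y.values[[1]]
print(auc.value)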

Ensemble methods

Supervised Machine Learning algorithms, in a nutshell, are about learning the underlying functions or patterns which help us predict accurately (within certain bounds) based on historic data. Over the course of this book, we have come across many such algorithms and, although R makes it easy to code and test these, it is worth mentioning that learning a highly accurate function or pattern is not an easy task. Building highly complex models leads to issues such as overfitting and underfitting. Amidst all this, it is worth noting that simple rules and functions are always comparatively easy to learn.

For example, to classify an email as spam or not spam, a machine learning algorithm would have to learn multiple rules, such as:

  • E-mails containing text such as buy now are spam
  • E-mails containing more than five hyperlinks are spam
  • E-mails from contacts in the address book are not spam

And many more such rules. Given a training dataset, say T of labeled emails, a machine learning algorithm (specifically a classification algorithm) will generate a classifier, C, which is a hypothesis of an underlying function or pattern. We then use this classifier C to predict the labels for new emails.

On the other hand, an ensemble of classifiers is defined as a set of classifiers whose outputs are combined in some way to classify new examples. The main discovery in the field of machine learning-based ensembles is that ensembles perform much better than the individual classifiers they are made of.

A necessary and sufficient condition for an ensemble to be better than its constituents is that the constituent classifiers should be accurate and diverse. A classifier is termed accurate if its predictions are better than random guessing (see weak learners below), while two classifiers are termed diverse if they make different errors on the same data points.

We can define a weak learner as a learner whose predictions and decisions are at least better than random guessing. Weak learners are also termed as base learners or meta learners.

The following block diagram visualizes the concept of ensemble classifiers:

[Figure: block diagram of an ensemble classifier built from weak learners]

As seen in the preceding block diagram, the training dataset is split into n datasets (how these datasets are generated depends upon the ensembling methodology), upon which weak learners (the same or different weak learners, again depending upon the methodology) build models. These models are then combined through weighted or unweighted voting into a final model, which is used for classification. The mathematical proofs of why ensembles work are fairly involved and beyond the scope of this book.

Boosting

There are various ways of constructing ensemble classifiers (or regressors), and boosting is one of them. Boosting originated with Robert Schapire's pioneering 1990 paper, The Strength of Weak Learnability, in which he elegantly describes the boosting ensemble while answering a question posed by Kearns and Valiant in 1989: can multiple weak learners be combined to create a single strong learner?

Note

The Strength of Weak Learnability: http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf.

Kearns and Valiant, Cryptographic limitations on learning Boolean formulae and finite automata: http://dl.acm.org/citation.cfm?id=73049

The original boosting algorithm was revised by Freund and Schapire and termed AdaBoost, or Adaptive Boosting. This algorithm is practically implementable and empirically improves generalization performance. The algorithm can be presented mathematically as follows:
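Given training examples $(x_1, y_1), \ldots, (x_m, y_m)$ with $x_i \in X$ and labels $y_i \in Y = \{-1, +1\}$, the standard statement of AdaBoost, written in the notation defined below, is:

  1. Initialize the weight distribution $D_1(i) = 1/m$ for $i = 1, \ldots, m$.
  2. For $t = 1, \ldots, T$: train a weak learner on the examples weighted by $D_t$ to obtain $h_t: X \to \{-1, +1\}$; compute its weighted error $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$; set $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$; and update $D_{t+1}(i) = D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i)) / Z_t$, where $Z_t$ normalizes $D_{t+1}$ to a probability distribution.
  3. Output the combined classifier $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.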

Here:

  • X is the training set
  • Y is the label set
  • Dt(i) is the weight distribution on training example i on iteration t
  • ht is the classifier obtained in iteration t
  • α is the strength parameter or weight of ht
  • H is the final or combined classifier.

In simple words, boosting, in general, begins by initially assigning equal weights to all training examples. It then iterates over the hypothesis space to learn a hypothesis ht on the weighted examples. After each such hypothesis is learned, the weights are adjusted in such a manner that the weights of the examples that are correctly classified are reduced. This update to weights helps weak learners, in coming iterations, to concentrate more on wrongly classified data points. Finally, each of the learned hypotheses is then passed through a weighted voting to come up with a final model, H.

Now that we have an overview of ensemble methods and boosting in general, let us use the boosting implementation available from the RTextTools library in R to classify tweets as positive or negative.

We will reuse the training and testing document term matrices, train.dtMatrix and test.dtMatrix, created for the SVM-based classification. RTextTools additionally requires container objects, train.container and test.container, which bundle each matrix with its labels.
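A possible sketch for building these containers with RTextTools' create_container is shown below; the trainSize, testSize, and virgin settings and the numeric label encoding are our own choices, so the predicted labels may need to be mapped back to positive/negative before being compared against true.labels.

# sketch: wrap the document term matrices and labels into RTextTools containers
train.data.size <- nrow(train.data)

# training container: labels are known, virgin = FALSE
train.container <- create_container(train.dtMatrix,
                                    as.numeric(as.factor(train.data[,1])),
                                    trainSize = 1:train.data.size,
                                    virgin = FALSE)

# testing container: labels are treated as unknown, virgin = TRUE
test.container <- create_container(test.dtMatrix,
                                   labels = rep(0, test.data.size),
                                   testSize = 1:test.data.size,
                                   virgin = TRUE)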

For building a classifier based on a boosting ensemble, RTextTools provides an easy-to-use utility function called train_model. It uses LogitBoosting internally to build a classifier. We use 500 iterations for building our boosting ensemble.

boosting.model <- train_model(train.container, "BOOSTING",
                              maxitboost = 500)
boosting.classify <- classify_model(test.container, boosting.model)

We then prepare a confusion matrix to see how our classifier performs on the test dataset.

predicted.labels <- boosting.classify[,1]
true.labels <- as.factor(test.data[,1])

confusionMatrix(data = predicted.labels, 
                reference = true.labels, 
                positive = "positive")

The following confusion matrix shows that our boosting-based classifier works with an accuracy of 78.5%, which is fairly good given the fact that we did not perform any performance tuning. Compare this to the initial iteration of SVM where we got a dismal accuracy of just over 50%.

[Figure: confusion matrix and statistics for the boosting-based classifier]

As mentioned earlier, ensemble methods (specifically boosting) have improved generalization performance; that is, they help achieve close to zero training error without overfitting on the training data. To understand and evaluate our boosting classifier on these terms, we will use a model-evaluation technique called Cross-validation.

Cross-validation

Cross-validation is a model-evaluation technique used to evaluate the generalization performance of a model. It is also termed rotational estimation. Cross-validation is a better measure of generalization than conventional residual-based validation because the error measured on a single training/testing split (using a metric such as Root Mean Square Error, or RMSE) does not reliably represent the model's performance on unseen data. Cross-validation can be performed using:

  • Holdout method: The simplest cross-validation technique. Data is split into training and testing sets. The model is fitted on the training set, and then the testing set (which the model hasn't seen so far) is used to calculate the mean absolute test error. This accumulated error is used to evaluate the model. This technique suffers from high variance due to its dependency on how the training-testing division was done.
  • K-fold cross-validation method: This is an improvement over the holdout method. The dataset is divided into k subsets and the holdout method is applied k times, each time using one of the k subsets as the test set and the remaining k-1 subsets as the training set. This method has lower variance because each data point appears in the test set exactly once and in the training set k-1 times. The disadvantage is that more computation time is required due to the number of iterations. An extreme form of K-fold cross-validation is the Leave-One-Out cross-validation method, where all data points except one are used for training and the process is repeated N (the size of the dataset) times. A small sketch of generating such folds follows this list.
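To make the fold construction concrete, here is an illustrative sketch using caret's createFolds (the cross_validate call below handles this splitting internally, so this is purely for intuition):

# illustrative only: split the training labels into 10 folds
folds <- createFolds(train.data[,1], k = 10)

# each element of 'folds' holds the row indices held out as the test set for that fold
sapply(folds, length)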

We can easily perform K-fold cross validation on our boosting classifier using the cross_validate function. In general, 10-fold cross validation is used:

# Cross validation
N=10
set.seed(42)
cross_validate(train.container, N, "BOOSTING",
               maxitboost = 500)

The preceding code snippet produces the following cross validation summary:

[Figure: 10-fold cross-validation summary for the boosting classifier]

The results show that the classifier has generalized well enough, and has an overall mean accuracy of 97.8%.

Boosting is one of the methods for constructing ensemble classifiers from weak learners. Methods such as bagging, the Bayes optimal classifier, bucketing, and stacking are other variants, each with its own pros and cons.

Note

Constructing ensembles

RTextTools is a robust library which provides functions such as train_models and classify_models to prepare ensembles by combining various base learners. It also provides tools for generating detailed analytics to evaluate the performance of such ensembles. Check out the detailed explanation at https://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf.
