Twitter sentiment analysis is another powerful tool in the text analytics toolbox.
With sentiment analysis, we can analyze the mood expressed within a text.
In this recipe, we will download tweets relating to "data science with R" and perform a sentiment analysis on them, employing the bag-of-words technique.
The main feature of this technique is that it does not try to understand the meaning of the analyzed text; it just looks at one word at a time and checks whether that word expresses a positive or negative sentiment.
Our example will therefore result in computing the overall sentiment around the topic, algebraically summing up the sentiment score of every single word in our bag.
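To make the idea concrete, here is a minimal base R sketch of bag-of-words scoring on a single sentence. The two tiny lexicons and the sample text are made up for illustration; the recipe itself reads its lexicons from text files.

```r
# Toy lexicons (hypothetical; the recipe loads real ones from files)
pos_words <- c("good", "great", "love")
neg_words <- c("bad", "awful", "hate")

text <- "I love R but the documentation is bad and the errors are awful"

# Split the text into lowercase words and score each word on its own
words  <- strsplit(tolower(text), "\\s+")[[1]]
scores <- ifelse(words %in% pos_words, 1,
          ifelse(words %in% neg_words, -1, 0))

# The overall sentiment is the algebraic sum of the single word scores
total_sentiment <- sum(scores)
print(total_sentiment)
```

One positive word and two negative words give a negative total, even though no attempt is made to understand the sentence as a whole.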
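This recipe relies mainly on three packages: one for downloading tweets (twitteR), one for string manipulation (stringr), and one for text analytics activities (tm). The code also calls melt() from the reshape2 package, wordcloud() from the wordcloud package, and brewer.pal() from the RColorBrewer package.
We therefore need to install and load those packages:
install.packages(c("twitteR", "stringr", "tm", "reshape2", "wordcloud", "RColorBrewer"))
library(twitteR)
library(stringr)
library(tm)
library(reshape2)      # provides melt()
library(wordcloud)     # provides wordcloud()
library(RColorBrewer)  # provides brewer.pal()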
Refer to Chapter 1, Acquiring Data for Your Project, and the Getting data from Twitter with the twitteR package recipe for further details on how to set up a Twitter application:
setup_twitter_oauth(consumer_key = 'xxxx', consumer_secret = 'xxxx', access_token = 'xxxx', access_secret = 'xxxx')
tweet_list <- searchTwitter('"data science with R"', n = 20)
tweet_df <- twListToDF(tweet_list)
# Read the positive lexicon and flatten it into a plain lowercase character vector
pos_words <- read.csv("lexicon/positive.txt", header = FALSE, stringsAsFactors = FALSE)
pos_words <- c(pos_words)
pos_words <- unlist(pos_words)
pos_words <- unname(pos_words)
pos_words <- tolower(pos_words)
# Do the same for the negative lexicon
neg_words <- read.csv("lexicon/negative.txt", header = FALSE, stringsAsFactors = FALSE)
neg_words <- c(neg_words)
neg_words <- unlist(neg_words)
neg_words <- unname(neg_words)
neg_words <- tolower(neg_words)
tweets <- tweet_df[,1]
Clean up the tweets with the gsub() function and regular expressions:
tweets <- gsub('[[:punct:]]', '', tweets)
tweets <- gsub('[[:cntrl:]]', '', tweets)
tweets <- gsub('\\d+', '', tweets)  # the backslash must be doubled inside an R string
tweets <- gsub("RT", '', tweets)
tweets <- gsub("�", '', tweets)
tweets <- removeWords(tweets,stopwords(kind = "en"))
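Split the tweets into words with the str_split_fixed() function from the stringr package, where n = 140 is the maximum number of characters in a tweet and therefore a safe upper bound on the number of words:
word_df <- str_split_fixed(tweets, '\\s+', n = 140)
# column 12 of tweet_df holds the number of retweets
word_df <- data.frame(word_df, "RT" = tweet_df[,12])
# melt() comes from the reshape2 package
word_count <- melt(word_df, id.vars = "RT")
word_df <- data.frame("word" = word_count[,3], "RT" = word_count[,1])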
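# match() returns the position of each word in the lexicon, or NA when it is absent
word_df$is_positive <- match(unlist(word_count[,3]), pos_words)
word_df$is_negative <- match(unlist(word_count[,3]), neg_words)
# drop the empty strings that str_split_fixed() uses as padding
word_df <- subset(word_df, word_df[,1] != "")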
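sentiment_scorer <- function(pos_match, neg_match) {
  if (is.na(pos_match) && is.na(neg_match)) {
    0    # word found in neither lexicon
  } else if (is.na(pos_match) && !is.na(neg_match)) {
    -1   # word found only in the negative lexicon
  } else {
    1    # word found in the positive lexicon
  }
}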
Apply the sentiment_scorer() function:
word_df <- data.frame(word_df, score = mapply(sentiment_scorer, word_df$is_positive, word_df$is_negative))
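popularity_scorer <- function(rt, basic_score) {
  if (rt == 0) {
    basic_score        # no retweets: leave the score untouched
  } else {
    rt * basic_score   # otherwise weight the score by the number of retweets
  }
}
word_df$final_score <- mapply(popularity_scorer, word_df$RT, word_df$score)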
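# sum the scores of repeated words; aggregate() names the resulting columns Group.1 and x
total_df <- aggregate(word_df$final_score, list(word_df$word), sum)
# size each word by the absolute value of its summed score
cloud <- wordcloud(total_df$Group.1, abs(total_df$x), scale = c(10, .20), colors = brewer.pal(10, "Spectral"))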
Find out the total sentiment score:
total_sentiment <- sum(word_df$final_score)
In step 3, we define positive and negative words. This step involves reading the positive.txt and negative.txt files into the R environment and manipulating them in order to produce a familiar vector, such as the following one:
> head(pos_words)
[1] "a+" "abound" "abounds" "abundance" "abundant" "accessable"
Since read.csv() returns a data frame, which c() turns into a list, we first have to unlist the pos_words list and then remove the element names with unname().
A final touch is added, changing all capital letters to lowercase in order to ensure comparability with tweet text.
In step 5, we clean up tweets with gsub() and regular expressions. In this step, we remove punctuation, control characters, digits, and other specific strings (such as the RT marker) from our tweets by applying gsub() to them repeatedly.
This function only requires a pattern to look for, a replacement string, and an object in which to look for the pattern.
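As an illustration, the following sketch applies the same kind of gsub() chain to one made-up tweet. The sample text and the final whitespace tidy-up are assumptions added for the example and are not part of the recipe.

```r
# A made-up tweet used only for illustration
sample_tweet <- "RT @user: Data science with R is great!!! 100% recommended"

cleaned <- gsub("[[:punct:]]", "", sample_tweet)  # drop punctuation
cleaned <- gsub("[[:cntrl:]]", "", cleaned)       # drop control characters
cleaned <- gsub("\\d+", "", cleaned)              # drop digits
cleaned <- gsub("RT", "", cleaned)                # drop the retweet marker

# Extra step (not in the recipe): collapse the leftover whitespace
cleaned <- gsub("\\s+", " ", trimws(cleaned))
print(cleaned)
```

Note that the plain "RT" pattern also removes those two letters when they appear inside a longer uppercase word, so a stricter pattern such as "\\bRT\\b" may be preferable on real data.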
In step 6, we remove stopwords. Stopwords are common words in a language, such as "and," "or," and "even." Since they add no great value to the text in terms of comprehension, they are usually removed.
If you want to have a look at those words, you just need to run the following command:
> head(stopwords(kind = "en"))
[1] "i" "me" "my" "myself" "we" "our"
In step 7, we split tweets into words. Using the str_split_fixed() function, we split our tweets into separate words in order to apply text analytics techniques on them, like the ones seen in the previous recipes.
Besides the strings to split, this function requires that you specify two main arguments: the pattern to split on (here, one or more whitespace characters) and n, the number of pieces to return for each string.
After applying this function to all our tweets, we have a character matrix with one row per tweet and one column per word position; positions beyond the last word of a tweet hold empty strings.
Since we will use the number of retweets in order to compute our final sentiment score, we will now add this information to the data frame with the following line of code:
word_df <- data.frame(word_df,"RT" = tweet_df[,12])
Once we do this, the data frame contains one additional column, RT, holding the number of retweets of each tweet.
This is not what we need yet, since our minimum object of analysis is the single word.
What we are looking for is actually a tidy dataset, like the ones introduced in Chapter 2, Preparing for Analysis – Data Cleansing and Manipulation, where each row stores an observation.
In order to obtain this kind of dataset, we will apply the melt() function, which stacks all the words into a single column and repeats, for each word, the number of retweets received by the tweet it came from.
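The reshaping from one row per tweet to one row per word can be sketched in base R as follows. This is not the str_split_fixed() and melt() combination used in the recipe but an equivalent illustration built on strsplit() and rep(); the two tweets and their retweet counts are made up.

```r
# Made-up cleaned tweets and retweet counts
tweets <- c("great useful package", "bad docs")
rt     <- c(3, 0)

# One character vector of words per tweet
word_list <- strsplit(tweets, "\\s+")

# Long (tidy) format: one row per word, with the retweet count of the
# originating tweet repeated for each of its words
word_df <- data.frame(
  word = unlist(word_list),
  RT   = rep(rt, times = lengths(word_list)),
  stringsAsFactors = FALSE
)
print(word_df)
```

Unlike str_split_fixed(), strsplit() does not pad short tweets with empty strings, so no empty rows need to be filtered out afterwards.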
In step 8, we match words with the lexicon. In this step, we add two new attributes to the word_df data frame:
is_positive, which holds the position of the word within the pos_words vector of positive words, or NA when the word is not found there.
is_negative, which holds the position of the word within the neg_words vector, or NA when the word is not found there.
We then remove the empty strings left over from the split. Words that are neither positive nor negative stay in the data frame and receive a score of 0 in the next step, so they do not affect the result.
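The behavior of match() is easy to check on a small example. In the sketch below, the lexicons and the words are made up; note that match() returns a position or NA rather than TRUE or FALSE.

```r
# Toy lexicons and words (hypothetical values)
pos_words <- c("good", "great", "love")
neg_words <- c("bad", "awful", "hate")
words     <- c("great", "package", "bad", "")

# Position of each word in the lexicon, NA when the word is absent
is_positive <- match(words, pos_words)
is_negative <- match(words, neg_words)

# Drop the empty string, as the recipe does with subset()
keep <- words != ""
result <- data.frame(word = words, is_positive, is_negative)[keep, ]
print(result)
```

If a logical flag is preferred, `words %in% pos_words` gives TRUE or FALSE directly, but the scoring function in this recipe expects the NA-based output of match().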
In step 9, we define a scoring function. Our scoring function assigns 1 to every positive word, -1 to every negative word, and 0 to words found in neither lexicon. Quite linear, isn't it?
In step 10, we apply the function. We now apply the defined function to our real data, obtaining a new score attribute, which holds 1 for positive words and -1 for negative words.
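The following sketch shows the scoring function applied with mapply() to three made-up match results. The function body mirrors the recipe; the input vectors and the vectorized ifelse() alternative are assumptions added for comparison.

```r
# Scoring function as defined in the recipe
sentiment_scorer <- function(pos_match, neg_match) {
  if (is.na(pos_match) && is.na(neg_match)) {
    0
  } else if (is.na(pos_match)) {
    -1
  } else {
    1
  }
}

# Made-up match() results: a positive, a neutral, and a negative word
is_positive <- c(2L, NA, NA)
is_negative <- c(NA, NA, 1L)

# mapply() calls the function once per pair of elements
score <- mapply(sentiment_scorer, is_positive, is_negative)

# Vectorized alternative giving the same result without mapply()
score_vec <- ifelse(!is.na(is_positive), 1,
             ifelse(!is.na(is_negative), -1, 0))
print(score)
```

Because if() only handles a single value at a time, mapply() is needed to walk through the two columns in parallel; the ifelse() form avoids the loop altogether.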
In step 11, we compute a final score by multiplying the score by the number of retweets. Since we want to take into consideration the number of retweets of a single tweet, we define a new function. In the case of the absence of retweets, this function leaves the score untouched, while where retweets are found, it will multiply the score by the number of retweets.
Applying this function over the vector of words and scores will result in a new column, named final_score, containing weighted scores.
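A small sketch of the weighting, using made-up retweet counts and scores, shows what the function does in each case.

```r
# Weighting function as defined in the recipe
popularity_scorer <- function(rt, basic_score) {
  if (rt == 0) basic_score else rt * basic_score
}

# Made-up retweet counts and basic word scores
rt    <- c(0, 1, 5)
score <- c(1, -1, -1)

final_score <- mapply(popularity_scorer, rt, score)
print(final_score)
```

As written, a word from a tweet with exactly one retweet gets the same weight as a word from a tweet with none. If each retweet should count as one additional tweet, a variant such as `(rt + 1) * basic_score` would express that; this variant is a suggestion and not part of the recipe.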
You may be asking, "So, the sentiment of a tweet increases if it gets retweeted?"
Well, we are actually measuring the sentiment around a topic, not a single tweet sentiment (and the topic is the one defined in step 2).
We therefore count 1 for each positive tweet and -1 for each negative tweet.
If a tweet got retweeted, we consider it as another positive/negative tweet on that topic, therefore adding +1/-1.
This is actually a distinctive point of this analysis, since usually data mining activities around downloaded tweets make it impossible to take retweets into account.
In step 12, we show the results. This step performs two tasks:
It creates the total_df data frame, where the sum of scores is obtained for each word (aggregating repeated words).
It draws a word cloud from total_df, sizing each word by the absolute value of its summed score.
In step 13, we look at the total sentiment score. Since the scores of positive words are positive numbers and the scores of negative words are negative numbers, we can sum up all scores and understand whether the general mood around the given search key is positive or negative.
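The aggregation and the final sum can be tried on a few made-up rows. The sketch below leaves out the word cloud and only shows how aggregate() names its output columns and how the total is obtained.

```r
# Made-up word scores; "great" appears in two different tweets
word_df <- data.frame(
  word        = c("great", "bad", "great", "package"),
  final_score = c(3, -1, 1, 0),
  stringsAsFactors = FALSE
)

# One row per distinct word; the columns are named Group.1 and x
total_df <- aggregate(word_df$final_score, list(word_df$word), sum)
print(total_df)

# Overall sentiment around the topic
total_sentiment <- sum(word_df$final_score)
print(total_sentiment)
```

A positive total suggests a generally positive mood around the search key, and a negative total suggests the opposite.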