Performing a Twitter sentiment analysis

Twitter sentiment analysis is another powerful tool in the text analytics toolbox.

With sentiment analysis, we can analyze the mood expressed within a text.

In this recipe, we will download tweets relating to "data science with R" and perform a sentiment analysis on them, employing the bag-of-words technique.

The main feature of this technique is that it does not try to understand the meaning of the analyzed text; it just looks at one word at a time and checks whether that word expresses a positive or negative sentiment.

Our example will therefore result in computing the overall sentiment around the topic, algebraically summing up the sentiment score of every single word in our bag.
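
As a minimal sketch of the bag-of-words idea, the following base R snippet scores a made-up sentence against two tiny lexicons. The names toy_pos and toy_neg and their contents are hypothetical and are not the lexicon files used later in this recipe:

```r
# Hypothetical toy lexicons, for illustration only
toy_pos <- c("great", "love", "useful")
toy_neg <- c("boring", "hard")

sentence <- "i love r but statistics is hard and this course is great"

# Split on whitespace and look at one word at a time
words <- strsplit(sentence, "\\s+")[[1]]

# +1 for each positive word, -1 for each negative word, 0 otherwise
word_scores <- (words %in% toy_pos) - (words %in% toy_neg)

# The overall sentiment is the algebraic sum of the word scores
overall <- sum(word_scores)
overall
```

Two positive words and one negative word leave an overall score of 1, which we would read as a mildly positive mood.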

Getting ready

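This recipe relies on functions from several packages: twitteR for downloading tweets, stringr for string manipulation, tm for text analytics activities, reshape2 for reshaping data, and wordcloud (together with RColorBrewer) for the final plot.

We therefore need to install and load those packages:

install.packages(c("twitteR","stringr","tm","reshape2","wordcloud","RColorBrewer"))
library(twitteR)
library(stringr)
library(tm)
library(reshape2)      # provides melt()
library(wordcloud)     # provides wordcloud()
library(RColorBrewer)  # provides brewer.pal()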

How to do it...

  1. Set up a Twitter session.

    Refer to Chapter 1, Acquiring Data for Your Project, and the Getting data from Twitter with the twitteR package recipe for further details on how to set up a Twitter application:

    setup_twitter_oauth(consumer_key    = 'xxxx',
      consumer_secret  = 'xxxx',
      access_token     = 'xxxx',
      access_secret    = 'xxxx')
    
  2. Download tweets pertaining to a specific query:
    tweet_list <- searchTwitter('"data science with R"', n = 20)
    
  3. Create a data frame with the downloaded tweets:
    tweet_df   <-  twListToDF(tweet_list)
    
  4. Define positive and negative words:
    pos_words = read.csv("lexicon/positive.txt",header = FALSE,stringsAsFactors=FALSE)
    pos_words <- c(pos_words)
    pos_words <- unlist(pos_words)
    pos_words <- unname(pos_words)
    pos_words <- tolower(pos_words)
    neg_words = read.csv("lexicon/negative.txt",header = FALSE,stringsAsFactors=FALSE)
    neg_words <- c(neg_words)
    neg_words <- unlist(neg_words)
    neg_words <- unname(neg_words)
    neg_words <- tolower(neg_words)
    
  5. Extract tweet text from the tweets database:
    tweets <- tweet_df[,1]
    
  6. Clean up tweets with the gsub() function and a regular expression:
    tweets <- gsub('[[:punct:]]', '', tweets)
    tweets <- gsub('[[:cntrl:]]', '', tweets)
    tweets <- gsub('\\d+', '', tweets)  # the backslash must be doubled inside an R string
    tweets <- gsub("RT", '', tweets)
    tweets <- tolower(tweets)  # lowercase so words can match the lowercase lexicon and stopword list
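    # The pattern on this line was lost to an encoding error in the source;
    # stripping non-ASCII characters is an assumed stand-in, not the original call
    tweets <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")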
    
  7. Remove stopwords:
    tweets <- removeWords(tweets,stopwords(kind = "en"))
    
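  8. Split tweets into words using the str_split_fixed() function from the stringr package, then reshape the result so that each row holds a single word.

    We set n = 140, the maximum number of characters in a tweet, which is a safe upper bound on the number of words per tweet:

    word_df    <- str_split_fixed(tweets, '\\s+', n = 140)
    word_df    <- data.frame(word_df, "RT" = tweet_df[,12])
    word_count <- melt(word_df, id.vars = "RT")
    word_df    <- data.frame("word" = word_count[,3],
                             "RT"   = word_count[,1])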
    
  9. Match each word with its lexicon:
    word_df$is_positive <- match(unlist(word_count[,3]), pos_words)
    word_df$is_negative <- match(unlist(word_count[,3]),neg_words)
    
  10. Remove blank rows:
    word_df <- subset(word_df,word_df[,1] != "")
    
  11. Define the scoring function:
    sentiment_scorer  <- function(pos_match,neg_match) {
      if (is.na(pos_match) && is.na(neg_match)) {0}
      else {
        if(is.na(pos_match) && is.na(neg_match) == FALSE){-1} else
        {1}
      }
      
    }
    
  12. Apply the sentiment_scorer() function:
    word_df <- data.frame(word_df,score = mapply(sentiment_scorer,word_df$is_positive,word_df$is_negative))
    
  13. Compute a final score by multiplying the score by the number of retweets:
    popularity_scorer <- function(rt,basic_score) {
      if(rt == 0){basic_score}
      else{rt * basic_score}
    }
    
    word_df$final_score <- mapply(popularity_scorer,word_df$RT,word_df$score)
    
  14. Show the results:
    total_df <- aggregate(word_df$final_score,list(word_df$word),sum)
    cloud    <- wordcloud(total_df$Group.1,abs(total_df$x),scale=c(10,.20),colors=brewer.pal(10,"Spectral"))
    

    Find out the total sentiment score:

    total_sentiment <- sum(word_df$final_score)
    

How it works...

In step 4, we define positive and negative words. This step involves reading the positive.txt and negative.txt files into the R environment and manipulating them in order to produce a plain character vector, such as the following one:

> head(pos_words)
[1] "a+"         "abound"     "abounds"    "abundance"  "abundant" "accessable"

Since read.csv() returns a data frame, we first convert it into a list with c(), then flatten that list with unlist() and drop the element names with unname().

A final touch is added, changing all capital letters to lowercase in order to ensure comparability with tweet text.
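
The following sketch walks through the same conversion on a stand-in for a lexicon file. The text in lexicon_text is hypothetical, and read.csv() reads it through its text argument so that no file is needed:

```r
# Stand-in for a lexicon file with one word per line (hypothetical content)
lexicon_text <- "Good\nGreat\nHappy"

raw <- read.csv(text = lexicon_text, header = FALSE, stringsAsFactors = FALSE)
class(raw)                 # a data frame with a single column, V1

as_list <- c(raw)          # c() drops the data frame class, leaving a list
flat    <- unlist(as_list) # a named character vector
words   <- tolower(unname(flat))
words
```

The result is an unnamed, lowercase character vector, which is the shape that match() expects later in the recipe.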

In step 6, we clean up tweets with gsub() and regular expressions. In this step, we remove punctuation, control characters, digits, and other specific strings from our tweets by applying gsub() to them repeatedly.

This function requires three arguments: the pattern to look for, the replacement text (an empty string here, since we want to delete the matches), and the object in which to look for the pattern.
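
To see the effect of these calls, here is a small sketch on a made-up tweet. The sample text is hypothetical, and the final whitespace cleanup is an addition for readability that is not part of the recipe:

```r
sample_tweet <- "RT @user: Loving #rstats!!! 100% worth it :)"

cleaned <- gsub("[[:punct:]]", "", sample_tweet)  # drop punctuation
cleaned <- gsub("\\d+", "", cleaned)              # drop digits
cleaned <- gsub("RT", "", cleaned)                # drop the retweet marker

# Not in the recipe: collapse repeated spaces and trim the ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

Note that the backslash in "\\d+" is doubled, because a single backslash followed by d is not a valid escape in an R string.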

In step 7, we remove stopwords. Stopwords are common words in a language, such as "and", "or", and "even". Since they add little value to the text in terms of comprehension, they are usually removed.

If you want to have a look at those words, you just need to run the following command:

> head(stopwords(kind = "en"))
[1] "i"      "me"     "my"     "myself" "we"     "our" 
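
To get a feel for what stopword removal does, the following base R sketch deletes whole-word matches from a made-up tweet. It is a rough analogue rather than tm's own implementation, and toy_stopwords is a hypothetical list that is much shorter than the one returned by stopwords():

```r
# Hypothetical short stopword list, for illustration only
toy_stopwords <- c("i", "the", "and", "is")

tweet <- "i think the course is clear and useful"

# Build one pattern that matches any stopword as a whole word
pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
no_stop <- gsub(pattern, "", tweet, perl = TRUE)

# Tidy the gaps left behind by the removed words
no_stop <- trimws(gsub("\\s+", " ", no_stop))
no_stop
```

The word boundaries matter here: without them, the "i" inside "think" would be removed as well.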

In step 8, we split tweets into words. Using the str_split_fixed() function, we split our tweets into separate words in order to apply text analytics techniques on them, like the ones seen in the previous recipes.

This function requires that you specify two main arguments:

  • The string to split
  • The pattern to look for in order to define the splitting points

After applying this function to all our tweets, we have a character matrix with the following structure, where shorter tweets are padded with empty strings:

  • First word; second word; third word
  • First word; second word; ""
  • First word; ""; ""

Here, each row corresponds to a tweet.

Since we will use the number of retweets in order to compute our final sentiment score, we will now add this information to the data frame with the following line of code:

word_df <- data.frame(word_df,"RT" = tweet_df[,12])

Once we do this, the data frame will look like this:

  • First word; second word; third word; 20
  • First word; second word; ""; 14
  • First word; ""; ""; 2

This is not what we need yet, since our minimum object of analysis is the single word.

What we are looking for is actually a tidy dataset, like the ones introduced in Chapter 2, Preparing for Analysis – Data Cleansing and Manipulation, where each row stores an observation.

In order to obtain this kind of dataset, we apply the melt() function, which stacks all the word columns into a single column and repeats, for each word, the number of retweets received by the tweet it came from.
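
The shape of the result can be sketched in base R. This example swaps str_split_fixed() and melt() for strsplit() and rep(), so it produces no padding cells, and the tweets and retweet counts are made up:

```r
# Two made-up tweets with their (hypothetical) retweet counts
tweets   <- c("great useful course", "boring slides")
retweets <- c(3, 0)

# One character vector of words per tweet
word_list <- strsplit(tweets, "\\s+")

# Long format: one row per word, with the tweet's retweet count repeated
long_df <- data.frame(
  word = unlist(word_list),
  RT   = rep(retweets, times = lengths(word_list)),
  stringsAsFactors = FALSE
)
long_df
```

Each of the five words gets its own row, and the three words of the first tweet all carry its retweet count of 3.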

In step 9, we match each word with the lexicons. In this step, we add two new columns to the word_df data frame:

  • The is_positive column, which holds the position of the word within the pos_words vector of positive words, or NA if the word is not found there.
  • The is_negative column, which holds the position of the word within the neg_words vector, or NA if the word is not found there.
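
A short sketch shows what match() returns. The toy_pos lexicon and the words being looked up are hypothetical:

```r
toy_pos <- c("good", "great")          # hypothetical positive lexicon
words   <- c("great", "data", "good")

# Position of each word in the lexicon, or NA when it is absent
hits <- match(words, toy_pos)
hits

# A logical flag can be derived from the positions if needed
found <- !is.na(hits)
found
```

Because a missing word yields NA rather than FALSE, the scoring function later in the recipe tests the columns with is.na().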

In step 10, we remove the blank rows, which come from the empty padding cells created when splitting the tweets and carry no word to score.

In step 11, we define a scoring function. Our scoring function assigns 1 to every positive word, -1 to every negative word, and 0 to words found in neither lexicon. Quite linear, isn't it?

In step 12, we apply the function. We now apply the defined function to our real data, obtaining a new score attribute, which holds 1 for positive words, -1 for negative words, and 0 for all the others.
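
The sketch below applies the recipe's scoring function to a few hypothetical match results and compares it with a vectorized ifelse() version. The ifelse() variant is an alternative that we suggest here, not part of the recipe:

```r
# The scoring function from the recipe
sentiment_scorer <- function(pos_match, neg_match) {
  if (is.na(pos_match) && is.na(neg_match)) {
    0
  } else if (is.na(pos_match)) {
    -1
  } else {
    1
  }
}

# Hypothetical match() results for four words
is_positive <- c(2L, NA, NA, 5L)
is_negative <- c(NA, 7L, NA, 1L)

score_mapply <- mapply(sentiment_scorer, is_positive, is_negative)

# Suggested vectorized equivalent: positive matches take precedence
score_ifelse <- ifelse(!is.na(is_positive), 1,
                       ifelse(!is.na(is_negative), -1, 0))

score_mapply
score_ifelse
```

Both versions give the same scores. Note that a word found in both lexicons, like the fourth one here, is scored as positive.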

In step 13, we compute a final score by multiplying the score by the number of retweets. Since we want to take the number of retweets of each tweet into account, we define a new function, popularity_scorer(). When a tweet has no retweets, this function leaves the score untouched; otherwise, it multiplies the score by the number of retweets.

Applying this function over the vector of words and scores will result in a new column, named final_score, containing weighted scores.
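
As a small sketch of this weighting, here are the same rules written as a vectorized ifelse() call on made-up scores and retweet counts. The ifelse() form is a suggested equivalent of the mapply() call in the recipe:

```r
# Hypothetical word scores and retweet counts of the tweets they came from
score <- c(1, -1, 1, 0)
rt    <- c(0, 4, 2, 10)

# Leave the score untouched when there are no retweets,
# otherwise multiply it by the retweet count
final_score <- ifelse(rt == 0, score, rt * score)
final_score
```

A neutral word stays at 0 no matter how often its tweet was retweeted, so only words found in a lexicon contribute to the total.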

You may be asking, "So, the sentiment of a tweet increases if it gets retweeted?"

Well, we are actually measuring the sentiment around a topic, not a single tweet sentiment (and the topic is the one defined in step 2).

We therefore count +1 for each positive word and -1 for each negative word.

If a tweet was retweeted, we treat every retweet as a further positive or negative mention of the topic, which is why the word scores are multiplied by the retweet count.

This is actually a distinctive point of this analysis, since usually data mining activities around downloaded tweets make it impossible to take retweets into account.

In step 14, we show the results. This step performs two tasks:

  • It creates a final data frame, total_df, where the sum of scores is obtained for each word (aggregating repeated words).
  • It plots a word cloud where the size of each word is related to the absolute value of the final score of that word.

Finally, still in step 14, we look at the total sentiment score. Since the scores of positive words are positive numbers and the scores of negative words are negative numbers, we can sum up all the scores and understand whether the general mood around the given search key is positive or negative.
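
The aggregation and the total can be sketched on a small, made-up word_df. The words and scores below are hypothetical:

```r
# Hypothetical weighted scores, with "great" appearing twice
word_df <- data.frame(
  word        = c("great", "boring", "great", "useful"),
  final_score = c(3, -2, 1, 1),
  stringsAsFactors = FALSE
)

# One row per distinct word, summing the scores of repeated words;
# aggregate() names the resulting columns Group.1 and x
total_df <- aggregate(word_df$final_score, list(word_df$word), sum)
total_df

# Overall sentiment around the topic
total_sentiment <- sum(word_df$final_score)
total_sentiment
```

Here "great" ends up with a score of 4, and the total of 3 points to a positive mood overall.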
