Twitter data mining

Now that we have tested our tools, libraries, and connections to Twitter APIs, the time has come to begin our search for the hidden treasures in Twitter land. Let's wear our data miner's cap and start digging!

In this section, we will be working on Twitter data gathered by searching keywords (or hashtags in Twitter vocabulary) and from user timelines. Using this data, we will uncover some interesting insights while using different functions and utilities from twitteR and other R packages.

Note

Please note that our process will implicitly follow the steps outlined for data mining. In the spirit of brevity, we might take the liberty of not mentioning each step explicitly. We are mining for some gold-plated insights; rest assured nothing is skipped!

Every year, we begin with a new zeal to achieve great feats and improve upon our shortcomings. Most of us make promises to ourselves in the form of New Year's resolutions. Let us explore what tweeple are doing with their resolutions in 2016!

Note

Twitter data changes very rapidly, and your results/plots may vary from the ones depicted in this chapter.

We will use the same app and its credentials to connect and tap into Twitter for data. The following code works in exactly the same way as the code we used to extract sample tweets in the previous section:

library(twitteR)
library(ggplot2)
library(stringr)
library(tm)
library(wordcloud)

consumerSecret = "XXXXXXXXX"
consumerKey = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret)

Apart from connecting to Twitter, we have also loaded the required packages, such as ggplot2, stringr, tm, and wordcloud. We will see where and how these packages are useful as we proceed.

Once connected to our data source, we can proceed towards collecting the required data. Since we are planning to learn about tweeple and their New Year's resolutions, we will extract data for the hashtag #ResolutionsFor2016. We could equally use other hashtags, such as #NewYearResolutions or #2016Resolutions, or a combination of hashtags, to gather relevant tweets. The following piece of code not only extracts tweets but also converts the list of tweet/status objects into an R data frame. We also convert each tweet to UTF-8 to handle text from different languages.

Note

Amazing fact: Twitter is available in 48 different languages and counting!

# trending tweets
trendingTweets = searchTwitter("#ResolutionsFor2016",n=1000)
trendingTweets.df = twListToDF(trendingTweets)
trendingTweets.df$text <- sapply(trendingTweets.df$text,function(x) iconv(x,to='UTF-8'))

As we saw in the previous section, a tweet contains far more information than mere text. One of its attributes is the status source. The status source denotes the device from which the tweet was made: a mobile phone, a tablet, the Web, and so on. Before we apply the major transformations and cleanup to the tweet objects, we apply a quick transformation to convert the status source into a more meaningful form:

trendingTweets.df$tweetSource = sapply(trendingTweets.df$statusSource, function(sourceSystem) encodeSource(sourceSystem))

The preceding code transforms statusSource from values such as <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> to simply Android and assigns it to a new attribute named tweetSource.
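The encodeSource helper is not part of the twitteR package and is not defined in this section. The following is a minimal sketch of what such a helper could look like (an assumption on our part; adapt it to the source strings you actually see in your data):

# hypothetical helper (not from twitteR): strip the HTML anchor in statusSource
# and keep a short, readable device label such as "Android" or "iPhone"
encodeSource <- function(sourceSystem) {
  label <- gsub("<[^>]+>", "", sourceSystem)  # e.g. "Twitter for Android"
  label <- sub("^Twitter for ", "", label)    # e.g. "Android", "iPhone", "iPad"
  if (grepl("Web", label)) "Web" else label   # collapse web client variants
}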

Once we have the data, the next set of steps in the data mining process is to clean it up. We use the text mining package tm to perform the transformations and cleanup. The Corpus function in particular helps us handle tweet/status objects as a collection of documents. We then use the tm_map utility from the same package to apply/map transformations such as converting all text to lower case and removing punctuation, numbers, and stop words. Stop words are the most commonly used words in a language, such as a, an, and the, which can usually be removed while analyzing text without loss of meaning.

# transformations
tweetCorpus <- Corpus(VectorSource(trendingTweets.df$text))
# note: on newer versions of tm (>= 0.6) plain functions such as tolower may need
# to be wrapped, e.g. tm_map(tweetCorpus, content_transformer(tolower))
tweetCorpus <- tm_map(tweetCorpus, tolower)
tweetCorpus <- tm_map(tweetCorpus, removePunctuation)
tweetCorpus <- tm_map(tweetCorpus, removeNumbers)

# remove URLs (wrap removeURL in content_transformer() on newer tm versions)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
tweetCorpus <- tm_map(tweetCorpus, removeURL)

# remove stop words
twtrStopWords <- c(stopwords("english"),'resolution','resolutions','resolutionsfor','resolutionsfor2016','2016','new','year','years','newyearresolution')
tweetCorpus <- tm_map(tweetCorpus, removeWords, twtrStopWords)

tweetCorpus <- tm_map(tweetCorpus, PlainTextDocument)

The final transformation before we proceed to the next step of analyzing our data for hidden patterns/insights is the term-document matrix. As the name says, a term-document matrix is a matrix representation in which terms act as rows and documents as columns. Each entry in this matrix represents the number of occurrences of a term in a given document. More formally, a term-document matrix describes the frequency of terms in a collection of documents. This representation is extremely useful in natural language processing applications; it is an optimized data structure that enables quick searches, topic modeling, and more. The data structure can be explained using the following simple example, where we have two text documents, TD1 and TD2:

Sample term-document matrix
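To make this concrete, here is a tiny, self-contained illustration mirroring the TD1/TD2 example above (the documents and their text are made up purely for illustration):

# two toy documents (hypothetical text, for illustration only)
toyDocs <- c("gym diet gym fitness",
             "diet health health health")

toyCorpus <- Corpus(VectorSource(toyDocs))
inspect(TermDocumentMatrix(toyCorpus))
# one row per term and one column per document: "gym" counts 2 in the first
# document and 0 in the second, "health" counts 0 and 3, and so on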

The tm package provides another easy-to-use utility, TermDocumentMatrix (a DocumentTermMatrix flavor is also available), which we use to convert our corpus object into the required form:

# Term Document Matrix
> twtrTermDocMatrix <- TermDocumentMatrix(tweetCorpus, control = list(minWordLength = 1))

Frequent words and associations

The term-document matrix thus prepared contains words from each of the tweets (post cleanup and transformations) as rows, while the columns represent the tweets themselves.

As a quick check, let us see which words are most frequently used in our dataset. Let the threshold be 30 occurrences or more. We use the apply utility to iterate over each term in our term-document matrix and sum its occurrences, and then use which to keep the terms that appear 30 times or more.

# Terms occurring 30 or more times
> which(apply(twtrTermDocMatrix,1,sum)>=30)

The result will be as shown in the following screenshot:

Terms with 30 or more occurrences across tweets

As the preceding screenshot shows, words such as healthy, inspire, and positivity feature in the list of words with 30 or more occurrences. We all have a lot in common when it comes to yearly goals, no?

The preceding manipulation was a quick check to see whether we really have tweets that can tell us something interesting about New Year's resolutions. Let us now take a more formal approach and identify frequent terms in our data set. We will also try to present the information in a creative yet easy-to-understand representation. To get the most frequent terms in our data set, we again use the findFreqTerms function from the tm package. This function provides an abstraction over quick hacks such as the one we just used, and it lets us set minimum and maximum thresholds for term frequencies. For our case, we will only specify the lower bound and see the results:

# print the frequent terms from termdocmatrix
> (frequentTerms<-findFreqTerms(twtrTermDocMatrix,lowfreq = 10))

The results look something like the following screenshot:

We get about 107 terms with a minimum occurrence of 10. If you look carefully, the terms we saw with frequencies of at least 30 also appear in this list, and rightly so.

Now that we are certain that there are terms/words with occurrences of more than 10, let us create a data frame and plot the terms versus their frequencies as we decided previously. We use the rowSums function to calculate the total occurrence of each term/word. We then pick a subset of terms which have more than 10 occurrences and plot them using ggplot:

# calculate frequency of each term
term.freq <- rowSums(as.matrix(twtrTermDocMatrix))

# picking only a subset
subsetterm.freq <- subset(term.freq, term.freq >= 10)


# create data frame from subset of terms
frequentTermsSubsetDF <- data.frame(term = names(subsetterm.freq), freq = subsetterm.freq)

# create data frame with all terms
frequentTermsDF <- data.frame(term = names(term.freq), freq = term.freq)

# sort by subset DataFrame frequency
frequentTermsSubsetDF <- frequentTermsSubsetDF[order(-frequentTermsSubsetDF$freq), ]

# sort by complete DataFrame frequency
frequentTermsDF <- frequentTermsDF[order(-frequentTermsDF$freq), ]

# words by frequency from subset data frame
ggplot(frequentTermsSubsetDF, aes(x = reorder(term,freq), y = freq)) + geom_bar(stat = "identity") +xlab("Terms") + ylab("Frequency") + coord_flip()

The preceding piece of code generates the following frequency graph:

Upon analyzing the preceding graph, we can quickly get some interesting points:

  • The words mom, elected, president, and trillionaire feature in the top 10. Strange set, yet interesting. More on this in a bit.
  • Health features high in the list but doesn't make it to the top 10; so health is on the cards, just not at the very top. The same goes for fitness and diet.
  • Most of the words in this list are positive in essence. Words such as happy, hope, positivity, and change all point to the upbeat mood in which people take up New Year's resolutions!

Though the preceding graph gives us quite a lot of information about the words and their frequencies in a nice layout, it still doesn't show the full picture. Remember that we crafted a subset of terms from our data set before generating this graph? We did that on purpose; otherwise, the graph would have become too long, and words with lower frequencies would have cluttered the whole thing. Another thing this graph misses is the relative difference between the frequencies.

If our aim is to see the relative difference between the frequencies, we need a different visualization altogether. Here the word cloud comes to the rescue. Using the wordcloud package, we can easily generate a word cloud from a data frame with a one-liner:

# wordcloud
> wordcloud(words=frequentTermsDF$term, freq=frequentTermsDF$freq,random.order=FALSE)

The wordcloud using the complete data frame looks something like this:

The preceding word cloud renders words in decreasing order of frequency. The size of each word emphasizes its frequency. You can play around with the wordcloud function to generate some interesting visualizations or even art!
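For instance, one possible variation (assuming the RColorBrewer package is installed) caps the number of words, tightens the size scaling, and adds a color palette:

# a possible variation: at most 50 words, smaller size range, colored palette
library(RColorBrewer)
wordcloud(words = frequentTermsDF$term, freq = frequentTermsDF$freq,
          max.words = 50, scale = c(3, 0.5),
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))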

A lot of words appear in the preceding graphs, but isn't it rather interesting to see the word trillionaire pop up in the top 10? What could be the reason for it? Was it a spam post by a bot, a tweet by some celebrity that went viral, or something else altogether? Let's check out the top tweet in this list and see if it contains the word trillionaire:

# top retweets
> head(subset(trendingTweets.df$text, grepl("trillionaire",trendingTweets.df$text) ),n=1)

The following screenshot is what you get:

It turns out that our hunch was right: it was a New Year's resolution tweet by a celebrity that went viral. A quick search on Twitter reveals the tweet.

A bit of further searching reveals that Misha Collins is a well-known actor from the television series Supernatural. We can also see that the resolution was retweeted a staggering 5k times! Interestingly, the number of likes stands at 14k, outnumbering the retweets. Can we infer that tweeple prefer likes/hearts to retweets? It also becomes clear why words such as mom, learn, trillionaire, elected, and president turn up among the most frequent words. Indirectly, we can infer that Supernatural has a huge fan following on Twitter and that Castiel (Misha's role in the TV series) is a popular character from the show. A bit of a surprise is his resolution to learn to crochet, no?
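If you want to check this hunch against our sample, the data frame returned by twListToDF should already carry favoriteCount alongside retweetCount, so a rough comparison takes only a couple of lines (a quick sanity check, not a rigorous analysis):

# total likes (favorites) versus total retweets in our sample
sum(trendingTweets.df$favoriteCount, na.rm = TRUE)
sum(trendingTweets.df$retweetCount, na.rm = TRUE)

# per-tweet ratio for tweets that were retweeted at least once
with(subset(trendingTweets.df, retweetCount > 0),
     summary(favoriteCount / retweetCount))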

Moving on from supernatural stuff, let us go back to the fitness debate. Fitness is important to most of us. Activities such as exercising or hitting the gym see a surge during the initial months/weeks of the year. Let's see how health-conscious our friends on Twitter are!

Since a lot of words such as health, diet, fitness, gym, and so on point towards a healthy lifestyle, let us try and find words associated with the word fitness itself. findAssocs is a handy function which helps us find words from a term-document matrix that have at least a specified level of correlation to a given word. We will use the output from this function to prepare a term-association (correlation) graph using ggplot. The process is similar to how we prepared the preceding frequency graph:

# Associations
(fitness.associations <- findAssocs(twtrTermDocMatrix,"fitness",0.25))

fitnessTerm.freq <- rowSums(as.matrix(fitness.associations$fitness))

fitnessDF <- data.frame(term=names(fitnessTerm.freq),freq=fitnessTerm.freq)

fitnessDF <- fitnessDF[order(-fitnessDF$freq), ]

ggplot(fitnessDF, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity") + xlab("Terms") +
  ylab("Associations") + coord_flip()

The words most closely correlated to the word fitness are as follows:

The same data is more readable in graphical form, as follows:

As is evident from the preceding graph, terms such as lossweight, workout, getfit, and so on prove our point that tweeple are as concerned about health as we are. It is also interesting to note the occurrence of the term yogavideos in this list; it looks like yoga is catching up with other techniques of staying fit in 2016. There's meditation on the list too.

Popular devices

So far, we have dealt with the visible components of a tweet, such as the text, retweet counts, and so on, and we were able to extract many interesting insights. Let us take out our precision tools and dig deeper into our data.

As mentioned a couple of times in the preceding sections, a tweet carries far more information than what meets the eye. One such piece of information is the source of the tweet. Twitter was born in the SMS era, and many of its characteristics, such as the 140-character limit, are reminiscent of that era. It would be interesting to see how tweeple use Twitter, that is, which devices they use most frequently to access and post on Twitter. Though the world has moved a long way from the SMS era, mobile phones are ubiquitous. To get this information, we will make use of the tweetSource attribute from our data frame trendingTweets.df. We created this additional attribute from the statusSource attribute already present in the tweet object (see the beginning of this section for a quick recap).

We shall use a subset of the data frame trendingTweets.df based upon retweet counts for the sake of clarity. We will use ggplot again to visualize our results.

# Source by retweet count
trendingTweetsSubset.df <- subset(trendingTweets.df, trendingTweets.df$retweetCount >= 5000 )

ggplot(trendingTweetsSubset.df, aes(x = tweetSource, y = retweetCount/100)) + geom_bar(stat = "identity") + xlab("Source") + ylab("Retweet Count")

The following plot is your result:

Without a doubt, the iPhone is the most preferred device, followed by Android and the Web. It is interesting to see that people use the Web/website to retweet more than the iPad! Windows Phone clearly has some serious issues to tackle here. Can we also infer that the iPhone is the preferred device amongst tweeple? Or that the iPhone provides a better Twitter experience than any other device? We could even go deeper and say that Twitter on the iPhone has an easier-to-access retweet button than on any other device. Inferences such as these require a bit more digging than this, but they carry a lot of potential that management, UX teams, and so on could use to improve and change things around.
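If you want the raw counts behind the bars, a one-line tabulation of the source attribute (on the same subset used for the plot) is enough:

# raw counts per source for the highly retweeted subset
sort(table(trendingTweetsSubset.df$tweetSource), decreasing = TRUE)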

Hierarchical clustering

We have seen clustering and classification in previous chapters (see Chapter 2, Let's Help Machines Learn) and uncovered some interesting facts about the data at hand. For our current use case, even though our tweets are all related to 2016 resolutions, we can never be sure of the kinds of resolutions tweeple make. This makes it a very apt use case for hierarchical clustering. Unlike k-means and other clustering algorithms that require a preset number of clusters before computation, hierarchical clustering algorithms do not need one.

Let us take this opportunity to understand hierarchical clustering before we apply it to our data. Hierarchical clustering, like any other clustering algorithm, helps us group similar items together. In general, the algorithm proceeds as follows:

  • Initialize: This is the first step, where each element is assigned to a cluster of its own. For a dataset containing n elements, the algorithm creates n different clusters with one element in each of them. A distance/similarity measure is decided at this step.
  • Merge: During this step, depending upon the distance/similarity measure chosen, the closest pair of clusters is identified and merged into a single cluster. This step reduces the total number of clusters by one.
  • Compute/recompute: We compute/recompute distances/similarities between the new cluster formed in the Merge step and the existing clusters.

The merge and compute steps are repeated until we are left with a single cluster containing all n items. As the name suggests, this algorithm generates a hierarchical structure with the leaves denoting individual elements as clusters combined based upon similarity/distance as we go toward the root of the tree. The output tree is generally referred to as a dendrogram.

The merge step is where variations of this algorithm exist, since there are several ways in which the closest clusters can be identified. These range from simple methods, such as single-link, which takes the shortest distance between any two elements of the two clusters under consideration as the distance measure, to more involved ones, such as Ward's method, which uses variance to find the most compact clusters; which one to employ depends upon the use case.
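The following toy illustration (the points are made up) shows how the linkage choice alone changes the dendrogram for the same data; we will pick one of these methods for our tweets shortly:

# five made-up one-dimensional points, clustered with two different linkages
toyPoints <- c(1, 1.2, 5, 5.3, 9)
toyDist <- dist(toyPoints)

par(mfrow = c(1, 2))
plot(hclust(toyDist, method = "single"), main = "single-link")
plot(hclust(toyDist, method = "ward.D2"), main = "Ward (ward.D2)")
par(mfrow = c(1, 1))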

Coming back to the Twitter world, let us use hierarchical clustering to see which terms/tweets are the closest. For our current use case, we will use the single (single-link) method as our merge criterion. You may try out different linkage methods, as sketched above, and observe the differences.

To perform hierarchical clustering, we first treat our dataset to remove sparse terms for the sake of clarity. The removeSparseTerms function helps us drop terms that are sparser than a specified threshold, that is, terms that occur in too few tweets. We then use the hclust utility to form the clusters. The output of this utility is directly plottable. Let us write some code for this:

# remove sparse terms
twtrTermDocMatrix2 <- removeSparseTerms(twtrTermDocMatrix, sparse = 0.98)

tweet_matrix <- as.matrix(twtrTermDocMatrix2)

# cluster terms
distMatrix <- dist(scale(tweet_matrix))

fit <- hclust(distMatrix,method="single")
plot(fit)

The output dendrogram is amazingly simple to understand:

If you observe the second cluster from the right, it contains the terms trillionaire, elected, mom, call, and so on. Mapping back to the top retweeted tweet from Misha Collins, all these terms are mentioned in that single tweet, and our algorithm has rightly clustered them together. Smart, isn't it? As a small exercise, observe the other clusters and see how the terms occur in the tweets that contain them (the cutree sketch below can help list the members of each cluster). One important observation to make here is that the dendrogram correctly maps all frequent terms under a single root, which reaffirms that all these terms point to our central theme of 2016 resolutions!
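To list the members of each cluster explicitly rather than reading them off the plot, cutree can cut the dendrogram into a chosen number of groups (six here is an arbitrary choice; adjust k as you explore):

# cut the tree into six clusters and list the terms falling in each
termClusters <- cutree(fit, k = 6)
split(names(termClusters), termClusters)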

Topic modeling

So far, our analysis has been about tweets related to New Year's resolutions from users across the world. We have analyzed tweets related to a topic of our choice. Ignoring spam and other noisy tweets, our data more or less conformed to a single topic. The topic itself constituted a group of words (such as health, trillionaire, fitness, diet, mom, and so on) which broadly describe different resolutions. To broaden our scope of analysis and to discover even more insights, let us touch upon the concept of topic modeling.

Topic modeling is a process of discovering patterns in a corpus of unlabeled text that represents the gist of the corpus. A topic itself may be described as a group of words that occur together to describe a large body of text.

The aim of topic modeling is to automatically identify the underlying themes of a corpus, which makes it useful in applications that need theme-based information retrieval in the absence of known keywords (the exact opposite of how we currently use search engines). For example, wouldn't it be amazing to learn about the relations between two countries from a newspaper's archive by querying the theme relations between country one and country two, rather than searching for a keyword and then following link after link? Following links to discover information is powerful too, but it leaves a lot to be desired.

One of the ways to perform topic modeling is through Latent Dirichlet Allocation (LDA); it is one of the most powerful and widely used models.

LDA was introduced by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in their 2003 paper Latent Dirichlet Allocation; Blei's later survey, Introduction to Probabilistic Topic Models, is an accessible overview. LDA can be defined as a generative model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar. LDA works upon the assumption that documents exhibit multiple topics.

LDA is a probabilistic model and the mathematics of it are fairly involved and beyond the scope of this book. In a nonmathematical way, LDA can be explained as a model/process that helps identify the topics that have resulted in the generation of a collection of documents.

Note

For further reading, refer to Blei's paper.

https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

A blog which explains everything in simple words:

http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/

For our purpose/use case, we can assume LDA as a model/process which helps us to identify the underlying (hidden/latent) topics from a corpus of unlabeled text. Luckily, R abstracts most of the mathematical details in the form of a library called topicmodels.

For the purpose of topic modeling, we shall use a new set of tweets. The International Space Station (ISS) has multiple Twitter handles, and one of them is @ISS_Research, which caters particularly to research-related tweets from the ISS. Let us explore what @ISS_Research is up to these days by analyzing tweets from its timeline and identifying the underlying topics of research at the ISS. For this, we use the same process to extract tweets and perform transformations/cleanup as before. The following snippet of code does this:

# set user handle
atISS <- getUser("ISS_Research")

# extract iss_research tweets
tweets <- userTimeline(atISS, n = 1000)

tweets.df <- twListToDF(tweets)

tweets.df$text <- sapply(tweets.df$text, function(x) iconv(x, to = 'UTF-8'))

# build a corpus and apply the same cleanup (lower case, punctuation, numbers,
# URLs, stop words) as in the previous section before building the matrix
twtrCorpus <- Corpus(VectorSource(tweets.df$text))

# Document Term Matrix
twtrDTM <- DocumentTermMatrix(twtrCorpus, control = list(minWordLength = 1))

Please note that the preceding snippet prepares a document-term matrix, unlike last time where we prepared a term-document matrix.

Once we have tweets in the required format, the LDA utility from the topicmodels package helps us uncover the hidden topics/patterns. The LDA utility requires the number of topics as input along with the document-term matrix. We will try eight topics for now. The following code uses LDA to extract six terms for each of the eight topics:

# topic modeling
library(topicmodels)

# find 8 topics
ldaTopics <- LDA(twtrDTM, k = 8)

#first 6 terms of every topic
ldaTerms <- terms(ldaTopics, 6) 

# concatenate terms
(ldaTerms <- apply(ldaTerms, MARGIN = 2, paste, collapse = ", "))

The list of topics generated using LDA is as follows:

A visual representation would be easier to understand. We can make use of qplot to quickly plot the topics across time on an area chart, as follows:

# first topic identified for every tweet
firstTopic <- topics(ldaTopics, 1)

topics <- data.frame(date=as.Date(tweets.df$created), firstTopic)

qplot(date, ..count.., data=topics, geom="density",fill=ldaTerms[firstTopic], position="stack")+scale_fill_grey()

The generated chart looks like the following screenshot:

Let us now analyze the outputs. The list of terms per topic generated by LDA seems to give us a nice insight into the kind of work/research going on at the ISS. Terms such as mars, microgravity, flower, Cygnus, and so on tell us about the main areas of research or at least the topics about which scientists/astronauts on the ISS are talking. Terms such as stationcdrkelly and astrotimpeake look more like Twitter handles.
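If you want to peek behind these term lists, the topicmodels package also exposes the fitted probabilities through posterior(); for example, the per-tweet topic distribution can be inspected as follows (a quick sketch):

# per-document (per-tweet) topic probabilities: rows are tweets, columns are topics
topicProbabilities <- posterior(ldaTopics)$topics
round(head(topicProbabilities), 3)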

Note

A quick exercise would be to use the current @ISS_Research timeline data and mine for handles, such as stationcdrkelly, to discover more information. Who knows, it may turn out to be a nice list of astronauts to follow!
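A minimal starting point (a sketch using the stringr package we loaded at the beginning) is to tabulate the @-mentions appearing in the raw timeline text:

# pull @-mentions straight out of the raw tweet text
mentions <- unlist(str_extract_all(tweets.df$text, "@\\w+"))
sort(table(tolower(mentions)), decreasing = TRUE)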

The qplot output adds the time dimension to our plain list of topics. Analyzing topics across time helps us understand when a particular research topic was being discussed or when something amazing was announced. Topic two in the list (the fourth one from the top in the graph legend) contains the word flower. Since scientists recently succeeded in blooming orange flowers in space, the preceding graph suggests that the news first broke on Twitter on or around 15th January. A quick look at Twitter/news websites confirms that the news broke by tweet on 18th January 2016… close enough!
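You can cross-check this reading against the raw data; for instance, a quick way to list the dates of the timeline tweets that mention the word flower (a sketch) is:

# dates of @ISS_Research tweets mentioning "flower"
sort(as.Date(tweets.df$created[grepl("flower", tweets.df$text, ignore.case = TRUE)]))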

Tip

Colorful area charts

Try removing the option scale_fill_grey() from qplot to get some beautiful charts that are far easier to read than plain grayscale.

So, finally, we learnt about topic modeling using LDA on data from the ISS and discovered some of the amazing things scientists and astronauts are doing up there in outer space.
