We are now ready to build the wordclouds that will give us a sense of the important words carried in those tweets. A wordcloud extracts the top words from a list of words and lays them out in a plot where the size of each word reflects its frequency: the more frequent the word in the dataset, the bigger its font size in the rendering. We will create wordclouds for the datasets harvested. These datasets cover three very different themes, each with two competing or analogous entities. Our first theme is obviously data processing and analytics, with Apache Spark and Python as our entities. Our second theme is the 2016 presidential election campaign, with two contenders: Hillary Clinton and Donald Trump. Our last theme is the world of pop music, with Justin Bieber and Lady Gaga as the two exponents.
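The frequency-to-size idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name font_sizes, its parameters, and the linear scaling rule are hypothetical and are not the algorithm used by the wordcloud library.

```python
from collections import Counter


def font_sizes(words, top_n=5, min_size=10, max_size=60):
    """Map the top_n most frequent words to font sizes.

    Sizes are scaled linearly against the most frequent word
    (an illustrative rule, not the wordcloud library's own).
    """
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    return {
        word: max(min_size, int(round(max_size * count / float(highest))))
        for word, count in counts
    }


sample = "spark hadoop spark python spark hadoop analytics".split()
sizes = font_sizes(sample)
print(sizes)
```

With this sample, spark occurs three times and receives the largest size, hadoop occurs twice and is scaled down proportionally, and the words that occur once receive the smallest of the three sizes.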
We will illustrate the programming steps by analyzing the Spark-related tweets. We load the data and preview the dataframe:
In [21]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets.csv'
tspark_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')
In [3]:
tspark_df.head(3)
Out[3]:
                   id                      created_at     user_id     user_name  \
0  638818911773856000  Tue Sep 01 21:01:11 +0000 2015  2511247075      Noor Din
1  622142176768737000  Fri Jul 17 20:33:48 +0000 2015    24537879  IBM Cloudant
2  622140453069169000  Fri Jul 17 20:26:57 +0000 2015   515145898   Arno Candel

                                          tweet_text                                   htag  \
0  RT @kdnuggets: R leads RapidMiner, Python catc...                                 [#KDN]
1  Be one of the first to sign-up for IBM Analyti...          [#ApacheSpark, #SparkInsight]
2  Nice article on #apachespark, #hadoop and #dat...  [#apachespark, #hadoop, #datascience]

                                       urls                                               ptxt  \
0                      [://t.co/3bsaTT7eUs]  r leads rapidminer python catches up big data ...
1  [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]  be one of the first to sign up for ibm analyti...
2                      [://t.co/IyF44pV0f3]  nice article on apachespark hadoop and datasci...

              tgrp                 date  user_handles  \
0  [spark, python]  2015-09-01 21:01:11  [@kdnuggets]
1          [spark]  2015-07-17 20:33:48            []
2          [spark]  2015-07-17 20:26:57      [@h2oai]

                                           txt_terms       search_grp
0  r leads rapidminer python catches up big data ...  [spark, python]
1  be one of the first to sign up for ibm analyti...          [spark]
2  nice article on apachespark hadoop and datasci...          [spark]
The wordcloud library we will use is the one developed by Andreas Mueller and hosted on his GitHub account at https://github.com/amueller/word_cloud.
The library requires PIL (short for Python Imaging Library). PIL can be complex to install by hand, but with conda it is a matter of invoking conda install pil. PIL is not yet ported to Python 3.4, so we need to run a Python 2.7 environment to be able to see our wordcloud:
#
# Install PIL (does not work with Python 3.4)
#
an@an-VB:~$ conda install pil
Fetching package metadata: ....
Solving package specifications: ..................
Package plan for installation in environment /home/an/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libpng-1.6.17              |                0         214 KB
    freetype-2.5.5             |                0         2.2 MB
    conda-env-2.4.4            |           py27_0          24 KB
    pil-1.1.7                  |           py27_2         650 KB
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following packages will be UPDATED:

    conda-env: 2.4.2-py27_0 --> 2.4.4-py27_0
    freetype:  2.5.2-0      --> 2.5.5-0
    libpng:    1.5.13-1     --> 1.6.17-0
    pil:       1.1.7-py27_1 --> 1.1.7-py27_2

Proceed ([y]/n)? y
Next, we install the wordcloud library:
#
# Install wordcloud
# Andreas Mueller
# https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py
#
an@an-VB:~$ pip install wordcloud
Collecting wordcloud
  Downloading wordcloud-1.1.3.tar.gz (163kB)
    100% |████████████████████████████████| 163kB 548kB/s
Building wheels for collected packages: wordcloud
  Running setup.py bdist_wheel for wordcloud
  Stored in directory: /home/an/.cache/pip/wheels/32/a9/74/58e379e5dc614bfd9dd9832d67608faac9b2bc6c194d6f6df5
Successfully built wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.1.3
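Before moving on, it can be worth confirming that both packages are importable from the environment the notebook runs in. The following is a minimal sketch: the helper name is_importable is hypothetical, and it assumes that the import names are PIL and wordcloud.

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Assumed import names for the two packages installed above.
for name in ("PIL", "wordcloud"):
    status = "available" if is_importable(name) else "missing"
    print("{0}: {1}".format(name, status))
```

If either line reports missing, the notebook is probably running in a different Python environment from the one where conda and pip installed the packages.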
At this stage, we are ready to invoke the wordcloud program with the generated list of terms from the tweet text.
Let's get started with the wordcloud program by first calling %matplotlib inline to display the wordcloud in our notebook:
In [4]:
%matplotlib inline

We convert the dataframe txt_terms column into a list of words. We make sure every element is converted to the str type to avoid any bad surprises, and we check the list's first four records:

In [11]:
len(tspark_df['txt_terms'].tolist())
Out[11]:
2024
In [22]:
tspark_ls_str = [str(t) for t in tspark_df['txt_terms'].tolist()]
In [14]:
len(tspark_ls_str)
Out[14]:
2024
In [15]:
tspark_ls_str[:4]
Out[15]:
['r leads rapidminer python catches up big data tools grow spark ignites kdn',
 'be one of the first to sign up for ibm analytics for apachespark today sparkinsight',
 'nice article on apachespark hadoop and datascience',
 'spark 101 running spark and mapreduce together in production hadoopsummit2015 apachespark altiscale']
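One caveat with a blanket str() conversion is that a missing value (NaN, which is how pandas represents an empty CSV cell) becomes the literal string 'nan' and would then be counted as a word. The sketch below shows one way to skip such entries; the helper name clean_terms and the sample list are hypothetical, and it works on a plain Python list rather than on the dataframe itself.

```python
import math


def clean_terms(values):
    """Return the values as strings, skipping None, NaN, and blank entries."""
    cleaned = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if text:
            cleaned.append(text)
    return cleaned


# Hypothetical sample mixing valid terms with missing and blank entries.
raw = [
    "spark 101 running spark",
    float("nan"),
    None,
    "   ",
    "nice article on apachespark",
]
terms = clean_terms(raw)
print(terms)
```

In this dataset the two lengths above match at 2024 either way, so the check only matters if some tweets ended up with no terms after preprocessing.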
We first import the Matplotlib and wordcloud libraries:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
From the input list of terms, we create a unified string of terms separated by a whitespace as the input to the wordcloud program. The wordcloud program removes stopwords:
# join tweets to a single string
words = ' '.join(tspark_ls_str)
# create wordcloud
wordcloud = WordCloud(
    # remove stopwords
    stopwords=STOPWORDS,
    background_color='black',
    width=1800,
    height=1400
).generate(words)
# render wordcloud image
plt.imshow(wordcloud)
plt.axis('off')
# save wordcloud image on disk
plt.savefig('./spark_tweets_wordcloud_1.png', dpi=300)
# display image in Jupyter notebook
plt.show()
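To sanity-check what the rendered image should emphasize, the join-then-filter step can be reproduced with the standard library alone. This is a rough sketch, not what the wordcloud library does internally: SAMPLE_STOPWORDS is a hypothetical, deliberately tiny stopword set (the library's STOPWORDS set is much larger), and top_terms is a hypothetical helper.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only.
SAMPLE_STOPWORDS = {"a", "and", "be", "for", "in", "of", "on", "one",
                    "the", "to", "up"}


def top_terms(documents, stopwords, n=2):
    """Join the documents, drop stopwords, and return the n most frequent words."""
    words = " ".join(documents).split()
    kept = [w for w in words if w.lower() not in stopwords]
    return Counter(kept).most_common(n)


# Three of the preprocessed tweets shown earlier in this section.
tweets = [
    "be one of the first to sign up for ibm analytics for apachespark today sparkinsight",
    "nice article on apachespark hadoop and datascience",
    "spark 101 running spark and mapreduce together in production hadoopsummit2015 apachespark altiscale",
]
top = top_terms(tweets, SAMPLE_STOPWORDS)
print(top)
```

The words that come out on top of such a count are the ones we would expect to see drawn largest in the wordcloud.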
Here, we can visualize the wordclouds for Apache Spark and Python. For Spark, the dominant terms are clearly Hadoop, big data, and analytics, while the Python cloud recalls the root of the language's name, Monty Python, with a strong focus on developer, apache spark, and programming, and some hints of java and ruby.
In the following wordclouds, we can also get a glimpse of the words surrounding the candidates in the 2016 US presidential election: Hillary Clinton and Donald Trump. Seemingly, Hillary Clinton is overshadowed by the presence of her opponents Donald Trump and Bernie Sanders, while Trump is heavily centered only on himself:
Interestingly, the word love appears for both Justin Bieber and Lady Gaga. In the case of Bieber, follow and belieber are keywords, while diet, weight loss, and fashion are the preoccupations of the Lady Gaga crowd.