Gauging words, moods, and memes at a glance

We are now ready to build the wordclouds, which will give us a sense of the important words carried in those tweets. A wordcloud extracts the top words from a body of text and lays them out in an image where the size of each word reflects its frequency: the more frequent the word in the dataset, the bigger its font size in the rendering. We will create wordclouds for each of the datasets harvested. The datasets cover three very different themes, each with two competing or analogous entities. Our first theme is obviously data processing and analytics, with Apache Spark and Python as our entities. Our second theme is the 2016 presidential election campaign, with two contenders: Hillary Clinton and Donald Trump. Our last theme is the world of pop music, with Justin Bieber and Lady Gaga as the two exponents.
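To make the link between word frequency and font size concrete, here is a minimal sketch using only the Python standard library. The function name font_sizes, its parameters, and the linear scaling rule are assumptions made for illustration; the wordcloud library used later in this section applies its own layout and scaling logic.

```python
from collections import Counter


def font_sizes(text, top_n=5, min_size=10, max_size=60):
    """Map the top_n most frequent words to font sizes scaled by frequency."""
    counts = Counter(text.lower().split())
    top = counts.most_common(top_n)
    if not top:
        return {}
    highest = float(top[0][1])
    # Linear scaling (an assumption): the most frequent word gets max_size
    return {
        word: max(min_size, int(round(max_size * count / highest)))
        for word, count in top
    }


# Hypothetical sample text standing in for a set of tweets
sample = "spark python spark hadoop spark python data"
sizes = font_sizes(sample, top_n=3)
print(sizes)
```

In this toy input, spark occurs three times and python twice, so spark receives the largest size and python two thirds of it.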

Setting up wordcloud

We will illustrate the programming steps by analyzing the Spark-related tweets. We load the data and preview the dataframe:

In [21]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets.csv'
tspark_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')
In [3]:
tspark_df.head(3)
Out[3]:
  id   created_at   user_id   user_name   tweet_text   htag   urls   ptxt   tgrp   date   user_handles   txt_terms   search_grp
0   638818911773856000   Tue Sep 01 21:01:11 +0000 2015   2511247075   Noor Din   RT @kdnuggets: R leads RapidMiner, Python catc...   [#KDN]   [://t.co/3bsaTT7eUs]   r leads rapidminer python catches up big data ...   [spark, python]   2015-09-01 21:01:11   [@kdnuggets]   r leads rapidminer python catches up big data ...   [spark, python]
1   622142176768737000   Fri Jul 17 20:33:48 +0000 2015   24537879   IBM Cloudant   Be one of the first to sign-up for IBM Analyti...   [#ApacheSpark, #SparkInsight]   [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]   be one of the first to sign up for ibm analyti...   [spark]   2015-07-17 20:33:48   []   be one of the first to sign up for ibm analyti...   [spark]
2   622140453069169000   Fri Jul 17 20:26:57 +0000 2015   515145898   Arno Candel   Nice article on #apachespark, #hadoop and #dat...   [#apachespark, #hadoop, #datascience]   [://t.co/IyF44pV0f3]   nice article on apachespark hadoop and datasci...   [spark]   2015-07-17 20:26:57   [@h2oai]   nice article on apachespark hadoop and datasci...   [spark]

Note

The wordcloud library we will use is the one developed by Andreas Mueller and hosted on his GitHub account at https://github.com/amueller/word_cloud.

The library requires PIL (short for Python Imaging Library). PIL is a complex library to install by hand, so the easiest route is to invoke conda install pil. PIL has not yet been ported to Python 3.4, so we need to run a Python 2.7 environment to be able to see our wordcloud:

#
# Install PIL (does not work with Python 3.4)
#
an@an-VB:~$ conda install pil

Fetching package metadata: ....
Solving package specifications: ..................
Package plan for installation in environment /home/an/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libpng-1.6.17              |                0         214 KB
    freetype-2.5.5             |                0         2.2 MB
    conda-env-2.4.4            |           py27_0          24 KB
    pil-1.1.7                  |           py27_2         650 KB
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following packages will be UPDATED:

    conda-env: 2.4.2-py27_0 --> 2.4.4-py27_0
    freetype:  2.5.2-0      --> 2.5.5-0     
    libpng:    1.5.13-1     --> 1.6.17-0    
    pil:       1.1.7-py27_1 --> 1.1.7-py27_2

Proceed ([y]/n)? y

Next, we install the wordcloud library:

#
# Install wordcloud
# Andreas Mueller
# https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py
#

an@an-VB:~$ pip install wordcloud
Collecting wordcloud
  Downloading wordcloud-1.1.3.tar.gz (163kB)
    100% |████████████████████████████████| 163kB 548kB/s 
Building wheels for collected packages: wordcloud
  Running setup.py bdist_wheel for wordcloud
  Stored in directory: /home/an/.cache/pip/wheels/32/a9/74/58e379e5dc614bfd9dd9832d67608faac9b2bc6c194d6f6df5
Successfully built wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.1.3

Creating wordclouds

At this stage, we are ready to invoke the wordcloud program with the generated list of terms from the tweet text.

Let's get started with the wordcloud program by first calling %matplotlib inline to display the wordcloud in our notebook:

In [4]:
%matplotlib inline

We convert the dataframe txt_terms column into a list of strings, one per tweet. We make sure every item is converted into the str type to avoid any bad surprises, and check the list's first four records:

In [11]:
len(tspark_df['txt_terms'].tolist())
Out[11]:
2024
In [22]:
tspark_ls_str = [str(t) for t in tspark_df['txt_terms'].tolist()]
In [14]:
len(tspark_ls_str)
Out[14]:
2024
In [15]:
tspark_ls_str[:4]
Out[15]:
['r leads rapidminer python catches up big data tools grow spark ignites kdn',
 'be one of the first to sign up for ibm analytics for apachespark today sparkinsight',
 'nice article on apachespark hadoop and datascience',
 'spark 101 running spark and mapreduce together in production hadoopsummit2015 apachespark altiscale']
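
One of the bad surprises that the str conversion guards against is a missing value: pandas typically represents an empty text cell as a float NaN, which cannot be joined into a string. The following is a minimal sketch, using a hypothetical three-item list in place of the real column, of what str() does with such a value and of one possible way to drop it instead. Whether the actual dataset contains missing values is not known here, and the helper is_missing is introduced only for illustration.

```python
import math

# Hypothetical stand-in for tspark_df['txt_terms'].tolist();
# float('nan') plays the role of a missing cell
raw_terms = ['spark ignites kdn', float('nan'), 'nice article on apachespark']


def is_missing(value):
    """Return True for a float NaN, the usual marker of a missing cell."""
    return isinstance(value, float) and math.isnan(value)


# str() makes every item joinable, but the missing value becomes the word 'nan'
as_str = [str(t) for t in raw_terms]

# Alternative: drop missing values before converting
clean = [str(t) for t in raw_terms if not is_missing(t)]

print(as_str)
print(clean)
```

With the plain str() conversion, a missing tweet would contribute the word nan to the wordcloud, so filtering first may be preferable when the column has gaps.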

We first import Matplotlib and the wordcloud library:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

From the input list of terms, we create a single string of terms separated by whitespace as the input to the wordcloud program. We pass the library's STOPWORDS set to the WordCloud constructor so that common stopwords are left out of the rendering:

# join tweets to a single string
words = ' '.join(tspark_ls_str)

# create wordcloud 
wordcloud = WordCloud(
                      # remove stopwords
                      stopwords=STOPWORDS,
                      background_color='black',
                      width=1800,
                      height=1400
                     ).generate(words)

# render wordcloud image
plt.imshow(wordcloud)
plt.axis('off')

# save wordcloud image on disk
plt.savefig('./spark_tweets_wordcloud_1.png', dpi=300)

# display image in Jupyter notebook
plt.show()
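
To see why stopword removal matters before counting, the following is a minimal sketch using only the standard library. The small stopwords set and the two-tweet list are assumptions chosen for illustration; the library's STOPWORDS set is much larger, and this is not the library's own implementation.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, not the library's STOPWORDS)
stopwords = {'be', 'one', 'of', 'the', 'to', 'for', 'on', 'and', 'up', 'in'}

# Two of the cleaned tweets shown earlier, shortened for the example
tweets = [
    'be one of the first to sign up for ibm analytics for apachespark today',
    'nice article on apachespark hadoop and datascience',
]

# Join the tweets into one string, then split into individual words
all_words = ' '.join(tweets).split()

# Keep only the words that are not stopwords
kept = [w for w in all_words if w not in stopwords]
counts = Counter(kept)

print(counts.most_common(1))
```

Without the filter, the filler word for occurs as often as apachespark in this sample and would be drawn just as large; with the filter, only the meaningful term remains at the top.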

Here, we can visualize the wordclouds for Apache Spark and Python. In the case of Spark, the memes are clearly Hadoop, big data, and analytics, while the Python wordcloud recalls the root of the language's name, Monty Python, with a strong focus on developer, apache spark, and programming, and some hints of java and ruby.

[Figure: wordclouds for the Apache Spark and Python tweets]

We can also get a glimpse in the following wordclouds of the words surrounding the 2016 US presidential election candidates Hillary Clinton and Donald Trump. Seemingly, Hillary Clinton is overshadowed by the presence of her opponents Donald Trump and Bernie Sanders, while Trump is heavily centered only on himself:

[Figure: wordclouds for the Hillary Clinton and Donald Trump tweets]

Interestingly, in the case of Justin Bieber and Lady Gaga, the word love appears. In the case of Bieber, follow and belieber are key words, while diet, weight loss, and fashion are the preoccupations for the Lady Gaga crowd.

[Figure: wordclouds for the Justin Bieber and Lady Gaga tweets]