Before jumping into the visualizations, we will do some preparatory work on the harvested data:
In [16]:
# Read the harvested data stored in csv into a Pandas DF
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')

In [20]:
print('tweets pandas dataframe - count:', pddf_in.count())
print('tweets pandas dataframe - shape:', pddf_in.shape)
print('tweets pandas dataframe - colns:', pddf_in.columns)

('tweets pandas dataframe - count:', Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64)
('tweets pandas dataframe - shape:', (7540, 6))
('tweets pandas dataframe - colns:', Index([u'Unnamed: 0', u'id', u'created_at',
       u'user_id', u'user_name', u'tweet_text'], dtype='object'))
For the purpose of our visualization activity, we will use a dataset of 7,540 tweets. The key information is stored in the tweet_text column. We preview the data stored in the dataframe by calling the head() function on it:
In [21]:
pddf_in.head()

Out[21]:
   Unnamed: 0                  id                      created_at     user_id      user_name                                         tweet_text
0           0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1           1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2           2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3           3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4           4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...
We will now create some utility functions to clean up the tweet text and parse the Twitter date. First, we import the Python regular expression library re and the time library to parse dates and times:
In [72]:
import re
import time
We create a dictionary of regexes that will be compiled and then passed to the parsing functions:

- RT: looks for the keyword RT at the beginning of the tweet text: re.compile(r'^RT')
- ALNUM: looks for words made of alphanumeric characters and the underscore sign preceded by the @ symbol in the tweet text: re.compile(r'(@[a-zA-Z0-9_]+)')
- HASHTAG: looks for words made of alphanumeric characters preceded by the # symbol in the tweet text: re.compile(r'(#[\w\d]+)')
- SPACES: looks for blank or line space characters in the tweet text: re.compile(r'\s+')
- URL: looks for URL addresses made of alphanumeric characters preceded by the https:// or http:// markers in the tweet text: re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')

In [24]:
regexp = {"RT": "^RT", "ALNUM": r"(@[a-zA-Z0-9_]+)",
          "HASHTAG": r"(#[\w\d]+)",
          "URL": r"([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)",
          "SPACES": r"\s+"}
regexp = dict((key, re.compile(value)) for key, value in regexp.items())

In [25]:
regexp

Out[25]:
{'ALNUM': re.compile(r'(@[a-zA-Z0-9_]+)'),
 'HASHTAG': re.compile(r'(#[\w\d]+)'),
 'RT': re.compile(r'^RT'),
 'SPACES': re.compile(r'\s+'),
 'URL': re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')}
We create a utility function to identify whether a tweet is a retweet or an original tweet:
In [77]:
def getAttributeRT(tweet):
    """ see if tweet is a RT """
    return re.search(regexp["RT"], tweet.strip()) is not None
Then, we extract all user handles in a tweet:
def getUserHandles(tweet):
    """ given a tweet we try and extract all user handles """
    return re.findall(regexp["ALNUM"], tweet)
We also extract all hashtags in a tweet:
def getHashtags(tweet):
    """ return all hashtags """
    return re.findall(regexp["HASHTAG"], tweet)
Extract all URL links in a tweet as follows:
def getURLs(tweet):
    """ URL : [http://]?[\w\.?/]+ """
    return re.findall(regexp["URL"], tweet)
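A brief check of this extractor on a made-up tweet shows a quirk of the pattern worth knowing: the square brackets make [https://|http://] a character class rather than an alternation, so the match effectively starts at the :// marker, which is why the urls column later contains entries such as ://t.co/4PQmJNuEJB:

```python
import re

# Same URL pattern as in the regexp dictionary above
URL = re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')

def getURLs(tweet):
    """ return all URL-like fragments found in the tweet text """
    return URL.findall(tweet)

# Hypothetical tweet text: the scheme prefix is not captured
print(getURLs("Real Time Streaming with Apache Spark http://t.co/GW5PaqwVab"))
# ['://t.co/GW5PaqwVab']
```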
We strip all URL links and user handles preceded by the @ sign from the tweet text. This function will be the basis of the wordcloud we will build soon:
def getTextNoURLsUsers(tweet):
    """ return parsed text terms stripped of URLs and User Names in tweet text
        ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z ])|(\w+://\S+)", " ", x).split())
    """
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z ])|(\w+://\S+)|(RT)",
                           " ", tweet).lower().split())
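As a quick illustration on a hypothetical tweet, the stripping function removes the handle, the URL, the RT marker, and punctuation, and lowercases the rest:

```python
import re

def getTextNoURLsUsers(tweet):
    """ return text terms stripped of URLs, user names, RT marker, punctuation """
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z ])|(\w+://\S+)|(RT)",
                           " ", tweet).lower().split())

# Hypothetical tweet text
print(getTextNoURLsUsers("RT @databricks: Watch live! http://t.co/16cix6ASti #Spark"))
# watch live spark
```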
We label the data so we can create groups of datasets for the wordcloud:
def setTag(tweet):
    """ set tags to tweet_text based on search terms from tags_list """
    tags_list = ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']
    lower_text = tweet.lower()
    return filter(lambda x: x.lower() in lower_text, tags_list)
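One detail worth noting if you run this under Python 3: filter returns a lazy iterator rather than a list, so a list() wrapper keeps the Python 2 behavior of the code above. A minimal check on made-up text:

```python
def setTag(tweet):
    """ set tags to tweet_text based on search terms from tags_list """
    tags_list = ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']
    lower_text = tweet.lower()
    # list() keeps the Python 2 semantics under Python 3
    return list(filter(lambda x: x.lower() in lower_text, tags_list))

print(setTag("Real Time Streaming with Apache Spark"))  # ['spark']
print(setTag("PySpark brings Python to Spark"))         # ['spark', 'python']
```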
We parse the Twitter date into the yyyy-mm-dd hh:mm:ss format:
def decode_date(s):
    """ parse Twitter date into format yyyy-mm-dd hh:mm:ss """
    return time.strftime('%Y-%m-%d %H:%M:%S',
                         time.strptime(s, '%a %b %d %H:%M:%S +0000 %Y'))
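Applied to one of the timestamps seen in the dataframe preview, the conversion looks like this:

```python
import time

def decode_date(s):
    """ parse Twitter date into format yyyy-mm-dd hh:mm:ss """
    return time.strftime('%Y-%m-%d %H:%M:%S',
                         time.strptime(s, '%a %b %d %H:%M:%S +0000 %Y'))

# Timestamp taken from the head() preview above
print(decode_date('Tue Sep 01 21:46:57 +0000 2015'))  # 2015-09-01 21:46:57
```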
We preview the data prior to processing:
In [43]:
pddf_in.columns

Out[43]:
Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name',
       u'tweet_text'], dtype='object')

In [45]:
# df.drop([Column Name or list], inplace=True, axis=1)
pddf_in.drop(['Unnamed: 0'], inplace=True, axis=1)

In [46]:
pddf_in.head()

Out[46]:
                   id                      created_at     user_id      user_name                                         tweet_text
0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...
We create new dataframe columns by applying the utility functions described above: one column each for the hashtags, the user handles, the URLs, the text terms stripped of URLs and unwanted characters, and the labels. Finally, we parse the date:
In [82]:
pddf_in['htag'] = pddf_in.tweet_text.apply(getHashtags)
pddf_in['user_handles'] = pddf_in.tweet_text.apply(getUserHandles)
pddf_in['urls'] = pddf_in.tweet_text.apply(getURLs)
pddf_in['txt_terms'] = pddf_in.tweet_text.apply(getTextNoURLsUsers)
pddf_in['search_grp'] = pddf_in.tweet_text.apply(setTag)
pddf_in['date'] = pddf_in.created_at.apply(decode_date)
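The column-creation pattern above can be sketched end to end on a tiny made-up dataframe (two rows, one extractor) to see the shape of the result:

```python
import re
import pandas as pd

# Same HASHTAG pattern as in the regexp dictionary above
HASHTAG = re.compile(r'(#[\w\d]+)')

def getHashtags(tweet):
    """ return all hashtags """
    return HASHTAG.findall(tweet)

# Hypothetical two-row dataframe standing in for pddf_in
toy = pd.DataFrame({'tweet_text': ["Spark speeds up #bigdata processing",
                                   "no hashtags in this one"]})
toy['htag'] = toy.tweet_text.apply(getHashtags)
print(toy['htag'].tolist())  # [['#bigdata'], []]
```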
The following code gives a quick snapshot of the newly generated dataframe:
In [83]:
pddf_in[2200:2210]

Out[83] (wide output reformatted one record per block; columns: id, created_at, user_id, user_name, tweet_text, htag, urls, ptxt, tgrp, date, user_handles, txt_terms, search_grp):

2200  id: 638242693374681088   created_at: Mon Aug 31 06:51:30 +0000 2015
      user_id: 19525954   user_name: CENATIC
      tweet_text: El impacto de @ApacheSpark en el procesamiento...
      htag: [#sparkSpecial]   urls: [://t.co/4PQmJNuEJB]
      ptxt: el impacto de en el procesamiento de datos y e...   tgrp: [spark]
      date: 2015-08-31 06:51:30   user_handles: [@ApacheSpark]
      txt_terms: el impacto de en el procesamiento de datos y e...   search_grp: [spark]

2201  id: 638238014695575552   created_at: Mon Aug 31 06:32:55 +0000 2015
      user_id: 51115854   user_name: Nawfal
      tweet_text: Real Time Streaming with Apache Spark http://...
      htag: [#IoT, #SmartMelboune, #BigData, #Apachespark]   urls: [://t.co/GW5PaqwVab]
      ptxt: real time streaming with apache spark iot smar...   tgrp: [spark]
      date: 2015-08-31 06:32:55   user_handles: []
      txt_terms: real time streaming with apache spark iot smar...   search_grp: [spark]

2202  id: 638236084124516352   created_at: Mon Aug 31 06:25:14 +0000 2015
      user_id: 62885987   user_name: Mithun Katti
      tweet_text: RT @differentsachin: Spark the flame of digita...
      htag: [#IBMHackathon, #SparkHackathon, #ISLconnectIN...   urls: []
      ptxt: spark the flame of digital india ibmhackathon ...   tgrp: [spark]
      date: 2015-08-31 06:25:14   user_handles: [@differentsachin, @ApacheSpark]
      txt_terms: spark the flame of digital india ibmhackathon ...   search_grp: [spark]

2203  id: 638234734649176064   created_at: Mon Aug 31 06:19:53 +0000 2015
      user_id: 140462395   user_name: solaimurugan v
      tweet_text: Installing @ApacheMahout with @ApacheSpark 1.4...
      htag: []   urls: [1.4.1, ://t.co/3c5dGbfaZe.]
      ptxt: installing with 1 4 1 got many more issue whil...   tgrp: [spark]
      date: 2015-08-31 06:19:53   user_handles: [@ApacheMahout, @ApacheSpark]
      txt_terms: installing with 1 4 1 got many more issue whil...   search_grp: [spark]

2204  id: 638233517307072512   created_at: Mon Aug 31 06:15:02 +0000 2015
      user_id: 2428473836   user_name: Ralf Heineke
      tweet_text: RT @RomeoKienzler: Join me @velocityconf on #m...
      htag: [#machinelearning, #devOps, #Bl]   urls: [://t.co/U5xL7pYEmF]
      ptxt: join me on machinelearning based devops operat...   tgrp: [spark]
      date: 2015-08-31 06:15:02   user_handles: [@RomeoKienzler, @velocityconf, @ApacheSpark]
      txt_terms: join me on machinelearning based devops operat...   search_grp: [spark]

2205  id: 638230184848687106   created_at: Mon Aug 31 06:01:48 +0000 2015
      user_id: 289355748   user_name: Akim Boyko
      tweet_text: RT @databricks: Watch live today at 10am PT is...
      htag: []   urls: [1.5, ://t.co/16cix6ASti]
      ptxt: watch live today at 10am pt is 1 5 presented b...   tgrp: [spark]
      date: 2015-08-31 06:01:48   user_handles: [@databricks, @ApacheSpark, @databricks, @pwen...
      txt_terms: watch live today at 10am pt is 1 5 presented b...   search_grp: [spark]

2206  id: 638227830443110400   created_at: Mon Aug 31 05:52:27 +0000 2015
      user_id: 145001241   user_name: sachin aggarwal
      tweet_text: Spark the flame of digital India @ #IBMHackath...
      htag: [#IBMHackathon, #SparkHackathon, #ISLconnectIN...   urls: [://t.co/C1AO3uNexe]
      ptxt: spark the flame of digital india ibmhackathon ...   tgrp: [spark]
      date: 2015-08-31 05:52:27   user_handles: [@ApacheSpark]
      txt_terms: spark the flame of digital india ibmhackathon ...   search_grp: [spark]

2207  id: 638227031268810752   created_at: Mon Aug 31 05:49:16 +0000 2015
      user_id: 145001241   user_name: sachin aggarwal
      tweet_text: RT @pravin_gadakh: Imagine, innovate and Igni...
      htag: [#IBMHackathon, #ISLconnectIN2015]   urls: []
      ptxt: gadakh imagine innovate and ignite digital ind...   tgrp: [spark]
      date: 2015-08-31 05:49:16   user_handles: [@pravin_gadakh, @ApacheSpark]
      txt_terms: gadakh imagine innovate and ignite digital ind...   search_grp: [spark]

2208  id: 638224591920336896   created_at: Mon Aug 31 05:39:35 +0000 2015
      user_id: 494725634   user_name: IBM Asia Pacific
      tweet_text: RT @sachinparmar: Passionate about Spark?? Hav...
      htag: [#IBMHackathon, #ISLconnectIN]   urls: [India..]
      ptxt: passionate about spark have dreams of clean sa...   tgrp: [spark]
      date: 2015-08-31 05:39:35   user_handles: [@sachinparmar]
      txt_terms: passionate about spark have dreams of clean sa...   search_grp: [spark]

2209  id: 638223327467692032   created_at: Mon Aug 31 05:34:33 +0000 2015
      user_id: 3158070968   user_name: Open Source India
      tweet_text: "Game Changer" #ApacheSpark speeds up #bigdata...
      htag: [#ApacheSpark, #bigdata]   urls: [://t.co/ieTQ9ocMim]
      ptxt: game changer apachespark speeds up bigdata pro...   tgrp: [spark]
      date: 2015-08-31 05:34:33   user_handles: []
      txt_terms: game changer apachespark speeds up bigdata pro...   search_grp: [spark]
We save the processed information in CSV format. We have 7,540 records and 13 columns. In your case, the output will vary according to the dataset you chose:
In [84]:
f_name = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweets_processed.csv'
pddf_in.to_csv(f_name, sep=';', encoding='utf-8', index=False)

In [85]:
pddf_in.shape

Out[85]:
(7540, 13)
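A round-trip check (on a hypothetical temporary path, not the author's path above, and a made-up two-row dataframe) confirms that the sep=';' and index=False choices preserve the data on reload:

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row dataframe standing in for the processed tweets
df = pd.DataFrame({'id': [1, 2], 'txt_terms': ['spark rocks', 'python rocks']})

f_name = os.path.join(tempfile.mkdtemp(), 'tweets_processed.csv')
df.to_csv(f_name, sep=';', encoding='utf-8', index=False)

df_back = pd.read_csv(f_name, sep=';', encoding='utf-8')
print(df_back.shape)       # (2, 2)
print(df_back.equals(df))  # True: the round trip preserves the data
```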