Preprocessing the data for visualization

Before jumping into the visualizations, we will do some preparatory work on the data harvested:

In [16]:
# Read the harvested data stored in CSV into a pandas DataFrame
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')
In [20]:
print('tweets pandas dataframe - count:', pddf_in.count())
print('tweets pandas dataframe - shape:', pddf_in.shape)
print('tweets pandas dataframe - colns:', pddf_in.columns)
('tweets pandas dataframe - count:', Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64)
('tweets pandas dataframe - shape:', (7540, 6))
('tweets pandas dataframe - colns:', Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object'))

For the purpose of our visualization activity, we will use a dataset of 7,540 tweets. The key information is stored in the tweet_text column. We preview the data by calling the head() function on the dataframe:

In [21]:
pddf_in.head()
Out[21]:
  Unnamed: 0   id   created_at   user_id   user_name   tweet_text
0   0   638830426971181057   Tue Sep 01 21:46:57 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1   1   638830426727911424   Tue Sep 01 21:46:57 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2   2   638830425402556417   Tue Sep 01 21:46:56 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3   3   638830424563716097   Tue Sep 01 21:46:56 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4   4   638830422256816132   Tue Sep 01 21:46:56 +0000 2015   3276255125   True Equality   ernestsgantt: elsahel12: 9_A_6: dreamintention...

We will now create some utility functions to clean up the tweet text and parse the Twitter dates. First, we import the Python regular expression library re and the time library to parse dates and times:

In [72]:
import re
import time

We create a dictionary of regexes that will be compiled and then used within the parsing functions:

  • RT: The first regex, with key RT, looks for the keyword RT at the beginning of the tweet text:
    re.compile(r'^RT'),
  • ALNUM: The second regex, with key ALNUM, looks for words made of alphanumeric characters and the underscore sign preceded by the @ symbol in the tweet text:
    re.compile(r'(@[a-zA-Z0-9_]+)'),
  • HASHTAG: The third regex, with key HASHTAG, looks for words made of alphanumeric characters preceded by the # symbol in the tweet text:
    re.compile(r'(#[\w\d]+)'),
  • SPACES: The fourth regex, with key SPACES, looks for blank or line space characters in the tweet text:
    re.compile(r'\s+'),
  • URL: The fifth regex, with key URL, looks for URL addresses made of alphanumeric characters preceded by the https:// or http:// markers in the tweet text:
    re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')
In [24]:
regexp = {"RT": "^RT", "ALNUM": r"(@[a-zA-Z0-9_]+)",
          "HASHTAG": r"(#[\w\d]+)", "URL": r"([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)",
          "SPACES": r"\s+"}
regexp = dict((key, re.compile(value)) for key, value in regexp.items())
In [25]:
regexp
Out[25]:
{'ALNUM': re.compile(r'(@[a-zA-Z0-9_]+)'),
 'HASHTAG': re.compile(r'(#[\w\d]+)'),
 'RT': re.compile(r'^RT'),
 'SPACES': re.compile(r'\s+'),
 'URL': re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')}
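As a quick sanity check, we can exercise the compiled patterns on a sample tweet. This is a minimal sketch: the sample text is made up, and the results are shown as comments:

# hypothetical sample tweet for illustration
sample = 'RT @ApacheSpark: Spark speeds up #BigData processing http://t.co/abc123'
regexp["RT"].search(sample) is not None   # True: the tweet starts with RT
regexp["ALNUM"].findall(sample)           # ['@ApacheSpark']
regexp["HASHTAG"].findall(sample)         # ['#BigData']
regexp["URL"].findall(sample)             # ['://t.co/abc123']

Note that the URL pattern, as written, starts matching at the colon and so drops the http prefix; it will also pick up dotted tokens such as version numbers (for example, 1.4.1). Both quirks are visible in the urls column of the processed dataframe further down.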

We create a utility function to identify whether a tweet is a retweet or an original tweet:

In [77]:
def getAttributeRT(tweet):
    """ see if tweet is a RT """
    return re.search(regexp["RT"], tweet.strip()) is not None

Then, we extract all user handles in a tweet:

def getUserHandles(tweet):
    """ given a tweet we try and extract all user handles"""
    return re.findall(regexp["ALNUM"], tweet)

We also extract all hashtags in a tweet:

def getHashtags(tweet):
    """ return all hashtags"""
    return re.findall(regexp["HASHTAG"], tweet)

Extract all URL links in a tweet as follows:

def getURLs(tweet):
    """ URL : [http://]?[w.?/]+"""
    return re.findall(regexp["URL"], tweet)

We strip all URL links and user handles preceded by the @ sign from the tweet text. This function will be the basis of the wordcloud we will build soon:

def getTextNoURLsUsers(tweet):
    """ return parsed text terms stripped of URLs and User Names in tweet text
        ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)"," ",x).split()) """
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)|(RT)"," ", tweet).lower().split())
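For example, applied to a made-up raw tweet, the function collapses it down to its bare lowercase terms (an illustrative input, not taken from the dataset):

# hypothetical input; handles, punctuation, URLs, and the RT marker are all stripped
getTextNoURLsUsers('RT @user123: Check this out http://t.co/abc123 #Spark!')
# returns 'check this out spark'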

We label the data so we can create groups of datasets for the wordcloud:

def setTag(tweet):
    """ set tags to tweet_text based on search terms from tags_list"""
    tags_list = ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']
    lower_text = tweet.lower()
    return filter(lambda x:x.lower() in lower_text,tags_list)
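Note that this code runs under Python 2, where filter() returns a list. Under Python 3, filter() returns a lazy iterator, so a list comprehension is the safer, version-independent equivalent. A minimal sketch:

def setTag(tweet):
    """ set tags on tweet_text based on search terms from tags_list """
    tags_list = ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']
    lower_text = tweet.lower()
    # a list comprehension yields a concrete list on both Python 2 and Python 3
    return [tag for tag in tags_list if tag in lower_text]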

We parse the Twitter date into the yyyy-mm-dd hh:mm:ss format:

def decode_date(s):
    """ parse Twitter date into format yyyy-mm-dd hh:mm:ss"""
    return time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(s,'%a %b %d %H:%M:%S +0000 %Y'))
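For instance, applied to one of the created_at values seen in the preview above, the conversion gives (assuming an English locale for the %a and %b day and month names):

decode_date('Tue Sep 01 21:46:57 +0000 2015')
# returns '2015-09-01 21:46:57'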

We preview the data prior to processing:

In [43]:
pddf_in.columns
Out[43]:
Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object')
In [45]:
# df.drop([Column Name or list],inplace=True,axis=1)
pddf_in.drop(['Unnamed: 0'], inplace=True, axis=1)
In [46]:
pddf_in.head()
Out[46]:
  id   created_at   user_id   user_name   tweet_text
0   638830426971181057   Tue Sep 01 21:46:57 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1   638830426727911424   Tue Sep 01 21:46:57 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2   638830425402556417   Tue Sep 01 21:46:56 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3   638830424563716097   Tue Sep 01 21:46:56 +0000 2015   3276255125   True Equality   ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4   638830422256816132   Tue Sep 01 21:46:56 +0000 2015   3276255125   True Equality   ernestsgantt: elsahel12: 9_A_6: dreamintention...

We create new dataframe columns by applying the utility functions described previously: a column each for the hashtags, the user handles, the URLs, the text terms stripped of URLs and unwanted characters, and the labels. Finally, we parse the date:

In [82]:
pddf_in['htag'] = pddf_in.tweet_text.apply(getHashtags)
pddf_in['user_handles'] = pddf_in.tweet_text.apply(getUserHandles)
pddf_in['urls'] = pddf_in.tweet_text.apply(getURLs)
pddf_in['txt_terms'] = pddf_in.tweet_text.apply(getTextNoURLsUsers)
pddf_in['search_grp'] = pddf_in.tweet_text.apply(setTag)
pddf_in['date'] = pddf_in.created_at.apply(decode_date)
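Before previewing, we can check how the tweets are distributed across the search groups. This is a minimal sketch using collections.Counter; the exact counts will depend on the dataset you harvested:

from collections import Counter
# flatten the per-tweet tag lists and tally how many tweets mention each search term
tag_counts = Counter(tag for tags in pddf_in['search_grp'] for tag in tags)
print(tag_counts)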

The following code gives a quick snapshot of the newly generated dataframe:

In [83]:
pddf_in[2200:2210]
Out[83]:
  id   created_at   user_id   user_name   tweet_text   htag   urls   ptxt   tgrp   date   user_handles   txt_terms   search_grp
2200   638242693374681088   Mon Aug 31 06:51:30 +0000 2015   19525954   CENATIC   El impacto de @ApacheSpark en el procesamiento...   [#sparkSpecial]   [://t.co/4PQmJNuEJB]   el impacto de en el procesamiento de datos y e...   [spark]   2015-08-31 06:51:30   [@ApacheSpark]   el impacto de en el procesamiento de datos y e...   [spark]
2201   638238014695575552   Mon Aug 31 06:32:55 +0000 2015   51115854   Nawfal   Real Time Streaming with Apache Spark http://...   [#IoT, #SmartMelboune, #BigData, #Apachespark]   [://t.co/GW5PaqwVab]   real time streaming with apache spark iot smar...   [spark]   2015-08-31 06:32:55   []   real time streaming with apache spark iot smar...   [spark]
2202   638236084124516352   Mon Aug 31 06:25:14 +0000 2015   62885987   Mithun Katti   RT @differentsachin: Spark the flame of digita...   [#IBMHackathon, #SparkHackathon, #ISLconnectIN...   []   spark the flame of digital india ibmhackathon ...   [spark]   2015-08-31 06:25:14   [@differentsachin, @ApacheSpark]   spark the flame of digital india ibmhackathon ...   [spark]
2203   638234734649176064   Mon Aug 31 06:19:53 +0000 2015   140462395   solaimurugan v   Installing @ApacheMahout with @ApacheSpark 1.4...   []   [1.4.1, ://t.co/3c5dGbfaZe.]   installing with 1 4 1 got many more issue whil...   [spark]   2015-08-31 06:19:53   [@ApacheMahout, @ApacheSpark]   installing with 1 4 1 got many more issue whil...   [spark]
2204   638233517307072512   Mon Aug 31 06:15:02 +0000 2015   2428473836   Ralf Heineke   RT @RomeoKienzler: Join me @velocityconf on #m...   [#machinelearning, #devOps, #Bl]   [://t.co/U5xL7pYEmF]   join me on machinelearning based devops operat...   [spark]   2015-08-31 06:15:02   [@RomeoKienzler, @velocityconf, @ApacheSpark]   join me on machinelearning based devops operat...   [spark]
2205   638230184848687106   Mon Aug 31 06:01:48 +0000 2015   289355748   Akim Boyko   RT @databricks: Watch live today at 10am PT is...   []   [1.5, ://t.co/16cix6ASti]   watch live today at 10am pt is 1 5 presented b...   [spark]   2015-08-31 06:01:48   [@databricks, @ApacheSpark, @databricks, @pwen...   watch live today at 10am pt is 1 5 presented b...   [spark]
2206   638227830443110400   Mon Aug 31 05:52:27 +0000 2015   145001241   sachin aggarwal   Spark the flame of digital India @ #IBMHackath...   [#IBMHackathon, #SparkHackathon, #ISLconnectIN...   [://t.co/C1AO3uNexe]   spark the flame of digital india ibmhackathon ...   [spark]   2015-08-31 05:52:27   [@ApacheSpark]   spark the flame of digital india ibmhackathon ...   [spark]
2207   638227031268810752   Mon Aug 31 05:49:16 +0000 2015   145001241   sachin aggarwal   RT @pravin_gadakh: Imagine, innovate and Igni...   [#IBMHackathon, #ISLconnectIN2015]   []   gadakh imagine innovate and ignite digital ind...   [spark]   2015-08-31 05:49:16   [@pravin_gadakh, @ApacheSpark]   gadakh imagine innovate and ignite digital ind...   [spark]
2208   638224591920336896   Mon Aug 31 05:39:35 +0000 2015   494725634   IBM Asia Pacific   RT @sachinparmar: Passionate about Spark?? Hav...   [#IBMHackathon, #ISLconnectIN]   [India..]   passionate about spark have dreams of clean sa...   [spark]   2015-08-31 05:39:35   [@sachinparmar]   passionate about spark have dreams of clean sa...   [spark]
2209   638223327467692032   Mon Aug 31 05:34:33 +0000 2015   3158070968   Open Source India   "Game Changer" #ApacheSpark speeds up #bigdata...   [#ApacheSpark, #bigdata]   [://t.co/ieTQ9ocMim]   game changer apachespark speeds up bigdata pro...   [spark]   2015-08-31 05:34:33   []   game changer apachespark speeds up bigdata pro...   [spark]

We save the processed information in CSV format. We have 7,540 records and 13 columns. In your case, the output will vary according to the dataset you chose:

In [84]:
f_name = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweets_processed.csv'
pddf_in.to_csv(f_name, sep=';', encoding='utf-8', index=False)
In [85]:
pddf_in.shape
Out[85]:
(7540, 13)
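One caveat when round-tripping through CSV: the list-valued columns (htag, urls, user_handles, and search_grp) are serialized as their string representation, so reading the file back yields strings rather than lists. A minimal sketch of the check:

pddf_check = pd.read_csv(f_name, sep=';', encoding='utf-8')
print(pddf_check.shape)                   # expected to match the saved dataframe
print(type(pddf_check['htag'].iloc[0]))   # str: the lists do not survive the round trip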