Data processing

Now that the data is extracted, it's time to process it and prepare it for the final analysis.

The full_data variable will contain a list of items, where a single item looks as follows:

{'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/XgRZI3UhbKRFQZGd-n-2OCOKR8A"',
 'id': 'z13jynay3seygpvzy04cef5r2tm5ihh4d0k',
 'kind': 'youtube#commentThread',
 'snippet': {'canReply': False,
  'isPublic': True,
  'topLevelComment': {'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/jW3EfLcy4MnIFrtFloQEjokBPXU"',
   'id': 'z13jynay3seygpvzy04cef5r2tm5ihh4d0k',
   'kind': 'youtube#comment',
   'snippet': {'authorChannelId': {'value': 'UCuJkT4Bsd1qiIQoI8Rbf_hg'},
    'authorChannelUrl': 'http://www.youtube.com/channel/UCuJkT4Bsd1qiIQoI8Rbf_hg',
    'authorDisplayName': 'Vince / FCB',
    'authorProfileImageUrl': 'https://yt3.ggpht.com/-9DZP7TJ0J-4/AAAAAAAAAAI/AAAAAAAAAAA/dvUasDtmZFw/s28-c-k-no-mo-rj-c0xffffff/photo.jpg',
    'canRate': False,
    'likeCount': 0,
    'publishedAt': '2017-05-06T17:52:51.000Z',
    'textDisplay': "i don't like this",
    'textOriginal': "i don't like this",
    'updatedAt': '2017-05-06T17:52:51.000Z',
    'videoId': 'YQUpg795iBo',
    'viewerRating': 'none'}},
  'totalReplyCount': 0,
  'videoId': 'YQUpg795iBo'}}

We will focus on the nested snippet object of topLevelComment, where we can find the textDisplay (the comment text) and publishedAt values.
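For instance, the comment text and publication date of the first item can be reached by following the nested keys (assuming full_data holds the list of comment threads extracted in the previous section):

# Navigate to the inner snippet of the first comment thread
first = full_data[0]['snippet']['topLevelComment']['snippet']
print(first['textDisplay'])   # "i don't like this"
print(first['publishedAt'])   # '2017-05-06T17:52:51.000Z'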

Pandas dataframes are an efficient structure for storing and processing such data. We will therefore extract the relevant fields into a dataframe, as shown in the following code snippet:

import pandas as pd 
 
df = pd.DataFrame() 
 
df['comments'] = [k['snippet']['topLevelComment']['snippet']['textDisplay'] for k in full_data] 
df['date'] = [k['snippet']['topLevelComment']['snippet']['publishedAt'] for k in full_data] 

In our analysis, we are going to look at the comments as a function of time. For that purpose, we will use the time series functionality of dataframes, which requires the dataframe index to be a datetime object, as follows:

df = df.set_index(['date'])                  # sets index 
df.index = pd.to_datetime(df.index)          # converts to datetime object 

The dataframe now uses the publication date as its index and holds a comments column with the data gathered from the YouTube API:

date                 comments
2013-11-11 00:49:49  Greatness Awaits. :)
2013-11-11 00:49:48  HNNNNNNGH!!! My body isn't ready! PS4 the ...
2013-11-11 00:49:41  Epic
2013-11-11 00:49:35  Quite possibly the coolest unboxing video ever.
2013-11-11 00:47:36  GREATNESS HAS ARRIVED
2013-11-11 00:47:01  I thought they said they were leaving this to ...
2013-11-11 00:46:56  simmbaaa
2013-11-11 00:46:37  cant leave fingerprints or else hes going to jail
2013-11-11 00:46:05  4th!!!!
2013-11-11 00:45:10  WOOT WOOT
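With the datetime index in place, time-based aggregations become straightforward. As a quick illustration (a minimal sketch; the actual aggregation over time is performed later in the analysis), we could count the comments posted per day:

# Count the number of comments per day using the datetime index
daily_counts = df['comments'].resample('D').count()
print(daily_counts.head())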

The comments are in the raw form in which they were extracted from YouTube. They therefore contain special characters, punctuation, emojis, and other elements that create noise in sentiment analysis. Before we run sentiment analysis on each comment, as explained in previous chapters, we should process the text to remove all such elements. The text processing workflow for this case consists of the following steps:

  • Tokenization
  • Conversion to lowercase
  • Stopwords removal
  • Punctuation removal
  • Removal of words shorter than two characters

We implement all of these steps in a single function and then apply it to the whole dataset.

Such language processing tasks are well handled by the NLTK library:

from nltk import word_tokenize 
from nltk.corpus import stopwords 
import string 
 
def clean(text):
    # Tokenize the comment and convert every token to lowercase
    tokens = word_tokenize(text.strip())
    clean = [i.lower() for i in tokens]
    # Remove English stopwords
    clean = [i for i in clean if i not in stopwords.words('english')]
    # Drop punctuation tokens and strip punctuation from the remaining words
    clean = [i.strip(string.punctuation) for i in clean if i not in list(string.punctuation)]
    # Remove words shorter than two characters
    clean = [i for i in clean if len(i) > 1]

    return " ".join(clean)
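Note that word_tokenize and the stopwords corpus rely on NLTK data packages that are not installed by default; they can be fetched once with nltk.download (the vader_lexicon used later in this section can be downloaded at the same time):

import nltk

# One-time downloads of the NLTK resources used in this section
nltk.download('punkt')          # tokenizer models for word_tokenize
nltk.download('stopwords')      # English stopword lists
nltk.download('vader_lexicon')  # lexicon for the VADER sentiment analyzer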

Such a method of cleaning provides good enough results for the kind of analysis we perform in this chapter. As an example, the function will transform the comment:

Greatness needs to stop awaiting and get here already! 5 more days!

Into:

greatness needs stop awaiting get already days

Now, we can apply the function to our dataset:

df['clean_comments'] = df['comments'].apply(clean) 

We have now created a new column with clean comments, on which we will calculate sentiment.

The sentiment will be calculated using the NLTK VADER classifier:

from nltk.sentiment.vader import SentimentIntensityAnalyzer 
 
sentiment = SentimentIntensityAnalyzer() 
          
df['sentiment'] = df['clean_comments'].apply(lambda txt: sentiment.polarity_scores(txt)['compound']) 

The new column contains the compound sentiment score for each comment, where values closer to +1 indicate a positive attitude and values closer to -1 a negative one.
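Since the dataframe already carries a datetime index, summarizing sentiment perception over time reduces to a resampling step. As a minimal sketch (the monthly frequency here is an arbitrary choice; any pandas offset alias would work), we could compute:

# Average compound sentiment per month, using the datetime index
monthly_sentiment = df['sentiment'].resample('M').mean()
print(monthly_sentiment.head())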

We have now completed all the data preparation needed to compute the results of the analysis: sentiment perception over time, for both the channel and the video. We have seen that data extracted from social media APIs requires a substantial amount of processing and preparation to attain the objectives set. In this case, we needed to add a feature carrying the sentiment of each comment; other objectives could require different features.
