Step 3 – feature extraction

Our data frames contain the raw data that we gathered from the Graph API. They contain all kinds of characters that appear in posts and comments, so we have to pre-process them and perform an initial information extraction to understand what consumers actually say.

We define the feature extraction process as a pipeline that applies different kinds of transformations in sequence. The goal at this stage is to extract hashtags, keywords, and noun phrases from posts and comments.

The preprocess() function cleans a raw verbatim (the message field in our dataset) by stripping whitespace, removing punctuation, and converting the text to lowercase. It then splits the text into tokens and returns them as a list:

import re
import nltk

def preprocess(text):

    # Basic cleaning: strip whitespace, remove punctuation, lowercase
    text = text.strip()
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()

    # Tokenize a single verbatim into a list of words
    tokens = nltk.word_tokenize(text)

    return tokens
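As a quick check, here is what preprocess() returns on an invented sample comment (the first call to word_tokenize() may require nltk.download('punkt')). Note that the cleaning step strips the # symbol along with the rest of the punctuation, which is why hashtags have to be extracted from the raw text:

# Sample verbatim invented for illustration
preprocess("I LOVE this product!!! #awesome")
# -> ['i', 'love', 'this', 'product', 'awesome']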

The get_hashtags() function uses a regular expression to extract a list of hashtags from raw messages. It is applied directly to the raw verbatims, because the cleaning step would otherwise remove the hash symbols:

def get_hashtags(text):
    # Capture the word following each '#' symbol
    hashtags = re.findall(r"#(\w+)", text)
    return hashtags
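A short example with an invented message shows that the hash symbols themselves are dropped and only the tag words are captured:

# Sample raw message invented for illustration
get_hashtags("Loving the new collection! #Fashion #SummerSale")
# -> ['Fashion', 'SummerSale']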

The tag_tokens() function uses the NLTK pos_tag() function to tag parts of speech on the pre-processed tokens. It returns a list of tokens with their respective part-of-speech codes, using the default Penn Treebank tagset. The results of this step will be used to extract information from sentence structure:

def tag_tokens(preprocessed_tokens):
    # Tag each token with its Penn Treebank part-of-speech code
    pos = nltk.pos_tag(preprocessed_tokens)
    return pos
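For instance, on an invented pre-processed comment the tagger produces pairs of tokens and Penn Treebank codes (the exact tags may vary slightly between NLTK versions, and pos_tag() may require nltk.download('averaged_perceptron_tagger') on first use):

tag_tokens(['the', 'new', 'store', 'opens', 'today'])
# -> [('the', 'DT'), ('new', 'JJ'), ('store', 'NN'), ('opens', 'VBZ'), ('today', 'NN')]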

The get_keywords() function takes the tagging output and returns a list of words whose parts of speech were selected by the user. In our case this might be nouns, verbs, adjectives, or all of them at once. These parts of speech are usually the most insightful ones, but it is easy to add another, such as adverbs or prepositions, depending on the goal of the analysis:

def get_keywords(tagged_tokens, pos='all'):

    # Map the user's choice to Penn Treebank tag prefixes
    if pos == 'all':
        lst_pos = ('NN', 'JJ', 'VB')
    elif pos == 'nouns':
        lst_pos = 'NN'
    elif pos == 'verbs':
        lst_pos = 'VB'
    elif pos == 'adjectives':
        lst_pos = 'JJ'
    else:
        lst_pos = ('NN', 'JJ', 'VB')

    # str.startswith() accepts a single prefix or a tuple of prefixes,
    # so this also matches variants such as NNS, VBD, or JJR
    keywords = [tup[0] for tup in tagged_tokens if tup[1].startswith(lst_pos)]

    return keywords
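To see how the filtering behaves, here is a small sketch on invented tagged tokens. Because startswith() is used, derived tags such as NNS or VBZ are matched through their prefixes:

tagged = [('the', 'DT'), ('new', 'JJ'), ('store', 'NN'),
          ('opens', 'VBZ'), ('today', 'NN')]

get_keywords(tagged, 'nouns')       # -> ['store', 'today']
get_keywords(tagged, 'adjectives')  # -> ['new']
get_keywords(tagged, 'all')         # -> ['new', 'store', 'opens', 'today']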

The last function that we will use at this stage of the analysis extracts noun phrases. A noun phrase is defined as a phrase that has a noun as its head word. Extracting this kind of syntactic structure very often yields interesting insights into what people talk about on the web.

For this purpose, we first define a chunk grammar that matches an optional determiner (DT), followed by any number of adjectives (JJ) and a single noun (NN). We parse the tagged tokens with this pattern and then extract the noun phrases that contain more than one word:

def get_noun_phrases(tagged_tokens):

    # Chunk grammar: optional determiner, any number of adjectives, then a noun
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(tagged_tokens)

    result = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        # We only keep phrases, not single words
        if len(subtree.leaves()) > 1:
            outputs = [tup[0] for tup in subtree.leaves()]
            outputs = " ".join(outputs)
            result.append(outputs)

    return result
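On the same invented example, the chunker keeps the multi-word phrase and drops the single-noun chunk:

tagged = [('the', 'DT'), ('new', 'JJ'), ('store', 'NN'),
          ('opens', 'VBZ'), ('today', 'NN')]

get_noun_phrases(tagged)
# -> ['the new store']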

We execute the whole pipeline to create two data frames with the extracted hashtags, keywords, and noun phrases, applying the previously defined functions to all the posts and comments:

def execute_pipeline(dataframe):
    # Get hashtags from the raw message, before cleaning removes the '#' symbols
    dataframe['hashtags'] = dataframe.apply(lambda x: get_hashtags(x['message']), axis=1)
    # Pre-process
    dataframe['preprocessed'] = dataframe.apply(lambda x: preprocess(x['message']), axis=1)
    # Extract parts of speech
    dataframe['tagged'] = dataframe.apply(lambda x: tag_tokens(x['preprocessed']), axis=1)
    # Extract keywords
    dataframe['keywords'] = dataframe.apply(lambda x: get_keywords(x['tagged'], 'all'), axis=1)
    # Extract noun phrases
    dataframe['noun_phrases'] = dataframe.apply(lambda x: get_noun_phrases(x['tagged']), axis=1)

    return dataframe
 
df_posts = execute_pipeline(df_posts) 
df_comments = execute_pipeline(df_comments) 

As a result, we obtain two data frames, df_posts and df_comments, which contain all the information required for further steps.
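Before moving on, here is a minimal sketch of how this output can be sanity-checked, for example by counting the most frequent keywords across all posts (Counter comes from the standard library; the column name matches the one created by execute_pipeline()):

from collections import Counter

keyword_counts = Counter()
for keywords in df_posts['keywords']:
    keyword_counts.update(keywords)

# Print the ten most frequent keywords across all posts
print(keyword_counts.most_common(10))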

We can now start the most exciting part of the workflow—the analysis itself.
