Exploring the headlines

Let's start by creating a function we can use to examine the most common word tuples (n-grams). We'll set it up so that we can reuse it later on the body text as well. We'll do this using the Python Natural Language Toolkit (NLTK) library, which can be installed with pip (pip install nltk) if you don't already have it.
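If you haven't used NLTK's stopword list before, it also needs to be downloaded once; this is a one-time setup step:

import nltk 
nltk.download('stopwords') 

With that in place, here is the function: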

from nltk.util import ngrams 
from nltk.corpus import stopwords 
import re 
import pandas as pd 
 
def get_word_stats(txt_series, n, rem_stops=False): 
    txt_words = [] 
    txt_len = [] 
    stops = set(stopwords.words('english')) 
    for w in txt_series: 
        if w is not None: 
            # lower-case the text and keep runs of letters, digits, and apostrophes 
            tokens = re.findall(r"[a-z0-9']+", w.lower()) 
            if rem_stops: 
                tokens = [x for x in tokens if x not in stops] 
            word_list = list(ngrams(tokens, n)) 
            txt_words.extend(word_list) 
            txt_len.append(len(word_list)) 
    # frequency of each n-gram tuple, and the n-gram count per row 
    return pd.Series(txt_words).value_counts().to_frame('count'), pd.DataFrame(txt_len, columns=['count']) 

There is a lot in there, so let's unpack it. We created a function that takes in a series, an integer, and a Boolean value. The integer determines the n we'll use for the n-gram parsing, while the Boolean determines whether or not we exclude stop words. The function returns two things: the frequency of each tuple and the number of tuples per row.
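To make the n-gram step concrete, here is a minimal standalone sketch of what the tokenize-and-ngram logic inside the function produces for n=2 (the sample headline is made up purely for illustration):

from nltk.util import ngrams 
import re 
 
sample = "17 cats that will make your day"   # hypothetical headline, for illustration only 
tokens = re.findall(r"[a-z0-9']+", sample.lower()) 
list(ngrams(tokens, 2)) 
# [('17', 'cats'), ('cats', 'that'), ('that', 'will'), ('will', 'make'), ('make', 'your'), ('your', 'day')] 

Each bi-gram is a tuple of adjacent tokens, which is exactly what ends up being counted by value_counts() in the function above.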

Let's run it on our headlines, while retaining the stop words. We'll begin with just single words:

hw, hl = get_word_stats(dfc['title'], 1, False) 
 
hl 

This generates the following output:

Now, we have the word count for each headline. Let's see what the stats on this look like:

hl.describe() 

This code generates the following output:

We can see that the median headline length for our viral stories comes in at exactly 11 words. Let's take a look at the most frequently used words:
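Since hw is the value_counts() table returned by our function and is already sorted in descending order, displaying it shows them directly:

hw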

That is not exactly useful, but it is in keeping with what we might expect. Now, let's take a look at the same information for bi-grams:

hw, hl = get_word_stats(dfc['title'], 2, False) 
 
hw 

This generates the following output:

This is definitely more interesting. We can start to see some of the same components appearing in headlines over and over again. The two that stand out are (donald, trump) and (dies, at). Trump makes sense, as he made a number of headline-grabbing statements during the election, but I was surprised by the dies headlines. I took a look at them, and apparently a number of high-profile people died in the year in question, so that also makes sense.
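If you want to spot-check those headlines yourself, one quick way is to filter the titles with pandas string matching (a sketch, not part of the original analysis):

dfc[dfc['title'].str.lower().str.contains('dies at', na=False)]['title'].head() 

The na=False argument simply treats missing titles as non-matches.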

Now, let's run this with the stop words removed:

hw, hl = get_word_stats(dfc['title'], 2, True) 
 
hw 

This generates the following output:

Again, we can see many things we might expect. It looks as though, if we changed how we parse numbers (replacing each of them with a single identifier such as number), we would likely see more of these bubble up. I'll leave that as an exercise for the reader, if you'd like to attempt it; one possible approach is sketched below.
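One way to attempt it (a sketch, not part of the original analysis) is to collapse every run of digits into a placeholder token before the text reaches the tokenizer:

import re 
 
def normalize_numbers(text): 
    # replace each run of digits with a single placeholder token 
    return re.sub(r'[0-9]+', 'number', text.lower()) 
 
normalize_numbers("27 Things You Won't Believe")   # "number things you won't believe" 

Feeding titles through this before (or inside) get_word_stats would group headlines that differ only by their numbers under the same n-grams.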

Now, let's take a look at tri-grams:

hw, hl = get_word_stats(dfc['title'], 3, False) 
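As before, we display the frequency table:

hw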

This code generates the following output:

It seems that the more words we include, the more the headlines come to resemble the classic BuzzFeed prototype. In fact, let's see whether that's the case. We haven't looked at which sites produce the most viral stories; let's see whether BuzzFeed leads the charts:

dfc['site'].value_counts().to_frame() 

This generates the following output:

We can clearly see that BuzzFeed dominates the list. In a distant second place, we can see The Huffington Post, which incidentally is another site that Jonah Peretti worked for. It appears that studying the science of virality can pay big dividends.
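To put that dominance in proportion rather than raw counts, one option (a quick sketch, not from the original) is to normalize the counts into shares:

dfc['site'].value_counts(normalize=True).to_frame('share').head(10) 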

So far, we have examined images and headlines. Now, let's move on to examining the full text of the stories.
