Chapter 6. Social Media Insight Using Naive Bayes

Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining.

We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this chapter is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets.

We will cover the following topics in this chapter:

  • Downloading data from social network APIs
  • Transformers for text
  • Naive Bayes classifier
  • Using JSON for saving and loading datasets
  • The NLTK library for extracting features from text
  • The F-measure for evaluation


Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called metadata, and text lacks it. A book does contain some metadata in the form of a table of contents and an index, but far less than a database does.

One of the problems is term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do.

In this chapter, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available although hashtags are often used to denote the topic of the tweet.

When people talk about Python, they could be talking about the following things:

  • The programming language Python
  • Monty Python, the classic comedy group
  • The snake Python
  • A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage. It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to and register an account if you do not already have one.

Next, you'll need to ensure that you only make a certain number of requests in a given time period. This limit is currently 180 requests per hour. It can be tricky to ensure that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.
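
If you ever need to pace requests yourself rather than rely on a library, one approach is to remember when recent requests were made and to wait once the budget is used up. The following is only a minimal sketch: the RateLimiter class and its parameters are hypothetical names, not part of Twitter's API or of any Twitter library, and the default of 180 requests per hour is simply the figure quoted above.

```python
import time
from collections import deque


class RateLimiter:
    """Hypothetical helper: allow at most max_calls in any window of period seconds."""

    def __init__(self, max_calls=180, period=3600.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable, so the class can be tried without waiting
        self.calls = deque()    # timestamps of the calls still inside the window

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self.clock()
        # Forget calls that have already left the sliding window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.period - (now - self.calls[0])

    def record(self):
        """Note that a request is being made right now."""
        self.calls.append(self.clock())
```

Before each request you would call wait_time(), sleep for that many seconds if the result is greater than zero, and then call record().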

You will need a key to access Twitter's data. Go to and sign in to your account.

When you are logged in, go to and click on Create New App.

Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application.

Keep the resulting website open—you'll need the access keys that are on this page. Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library.


You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at

Create a new IPython Notebook to download the data. We will create several notebooks in this chapter for various different purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data.

First, we import the twitter library and set our authorization tokens. The consumer key and consumer secret will be available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)

We are going to get our tweets from Twitter's search function. We will create a reader that connects to twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored:

import os
output_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
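
Opening this file for writing will fail if the Data/twitter folder does not exist yet. The sketch below shows one way to guard against that. The helper name ensure_parent_folder is an assumption made for illustration and is not part of the original notebook, and the demonstration uses a throwaway temporary folder rather than your home directory.

```python
import os
import tempfile


def ensure_parent_folder(filename):
    """Create the folder that will contain filename, if it is missing."""
    folder = os.path.dirname(filename)
    if folder:
        # exist_ok=True makes this safe to call on every run
        os.makedirs(folder, exist_ok=True)
    return folder


# Demonstration in a temporary location
demo_root = tempfile.mkdtemp()
demo_file = os.path.join(demo_root, "Data", "twitter", "python_tweets.json")
created = ensure_parent_folder(demo_file)
```

In the notebook, calling ensure_parent_folder(output_filename) once before opening the file would have the same effect.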

We also need the json library for saving our tweets:

import json

Next, create an object that can read from Twitter. We create this object with our authorization object that we set up earlier:

t = twitter.Twitter(auth=authorization)

We then open our output file for writing. We open it for appending—this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word Python. We only want the statuses that are returned for our dataset. This code takes the tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file. It then creates a blank line under the tweet so that we can easily distinguish where one tweet starts and ends in our file:

with open(output_filename, 'a') as output_file:
    search_results = t.search.tweets(q="python", count=100)['statuses']
    for tweet in search_results:
        if 'text' in tweet:
            output_file.write(json.dumps(tweet))
            output_file.write("\n\n")


In the preceding loop, we also check whether there is text in the tweet. Not all of the objects returned by Twitter will be actual tweets (some will be other actions, such as requests to delete tweets). The key difference is the inclusion of text as a key, which we test for.

Running this for a few minutes will result in 100 tweets being added to the output file.


You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too fast (that is, before Twitter gets new tweets to return!).
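
Because rerunning the search can return tweets you already have, it can help to drop duplicates before labeling. The sketch below assumes each tweet dictionary carries a unique id field, the same field used later in this chapter to build the replicable dataset. The function name and the sample tweets are illustrative only.

```python
def drop_duplicate_tweets(tweets):
    """Keep the first copy of each tweet, judged by its 'id' field."""
    seen_ids = set()
    unique = []
    for tweet in tweets:
        if tweet['id'] in seen_ids:
            continue
        seen_ids.add(tweet['id'])
        unique.append(tweet)
    return unique


# Toy data standing in for real search results
sample = [
    {'id': 1, 'text': "Learning Python today"},
    {'id': 2, 'text': "Monty Python marathon"},
    {'id': 1, 'text': "Learning Python today"},
]
unique_sample = drop_duplicate_tweets(sample)
```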

Loading and classifying the dataset

After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels.

The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy.

A key difference between our dataset and real JSON is that we included new lines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on.

To parse it, we can use the json library but we will have to first split the file by newlines to get the actual tweet objects themselves.
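
To make the file layout concrete, here is a small self-contained sketch that writes two made-up tweets in this format to an in-memory file and reads them back. The sample tweets are invented for illustration; real tweet objects have many more fields.

```python
import io
import json

sample_tweets = [
    {'id': 1, 'text': "Just released my first Python package"},
    {'id': 2, 'text': "Saw a python at the zoo\ntoday"},
]

# Write each tweet as one line of JSON followed by a blank line
buffer = io.StringIO()
for tweet in sample_tweets:
    buffer.write(json.dumps(tweet))
    buffer.write("\n\n")

# Read the tweets back, skipping the blank separator lines
buffer.seek(0)
loaded = []
for line in buffer:
    if len(line.strip()) == 0:
        continue
    loaded.append(json.loads(line))
```

Note that json.dumps escapes any newline inside a tweet's text, so each tweet always occupies exactly one line of the file.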

Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

As stated, we will use the json library, so import that too:

import json

We create a list that will store the tweets we received from the file:

tweets = []

We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows:

with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))

We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means that it refers to the programming language Python). We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets to allow us to easily and quickly classify the tweets as relevant or not.

The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled.

First, we create a list for storing the labels. These labels record whether or not the given tweet refers to the programming language Python, and they will allow our classifier to learn how to differentiate between meanings.

We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows:

labels = []
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)
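
The save-and-resume pattern can be tried in isolation. This minimal sketch uses a temporary file so that it does not touch your real labels file; the helper names load_labels and save_labels are assumptions for illustration.

```python
import json
import os
import tempfile


def load_labels(filename):
    """Return previously saved labels, or an empty list on a first run."""
    if os.path.exists(filename):
        with open(filename) as inf:
            return json.load(inf)
    return []


def save_labels(labels, filename):
    """Overwrite the labels file with everything labeled so far."""
    with open(filename, 'w') as outf:
        json.dump(labels, outf)


demo_filename = os.path.join(tempfile.mkdtemp(), "python_classes.json")
first_run = load_labels(demo_filename)    # nothing has been saved yet
demo_labels = first_run + [1, 0, 1]       # label three tweets
save_labels(demo_labels, demo_filename)   # checkpoint
resumed = load_labels(demo_filename)      # a later session picks up here
```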

Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows:

def get_next_tweet():
    return tweets[len(labels)]['text']
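
The idea is that labels are always entered in order, so the number of labels collected so far is also the index of the next unlabeled tweet. The sketch below illustrates this on toy data and also handles the case where every tweet has been labeled, which the short function above does not. The name next_unlabeled_text and the sample tweets are hypothetical.

```python
def next_unlabeled_text(tweets, labels):
    """Return the text of the first tweet without a label, or None when done."""
    position = len(labels)   # labels are entered in order, so this is the next index
    if position >= len(tweets):
        return None
    return tweets[position]['text']


toy_tweets = [{'text': "I love Python 3"}, {'text': "Python the snake"}]
first = next_unlabeled_text(toy_tweets, [])
second = next_unlabeled_text(toy_tweets, [1])
finished = next_unlabeled_text(toy_tweets, [1, 0])
```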


The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user.

Next, we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top:

%%javascript

The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function.

The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows:

function set_label(label){
    var kernel = IPython.notebook.kernel;
    kernel.execute("labels.append(" + label + ")");
    load_next_tweet();
}

At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle; we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier).

However, in this case we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callback is outside the scope of this module. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation.

The code is as follows:

function load_next_tweet(){
   var code_input = "get_next_tweet()";
   var kernel = IPython.notebook.kernel;
   var callbacks = { 'iopub' : {'output' : handle_output}};
   kernel.execute(code_input, callbacks, {silent:false});
}

The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this module. However, for our purposes the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows:

function handle_output(out){
   var res = out.content.data["text/plain"];
   $("div#tweet_text").html(res);
}

Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons—given that we will need to label at least 100 tweets.

Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section.

We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line:

%%html

For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows:

<div name="tweetbox">
    Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br>
Tweet: <div id="tweet_text" value="text"></div><br>
<input type=text id="capture"></input><br>
</div>

Don't run the cell just yet!

We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the JQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in the <script> tags.

We are only interested in key presses if the user presses the 0 or the 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows:

<script>
$("input#capture").keypress(function(e) {
if(e.which == 48) {
    set_label(0);
    $("input#capture").val("");
}else if (e.which == 49){
    set_label(1);
    $("input#capture").val("");
}
});

All other key presses are ignored.

As a last bit of JavaScript for this chapter (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox, alongside the first tweet's text. Click in the textbox and enter 1 if the tweet is relevant to our goal (that is, it refers to the programming language Python) and 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out.

When you finish all of this, simply save the labels to the output filename we defined earlier for the class values:

with open(labels_filename, 'w') as outf:
    json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets.

This might take a while! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications.

Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results.


Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

On running the preceding code, you will get a different dataset from the one I created and used. The main reason is that Twitter will return different search results depending on when you perform the search. Even then, your labeling of tweets might differ from mine. While there are obvious examples where a given tweet relates to the Python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect.

Due to these factors, it is difficult to replicate experiments on datasets that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly.

One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset.

First, we save the replicable dataset of tweet IDs. Create another new IPython Notebook and set up the filenames. This is done in the same way as for labeling, but there is a new filename where we can store the replicable dataset. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

We load the tweets and labels as we did in the previous notebook:

import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))
labels = []
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Now we create a dataset by looping over both the tweets and labels at the same time and saving those in a list:

dataset = [(tweet['id'], label) for tweet, label in zip(tweets, labels)]

Finally, we save the results in our file:

with open(replicable_dataset, 'w') as outf:
    json.dump(dataset, outf)
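
Two details of this step are worth seeing on toy data. First, zip stops at the shorter of its inputs, so any tweets at the end that have not been labeled yet are left out. Second, JSON has no tuple type, so each (ID, label) pair comes back from the file as a two-element list. The IDs and labels below are made up.

```python
import json

toy_tweets = [{'id': 101}, {'id': 102}, {'id': 103}]
toy_labels = [1, 0]   # the third tweet has not been labeled yet

# zip pairs items up only while both sequences have elements left
toy_dataset = [(tweet['id'], label) for tweet, label in zip(toy_tweets, toy_labels)]

# Simulate saving to and loading from a file
restored = json.loads(json.dumps(toy_dataset))

# A list of two-element lists can still be turned into a lookup table
lookup = dict(restored)
```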

Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this chapter, it can be found in the code bundle that comes with this course.

Loading the preceding dataset is not difficult but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows:

import os
tweet_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

Then load the tweet IDs from the file using JSON:

import json
with open(replicable_dataset) as inf:
    tweet_ids = json.load(inf)

Saving the labels is very easy. We just iterate through this dataset and extract the IDs. We could do this quite easily with just two lines of code (open file and save tweets). However, we can't guarantee that we will get all the tweets we are after (for example, some may have been changed to private since collecting the dataset) and therefore the labels will be incorrectly indexed against the data.

As an example, I tried to recreate the dataset just one day after collecting it and already two of the tweets were missing (they might have been deleted or made private by the user). For this reason, it is important to save only the labels for tweets that we actually recover. To do this, we first create an empty actual_labels list to store the labels for tweets that we recover from Twitter, and then create a dictionary mapping the tweet IDs to the labels.

The code is as follows:

actual_labels = []
label_mapping = dict(tweet_ids)
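
To see why the mapping matters, consider this offline sketch in which one tweet can no longer be retrieved and the others come back in a different order. The IDs, labels, and recovered tweets are invented; in the real notebook the recovered tweets come from Twitter.

```python
# (tweet ID, label) pairs as loaded from the replicable dataset
toy_tweet_ids = [[101, 1], [102, 0], [103, 1]]
toy_mapping = dict(toy_tweet_ids)

# Pretend tweet 102 has been deleted, so only two tweets come back
recovered = [{'id': 103, 'text': "import antigravity"},
             {'id': 101, 'text': "Python 3 is out"}]

# Look each label up by ID rather than by position in the original list
toy_actual_labels = [toy_mapping[tweet['id']] for tweet in recovered]
```

Taking the first two labels by position would have given 1 and 0, which pairs the label of the deleted tweet with a tweet it was never assigned to.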

Next, we are going to create a Twitter connection to collect all of these tweets. This is going to take a little longer. Import the twitter library that we used before, create an authorization token, and use that to create the twitter object:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)
t = twitter.Twitter(auth=authorization)

Iterate over each of the twitter IDs by extracting the IDs into a list using the following command:

all_ids = [tweet_id for tweet_id, label in tweet_ids]

Then, we open our output file to save the tweets:

with open(tweet_filename, 'a') as output_file:

The Twitter API allows us to get 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets:

    for start_index in range(0, len(tweet_ids), 100):

To search by ID, we first create a string that joins all of the IDs (in this batch) together:

        id_string = ",".join(str(i) for i in all_ids[start_index:start_index+100])
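
The batching and joining steps can be tried without any network access. This sketch uses a small batch size and made-up IDs so that the result is easy to inspect; the function name id_batches is an assumption for illustration.

```python
def id_batches(all_ids, batch_size=100):
    """Yield comma-separated strings of at most batch_size IDs each."""
    for start_index in range(0, len(all_ids), batch_size):
        batch = all_ids[start_index:start_index + batch_size]
        yield ",".join(str(i) for i in batch)


toy_ids = [11, 12, 13, 14, 15]
toy_batches = list(id_batches(toy_ids, batch_size=2))
```

The last batch is simply shorter when the number of IDs is not a multiple of the batch size, because slicing past the end of a list is allowed in Python.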

Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us:

        search_results = t.statuses.lookup(_id=id_string)

Then for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally:

        for tweet in search_results:
            if 'text' in tweet:
                output_file.write(json.dumps(tweet))
                output_file.write("\n\n")


As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows:

                actual_labels.append(label_mapping[tweet['id']])

Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while—Twitter does rate-limit requests. As a final step here, save the actual_labels to our classes file:

with open(labels_filename, 'w') as outf:
    json.dump(actual_labels, outf)
