Application

In this application, we will look at predicting the gender of a writer based on their use of different words. We will use a Naive Bayes method for this, trained in MapReduce. The final model doesn't need MapReduce to make predictions, although we can use a Map step to apply it; that is, to run the prediction model on each document in a list. This is a common Map operation for data mining in MapReduce, with the Reduce step simply organizing the list of predictions so they can be traced back to the original documents.

We will be using Amazon's infrastructure to run our application, allowing us to leverage their computing resources.

Getting the data

The data we are going to use is a set of blog posts that are labeled for age, gender, industry (that is, field of work) and, funnily enough, star sign. This data was collected from http://blogger.com in August 2004 and has over 140 million words in more than 600,000 posts. Each blog is probably written by just one person, with some work put into verifying this (although we can never be really sure). Posts are also matched with the date of posting, making this a very rich dataset.

To get the data, go to http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm and click on Download Corpus. From there, unzip the file to a directory on your computer.

The dataset is organized with a single blog to a file, with the filename giving the classes. For instance, one of the filenames is as follows:

1005545.male.25.Engineering.Sagittarius.xml

The filename is separated by periods, and the fields are as follows:

  • Blogger ID: This is a simple ID value used to organize the identities.
  • Gender: This is either male or female, and all the blogs are identified as one of these two options (no other options are included in this dataset).
  • Age: The exact ages are given, but some gaps are deliberately present. The ages present fall in the (inclusive) ranges of 13-17, 23-27, and 33-48. The reason for the gaps is to allow the blogs to be split into age ranges with clear separation, as it would be quite difficult to separate an 18 year old's writing from a 19 year old's, and it is possible that the reported age itself is a little out of date.
  • Industry: This is one of 40 different industries, including science, engineering, arts, and real estate. Also included is indUnk, for an unknown industry.
  • Star Sign: This is one of the 12 astrological star signs.
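For example, we can recover these fields by splitting the filename on periods. This is just a quick sketch; the variable names are only for illustration:

fields = "1005545.male.25.Engineering.Sagittarius.xml".split(".")
blogger_id, gender, age, industry, star_sign = fields[:5]
print(gender, age)   # male 25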

All values are self-reported, meaning there may be errors or inconsistencies in the labeling, but they are assumed to be mostly reliable—people had the option of not setting values if they wanted to preserve their privacy in those ways.

A single file is in a pseudo-XML format, containing a <Blog> tag and then a sequence of <post> tags. Each <post> tag is preceded by a <date> tag. While we could parse this as XML, it is much simpler to parse it on a line-by-line basis, as the files are not exactly well-formed XML and contain some errors (mostly encoding problems). To read the posts in a file, we can use a loop to iterate over its lines.

We set a test filename so we can see this in action:

import os
filename = os.path.join(os.path.expanduser("~"), "Data", "blogs", "1005545.male.25.Engineering.Sagittarius.xml")

First, we create a list that will let us store each of the posts:

all_posts = []

Then, we open the file to read:

with open(filename) as inf:

We then set a flag indicating whether we are currently in a post. We will set this to True when we find a <post> tag indicating the start of a post and set it to False when we find the closing </post> tag:

    post_start = False

We then create a list that stores the current post's lines:

    post = []

We then iterate over each line of the file and remove white space:

    for line in inf:
        line = line.strip()

As stated before, if we find the opening <post> tag, we indicate that we are in a new post. Likewise, with the closing </post> tag:

        if line == "<post>":
            post_start = True
        elif line == "</post>":
            post_start = False

When we do find the closing </post> tag, we also record the full post that we have found so far and then start a new "current" post. This code is at the same indentation level as the previous line:

            all_posts.append("
".join(post))
            post = []

Finally, when the line isn't a start or end tag, but we are in a post, we add the text of the current line to our current post:

        elif post_start:
            post.append(line)

If we aren't in a current post, we simply ignore the line.
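Putting these fragments together, the complete loop (a restatement of the code above, shown in one place for reference) is:

all_posts = []
with open(filename) as inf:
    post_start = False
    post = []
    for line in inf:
        line = line.strip()
        if line == "<post>":
            post_start = True
        elif line == "</post>":
            post_start = False
            # record the completed post and start a new one
            all_posts.append("\n".join(post))
            post = []
        elif post_start:
            post.append(line)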

We can then grab the text of each post:

print(all_posts[0])

We can also find out how many posts this author created:

print(len(all_posts))

Naive Bayes prediction

We are now going to implement the Naive Bayes algorithm (technically, a reduced version of it, without many of the features that more complex implementations have) that is able to process our dataset.

The mrjob package

The mrjob package allows us to create MapReduce jobs that can easily be transported to Amazon's infrastructure. While mrjob sounds like an unlikely addition to the Mr. Men series of children's books, it actually stands for Map Reduce Job. It is a great package; however, as of the time of writing, its Python 3 support is not yet mature, which is also true of the Amazon EMR service that we will discuss later on.

Note

You can install mrjob for Python 2 versions using the following:

sudo pip2 install mrjob

Note that we use pip2 here so that mrjob is installed for Python 2, not Python 3.

In essence, mrjob provides the standard functionality that most MapReduce jobs need. Its most amazing feature is that you can write the same code, test on your local machine without Hadoop, and then push to Amazon's EMR service or another Hadoop server.

This makes testing the code significantly easier, although it can't magically make a big problem small—note that any local testing uses a subset of the dataset, rather than the whole, big dataset.
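As a quick taste of what an mrjob job looks like, here is a minimal word-count sketch (not part of our application; the class and file names are just for illustration):

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, key, line):
        # emit each word with a count of 1
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum the counts for each word
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()

Saved as word_count.py, this could be tested locally with python word_count.py some_text_file, and the same script can later be pointed at a Hadoop cluster or Amazon EMR.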

Extracting the blog posts

We are first going to create a MapReduce program that will extract each of the posts from each blog file and store them as separate entries. As we are interested in the gender of the author of the posts, we will extract that too and store it with the post.

We can't do this in an IPython Notebook, so instead open a Python IDE for development. If you don't have a Python IDE (such as PyCharm), you can use a text editor. I recommend looking for an IDE that has syntax highlighting.

Note

If you still can't find a good IDE, you can write the code in an IPython Notebook and then click on File | Download As | Python. Save this file to a directory and run it as we outlined in Chapter 11, Classifying Objects in Images using Deep Learning.

To do this, we will need the os and re libraries as we will be obtaining environment variables and we will also use a regular expression for word separation:

import os
import re

We then import the MRJob class, which we will inherit from our MapReduce job:

from mrjob.job import MRJob

We then create a new class that subclasses MRJob:

class ExtractPosts(MRJob):

We will use a loop similar to the one we used before to extract blog posts from the file. The mapping function we will define next works on each line, meaning we have to track different posts outside of the mapping function. For this reason, we make post_start and post class variables, rather than variables inside the function:

    post_start = False
    post = []

We then define our mapper function—this takes a line from a file as input and yields blog posts. The lines are guaranteed to arrive in order from the same file within a job, which allows us to use the above class variables to record the current post's data:

    def mapper(self, key, line):

Before we start collecting blog posts, we need the gender of the blog's author. While we don't normally use the filename as part of MapReduce jobs, there are cases (such as this one) where it is genuinely needed, so the functionality is available. The name of the current file is stored in an environment variable, which we can obtain using the following line of code:

        filename = os.environ["map_input_file"]

We then split the filename to get the gender (which is the second token):

        gender = filename.split(".")[1]

We remove whitespace from the start and end of the line (there is a lot of whitespace in these documents) and then do our post-based tracking as before:

        line = line.strip()
        if line == "<post>":
            self.post_start = True
        elif line == "</post>":
            self.post_start = False

Rather than storing the posts in a list, as we did earlier, we yield them. This allows mrjob to track the output. We yield both the gender and the post so that we keep a record of which gender each post belongs to. The rest of this function is defined in the same way as our earlier loop:

            yield gender, repr("
".join(self.post))
            self.post = []
        elif self.post_start:
            self.post.append(line)

Finally, outside the function and class, we set the script to run this MapReduce job when it is called from the command line:

if __name__ == '__main__':
    ExtractPosts.run()

Now, we can run this MapReduce job using the following shell command. Note that we are using Python 2, not Python 3, to run this:

python extract_posts.py <your_data_folder>/blogs/51* --output-dir=<your_data_folder>/blogposts --no-output

The first parameter, <your_data_folder>/blogs/51* (just remember to change <your_data_folder> to the full path to your data folder), obtains a sample of the data (all files starting with 51, which is only 11 documents). We then set the output directory to a new folder, which we put in the data folder, and specify not to output the streamed data. Without the last option, the output data is printed to the command line when we run it—which isn't very helpful to us and slows down the computer quite a lot.

Run the script, and quite quickly each of the blog posts will be extracted and stored in our output folder. This script ran on only a single thread on the local computer, so we didn't get any speedup, but we now know that the code runs.

We can now look in the output folder for the results. A number of files are created, and each file contains the blog posts, one per line, each preceded by the gender of the blog's author.


Training Naive Bayes

Now that we have extracted the blog posts, we can train our Naive Bayes model on them. The intuition is that we record the probability of a word being written by a particular gender. To classify a new sample, we would multiply the probabilities and find the most likely gender.
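To make this intuition concrete, here is a tiny sketch with made-up probabilities (not values from our dataset) showing how multiplying per-word probabilities gives a prediction. Later in the chapter we will sum log probabilities instead, to avoid underflow:

# Hypothetical per-gender word frequencies, for illustration only
toy_model = {"football": {"male": 0.01, "female": 0.005},
             "shopping": {"male": 0.002, "female": 0.008}}

def toy_predict(words):
    scores = {"male": 1.0, "female": 1.0}
    for word in words:
        for gender in scores:
            # multiply in the per-gender probability, with a tiny default for unseen words
            scores[gender] *= toy_model.get(word, {}).get(gender, 1e-15)
    return max(scores, key=scores.get)

print(toy_predict(["football", "shopping"]))   # prints the gender with the higher product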

The aim of this code is to output a file that lists each word in the corpus, along with the frequencies of that word for each gender. The output file will look something like this:

"'ailleurs"  {"female": 0.003205128205128205}
"'air"  {"female": 0.003205128205128205}
"'an"  {"male": 0.0030581039755351682, "female": 0.004273504273504274}
"'angoisse"  {"female": 0.003205128205128205}
"'apprendra"  {"male": 0.0013047113868622459, "female": 0.0014172668603481887}
"'attendent"  {"female": 0.00641025641025641}
"'autistic"  {"male": 0.002150537634408602}
"'auto"  {"female": 0.003205128205128205}
"'avais"  {"female": 0.00641025641025641}
"'avait"  {"female": 0.004273504273504274}
"'behind"  {"male": 0.0024390243902439024}
"'bout"  {"female": 0.002034152292059272}

The first value is the word and the second is a dictionary mapping the genders to the frequency of that word in that gender's writings.

Open a new file in your Python IDE or text editor. We will again need the os and re libraries, as well as NumPy and MRJob from mrjob. We also need itemgetter, as we will be sorting a dictionary:

import os
import re
import numpy as np
from mrjob.job import MRJob
from operator import itemgetter

We will also need MRStep, which outlines a step in a MapReduce job. Our previous job had only a single step, consisting of just a mapping function. This job will have two steps: we Map and Reduce, and then Reduce again. The intuition is the same as the pipelines we used in earlier chapters, where the output of one step is the input to the next step:

from mrjob.step import MRStep

We then create our word search regular expression and compile it, allowing us to find word boundaries. This type of regular expression is much more powerful than the simple split we used in some previous chapters, but if you are looking for a more accurate word splitter, I recommend using NLTK as we did in Chapter 6, Social Media Insight using Naive Bayes:

word_search_re = re.compile(r"[\w']+")

We define a new class for our training:

class NaiveBayesTrainer(MRJob):

We define the steps of our MapReduce job. There are two steps. The first step will extract the word occurrence probabilities. The second step will compare the two genders and output the probabilities for each word to our output file. In each MRStep, we specify the mapper and reducer functions, which are methods of this NaiveBayesTrainer class (we will write those functions next):

    def steps(self):
        return [
            MRStep(mapper=self.extract_words_mapping,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.compare_words_reducer),
            ]

The first function is the mapper function for the first step. The goal of this function is to take each blog post, get all the words in that post, and then note their occurrence. We want the frequencies of the words, so we yield 1 / len(all_words) for each word, which allows us to later sum these values into frequencies. The computation here isn't exactly correct—we would also need to normalize for the number of documents. In this dataset, however, the class sizes are the same, so we can conveniently ignore this with little impact on our final result.

We also output the gender of the post's author, as we will need that later:

    def extract_words_mapping(self, key, value):
        tokens = value.split()
        gender = eval(tokens[0])
        blog_post = eval(" ".join(tokens[1:]))
        all_words = word_search_re.findall(blog_post)
        all_words = [word.lower() for word in all_words]
        for word in all_words:
            yield (gender, word), 1. / len(all_words)

Tip

We used eval in the preceding code to simplify the parsing of the blog posts from the file, for this example. This is not recommended. Instead, use a format such as JSON to properly store and parse the data from the files. A malicious user with access to the dataset could insert code into these tokens and have that code run on your server.
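For example, JSON round-trips the post text safely, so json.dumps and json.loads could stand in for repr and eval in the scripts above (a sketch of the idea, not how the chapter's code is written):

import json

post_text = "line one\nline two"
encoded = json.dumps(post_text)    # safe, text-only serialization
decoded = json.loads(encoded)      # parsing cannot execute arbitrary code
assert decoded == post_text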

In the reducer for the first step, we sum the frequencies for each gender and word pair. We also change the key to be the word, rather than the (gender, word) combination, as this allows us to search by word when we use the final trained model (although we still need to output the gender for later use):

    def reducer_count_words(self, key, frequencies):
        s = sum(frequencies)
        gender, word = key
        yield word, (gender, s)

The final step doesn't need a mapper function, so we don't add one; the data passes straight through as if by an identity mapper. The reducer, however, combines the frequencies for each gender under the given word and then outputs the word and a frequency dictionary.

This gives us the information we needed for our Naive Bayes implementation:

    def compare_words_reducer(self, word, values):
        per_gender = {}
        for value in values:
            gender, s = value
            per_gender[gender] = s
        yield word, per_gender

Finally, we set the code to run this model when the file is run as a script:

if __name__ == '__main__':
    NaiveBayesTrainer.run()

We can then run this script. The input to this script is the output of the previous post-extractor script (we could actually have them as different steps in the same MapReduce job if you are so inclined):

python nb_train.py <your_data_folder>/blogposts/ --output-dir=<your_data_folder>/models/ --no-output

The output directory is a folder that will store a file containing the output from this MapReduce job, which will be the probabilities we need to run our Naive Bayes classifier.

Putting it all together

We can now actually run the Naive Bayes classifier using these probabilities. We will do this in an IPython Notebook, and can go back to using Python 3 (phew!).

First, take a look at the models folder that was specified in the last MapReduce job. If the output was more than one file, we can merge the files by just appending them to each other using a command line function from within the models directory:

cat * > model.txt

If you do this, you'll need to update the following code with model.txt as the model filename.
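If you would rather do the merge from Python (for example, on a platform without cat), a small sketch like the following works; the paths are assumptions that you would adjust to your own setup:

import glob
import os

models_dir = os.path.join(os.path.expanduser("~"), "models")
with open(os.path.join(models_dir, "model.txt"), "w") as outf:
    # concatenate all of the part-* output files into a single model file
    for part in sorted(glob.glob(os.path.join(models_dir, "part-*"))):
        with open(part) as inf:
            outf.write(inf.read())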

Back to our Notebook, we first import some standard imports we need for our script:

import os
import re
import numpy as np
from collections import defaultdict
from operator import itemgetter

We again redefine our word search regular expression—if you were doing this in a real application, I recommend centralizing this. It is important that words are extracted in the same way for training and testing:

word_search_re = re.compile(r"[\w']+")

Next, we create the function that loads our model from a given filename:

def load_model(model_filename):

The model parameters will take the form of a dictionary of dictionaries, where the first key is a word and the inner dictionary maps each gender to a probability. We use defaultdicts, which return zero if a value isn't present:

    model = defaultdict(lambda: defaultdict(float))

We then open the model file and parse each line:

    with open(model_filename) as inf:
        for line in inf:

The line is split into two sections, separated by whitespace. The first is the word itself and the second is a dictionary of probabilities. For each, we run eval on them to get the actual value, which was stored using repr in the previous code:

            word, values = line.split(maxsplit=1)
            word = eval(word)
            values = eval(values)

We then track the values to the word in our model:

            model[word] = values
    return model

Next, we load our actual model. You may need to change the model filename—it will be in the output directory of the last MapReduce job:

model_filename = os.path.join(os.path.expanduser("~"), "models", "part-00000")
model = load_model(model_filename)

As an example, we can see the difference in usage of the word i (all words are turned into lowercase in the MapReduce jobs) between males and females:

model["i"]["male"], model["i"]["female"]

Next, we create a function that can use this model for prediction. We won't use the scikit-learn interface for this example, and just create a function instead. Our function takes the model and a document as the parameters and returns the most likely gender:

def nb_predict(model, document):

We start by creating a dictionary to map each gender to the computed probability:

    probabilities = defaultdict(lambda : 1)

We extract each of the words from the document:

    words = word_search_re.findall(document)

We then iterate over the words and find the probability for each gender in the dataset:

    for word in set(words):
        probabilities["male"] += np.log(model[word].get("male", 1e-15))
        probabilities["female"] += np.log(model[word].get("female", 1e-15))

We then sort the genders by their value, get the highest value, and return that as our prediction:

    most_likely_genders = sorted(probabilities.items(), key=itemgetter(1), reverse=True)
    return most_likely_genders[0][0]

It is important to note that we used np.log to compute the probabilities. Probabilities in Naive Bayes models are often quite small. Multiplying small values, which is necessary in many statistical methods, can lead to an underflow error, where the computer's precision isn't good enough and the whole value simply becomes 0. In this case, it would cause the likelihoods for both genders to be zero, leading to incorrect predictions.

To get around this, we use log probabilities. For two values a and b, log(a×b) is equal to log(a) + log(b). The log of a small probability is a negative value, but one that is relatively large in magnitude. For instance, log(0.00001) is about -11.5. This means that rather than multiplying actual probabilities and risking an underflow error, we can sum the log probabilities and compare the values in the same way (higher numbers still indicate a higher likelihood).

One problem with using log probabilities is that they don't handle zero values well (although, neither does multiplying by zero probabilities). This is because log(0) is undefined. In some implementations of Naive Bayes, a 1 is added to all counts to get rid of this problem, but there are other ways to address it. This is a simple form of smoothing. In our code, we just use a very small value if the word hasn't been seen for the given gender.
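A quick demonstration of the underflow problem, and of how summing logs avoids it:

import numpy as np

p = 1e-5
product = 1.0
log_sum = 0.0
for _ in range(100):
    product *= p          # repeatedly multiplying small probabilities
    log_sum += np.log(p)  # summing log probabilities instead

print(product)   # 0.0 -- the product has underflowed to zero
print(log_sum)   # about -1151.3 -- still a usable, comparable number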

Back to our prediction function, we can test this by copying a post from our dataset:

new_post = """ Every day should be a half day.  Took the afternoon off to hit the dentist, and while I was out I managed to get my oil changed, too.  Remember that business with my car dealership this winter?  Well, consider this the epilogue.  The friendly fellas at the Valvoline Instant Oil Change on Snelling were nice enough to notice that my dipstick was broken, and the metal piece was too far down in its little dipstick tube to pull out.  Looks like I'm going to need a magnet.   Damn you, Kline Nissan, daaaaaaammmnnn yooouuuu....   Today I let my boss know that I've submitted my Corps application.  The news has been greeted by everyone in the company with a level of enthusiasm that really floors me.     The back deck has finally been cleared off by the construction company working on the place.  This company, for anyone who's interested, consists mainly of one guy who spends his days cursing at his crew of Spanish-speaking laborers.  Construction of my deck began around the time Nixon was getting out of office.
"""

We then predict with the following code:

nb_predict(model, new_post)

The resulting prediction, male, is correct for this example. Of course, we never test a model on a single sample. We used the files starting with 51 for training this model. That wasn't many samples, so we can't expect too high an accuracy.

The first thing we should do is train on more samples. We will test on any file that starts with a 6 or 7, and train on the rest of the files (starting locally with a subset, and later using the full training set on Amazon's infrastructure).

In the command line, go to your data folder (cd <your_data_folder>), where the blogs folder exists; we will copy subsets of the blog data into new folders.

Make a folder for our training set:

mkdir blogs_train

Copy any file starting with a 4 or 8 into the training set:

cp blogs/4* blogs_train/
cp blogs/8* blogs_train/

Then, make a folder for our test set:

mkdir blogs_test

Copy any file starting with a 6 or 7 into the test set:

cp blogs/6* blogs_test/
cp blogs/7* blogs_test/

We will rerun the blog extraction on all files in the training set. For the full dataset, this is a large computation that is better suited to cloud infrastructure than our system, so we will move the parsing job to Amazon's infrastructure later in this chapter. For now, we run it locally on the smaller training set.

Run the following on the command line, as you did before. The only difference is the folder of input files. Before you run the following commands, delete all files in the blogposts and models folders:

python extract_posts.py <your_data_folder>/blogs_train --output-dir=<your_data_folder>/blogposts --no-output
python nb_train.py <your_data_folder>/blogposts/ --output-dir=<your_data_folder>/models/ --no-output

The code here will take quite a bit longer to run.

We will test on any blog file in our test set. To get the files, we need to extract them. We will use the extract_posts.py MapReduce job, but store the files in a separate folder:

python extract_posts.py <your_data_folder>/blogs_test --output-dir=<your_data_folder>/blogposts_testing --no-output

Back in the IPython Notebook, we list all the outputted testing files:

testing_folder = os.path.join(os.path.expanduser("~"), "Data", "blogposts_testing")
testing_filenames = []
for filename in os.listdir(testing_folder):
    testing_filenames.append(os.path.join(testing_folder, filename))

For each of these files, we extract the gender and document and then call the predict function. We do this in a generator, as there are a lot of documents, and we don't want to use too much memory. The generator yields the actual gender and the predicted gender:

def nb_predict_many(model, input_filename):
    with open(input_filename) as inf:
        # each line holds the gender followed by the repr of the post text
        for line in inf:
            tokens = line.split()
            actual_gender = eval(tokens[0])
            blog_post = eval(" ".join(tokens[1:]))
            yield actual_gender, nb_predict(model, blog_post)

We then record the predictions and actual genders across our test data. Our predictions here are either male or female. In order to use the f1_score function from scikit-learn, we need to turn these into ones and zeroes. To do that, we record a 0 if the gender is male and a 1 if it is female, using a Boolean test of whether the gender is female. We then convert these Boolean values to int using NumPy:

y_true = []
y_pred = []
for actual_gender, predicted_gender in nb_predict_many(model, testing_filenames[0]):
    y_true.append(actual_gender == "female")
    y_pred.append(predicted_gender == "female")
y_true = np.array(y_true, dtype='int')
y_pred = np.array(y_pred, dtype='int')

Now, we test the quality of this result using the F1 score in scikit-learn:

from sklearn.metrics import f1_score
print("f1={:.4f}".format(f1_score(y_true, y_pred, pos_label=None)))

The result of 0.78 is not bad. We can probably improve this by using more data, but to do that, we need to move to a more powerful infrastructure that can handle it.

Training on Amazon's EMR infrastructure

We are going to use Amazon's Elastic MapReduce (EMR) infrastructure to run our parsing and model building jobs.

In order to do that, we first need to create a bucket in Amazon's storage cloud. To do this, open the Amazon S3 console in your web browser by going to http://console.aws.amazon.com/s3 and clicking on Create Bucket. Remember the name of the bucket, as we will need it later.

Right-click on the new bucket and select Properties. Then, change the permissions, granting everyone full access. This is not a good security practice in general, and I recommend that you change the access permissions after you complete this chapter.

Left-click the bucket to open it and click on Create Folder. Name the folder blogs_train. We are going to upload our training data to this folder for processing on the cloud.

On your computer, we are going to use Amazon's AWS CLI, a command-line interface for processing on Amazon's cloud.

To install it, use the following:

sudo pip2 install awscli

Follow the instructions at http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html to set the credentials for this program.
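Once the credentials are set, you can check that the CLI can reach your account (and, if you prefer, create the bucket from the command line instead of the console). The bucket name ch12 is just the example name used in this chapter; substitute your own:

aws s3 ls
aws s3 mb s3://ch12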

We now want to upload our data to our new bucket. First, we want to create our dataset, which is all the blogs not starting with a 6 or 7. There are more graceful ways to do this copy, but none are cross-platform enough to recommend. Instead, simply copy all the files and then delete the ones that start with a 6 or 7, from the training dataset:

cp -R ~/Data/blogs ~/Data/blogs_train_large
rm ~/Data/blogs_train_large/6*
rm ~/Data/blogs_train_large/7*

Next, upload the data to your Amazon S3 bucket. Note that this will take some time and use quite a lot of upload bandwidth (several hundred megabytes). For those with slower Internet connections, it may be worth doing this at a location with a faster connection:

aws s3 cp ~/Data/blogs_train_large/ s3://ch12/blogs_train_large --recursive --exclude "*" --include "*.xml"

We are going to connect to Amazon's EMR using mrjob—it handles the whole thing for us; it only needs our credentials to do so. Follow the instructions at https://pythonhosted.org/mrjob/guides/emr-quickstart.html to set up mrjob with your Amazon credentials.

After this is done, we alter our mrjob runs only slightly to run on Amazon EMR. We just tell mrjob to use EMR with the -r switch, and then set our S3 buckets as the input and output directories. Even though this will be run on Amazon's infrastructure, it will still take quite a long time to run:

python extract_posts.py -r emr s3://ch12/blogs_train_large/ --output-dir=s3://ch12/blogposts_train/ --no-output
python nb_train.py -r emr s3://ch12/blogposts_train/ --output-dir=s3://ch12/model/ --no-output

Note

You will also be charged for the usage. This will only be a few dollars, but keep it in mind if you are going to keep running jobs or start running jobs on bigger datasets. I ran a very large number of jobs and was charged about $20 in total. Running just these few should cost less than $4. However, you can check your balance and set up pricing alerts by going to https://console.aws.amazon.com/billing/home.

It isn't necessary for the blogposts_train and model folders to exist—they will be created by EMR. In fact, if they exist, you will get an error. If you are rerunning this, just change the names of these folders to something new, but remember to change both commands to the same names (that is, the output directory of the first command is the input directory of the second command).

Note

If you are getting impatient, you can always stop the first job after a while and just use the training data gathered so far. I recommend leaving the job running for an absolute minimum of 15 minutes, and probably at least an hour. You can't stop the second job and still get good results, though; the second job will probably take about two to three times as long as the first job did.

You can now go back to the S3 console and download the output model from your bucket. Saving it locally, we can go back to our IPython Notebook and use the new model. We reenter the code here; the only difference is the model filename, which is updated to point to our new model:

aws_model_filename = os.path.join(os.path.expanduser("~"), "models", "aws_model")
aws_model = load_model(aws_model_filename)
y_true = []
y_pred = []
for actual_gender, predicted_gender in nb_predict_many(aws_model, testing_filenames[0]):
    y_true.append(actual_gender == "female")
    y_pred.append(predicted_gender == "female")
y_true = np.array(y_true, dtype='int')
y_pred = np.array(y_pred, dtype='int')
print("f1={:.4f}".format(f1_score(y_true, y_pred, pos_label=None)))

The result is much better with the extra data, at 0.81.

Note

If everything went as planned, you may want to remove the bucket from Amazon S3—you will be charged for the storage.
