Using the Enron dataset

Enron was one of the largest energy companies in the world in the late 1990s, reporting revenue of over $100 billion. It had over 20,000 staff and, as of the year 2000, there seemed to be no indication that anything was very wrong.

In 2001, the Enron scandal occurred, when it was discovered that Enron was undertaking systematic, fraudulent accounting practices. The fraud was deliberate, spread widely across the company, and involved significant amounts of money. After it was publicly discovered, Enron's share price dropped from more than $90 in 2000 to less than $1 in 2001. Enron filed for bankruptcy shortly afterwards, in a mess that would take more than five years to finally resolve.

As part of the investigation into Enron, the Federal Energy Regulatory Commission in the United States made more than 600,000 e-mails publicly available. Since then, this dataset has been used for everything from social network analysis to fraud analysis. It is also a great dataset for authorship analysis, as we are able to extract e-mails from the sent folder of individual users. This allows us to create a dataset much larger than many previous datasets.

Accessing the Enron dataset

The full set of Enron e-mails is available at https://www.cs.cmu.edu/~./enron/.

Note

The full dataset is 423 MB in a compression format called gzip. If you don't have a Linux-based machine to decompress (unzip) this file, get an alternative program, such as 7-zip (http://www.7-zip.org/).

Download the full corpus and decompress it into your data folder. By default, this will decompress into a folder called enron_mail_20110402.
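
If you prefer to extract the archive from Python instead, a short script along the following lines will do it using only the standard library. The archive name enron_mail_20110402.tgz used here is an assumption based on the folder name above; adjust the path to match the file you actually downloaded:

import os
import tarfile
# Hypothetical location of the downloaded archive inside the Data folder
archive_path = os.path.join(os.path.expanduser("~"), "Data", "enron_mail_20110402.tgz")
with tarfile.open(archive_path, "r:gz") as archive:
    archive.extractall(os.path.join(os.path.expanduser("~"), "Data"))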

As we are looking for authorship information, we only want the e-mails we can attribute to a specific author. For that reason, we will look in each user's sent folder—that is, e-mails they have sent.

In the Notebook, set up the data folder for the Enron dataset:

import os
import numpy as np
enron_data_folder = os.path.join(os.path.expanduser("~"), "Data", "enron_mail_20110402", "maildir")

Creating a dataset loader

We can now create a function that will choose a number of authors at random and return the e-mails in each of their sent folders. Specifically, we are looking for the payloads, that is, the content rather than the full e-mail messages. For that, we will need an e-mail parser. The code is as follows:

from email.parser import Parser
p = Parser()

We will be using this later to extract the payloads from the e-mail files that are in the data folder.

We will be choosing authors at random, so we will be using a random state that allows us to replicate the results if we want:

from sklearn.utils import check_random_state

Our data loading function has quite a few options. Most of these ensure that our dataset is relatively balanced. Some authors will have thousands of e-mails in their sent mail, while others will have only a few dozen. We limit our search to authors with at least 10 e-mails using the min_docs_author parameter and take a maximum of 100 e-mails from each author using the max_docs_author parameter. We also specify how many authors we want, 10 by default, using the num_authors parameter. The code is as follows:

def get_enron_corpus(num_authors=10, data_folder=enron_data_folder,
                     min_docs_author=10, max_docs_author=100,
                     random_state=None):
    random_state = check_random_state(random_state)

Next, we list all of the folders in the data folder; each corresponds to the e-mail address of an Enron employee. We then randomly shuffle them, allowing us to choose a new set every time the code is run. Remember that setting the random state will allow us to replicate this result:

    email_addresses = sorted(os.listdir(data_folder))
    random_state.shuffle(email_addresses)

Note

It may seem odd that we sort the e-mail addresses, only to shuffle them around. The os.listdir function doesn't always return the same results, so we sort it first to get some stability. We then shuffle using a random state, which means our shuffling can reproduce a past result if needed.

We then set up our documents and classes lists. We also create an author_num variable, which tells us which class to use for each new author. We won't use the enumerate trick we used earlier, as it is possible that we won't choose some authors. For example, if an author doesn't have 10 sent e-mails, we won't include them. The code is as follows:

    documents = []
    classes = []
    author_num = 0

We are also going to record which authors we used and which class number we assigned to them. This isn't for the data mining, but will be used in the visualization so we can identify the authors more easily. The dictionary will simply map e-mail usernames to class values. The code is as follows:

    authors = {}

Next, we iterate through each of the e-mail addresses and look for all subfolders with "sent" in the name, indicating a sent mail box. The code is as follows:

    for user in email_addresses:
        users_email_folder = os.path.join(data_folder, user)
        mail_folders = [os.path.join(users_email_folder, subfolder)
                        for subfolder in os.listdir(users_email_folder)
                        if "sent" in subfolder]

We then get each of the e-mails that are in this folder. I've surrounded this call in a try-except block, as some of the authors have subdirectories in their sent mail. We could use some more detailed code to get all of these e-mails, but for now we will just continue and ignore these users. The code is as follows:

        try:
            authored_emails = [open(os.path.join(mail_folder, email_filename),
                                    encoding='cp1252').read()
                               for mail_folder in mail_folders
                               for email_filename in os.listdir(mail_folder)]
        except IsADirectoryError:
            continue

Next, we check that we have at least 10 e-mails (or whatever min_docs_author is set to):

        if len(authored_emails) < min_docs_author:
            continue

Next, if we have too many e-mails from this author, we take only the first max_docs_author of them (100 by default):

        if len(authored_emails) > max_docs_author:
            authored_emails = authored_emails[:max_docs_author]

Next, we parse the e-mails to extract the contents. We aren't interested in the headers; the author has little control over what goes there, so it doesn't make for good data for authorship analysis. We then add those e-mail payloads to our dataset:

        contents = [p.parsestr(email)._payload for email in authored_emails]
        documents.extend(contents)

We then append a class value for this author, for each of the e-mails we added to our dataset:

        classes.extend([author_num] * len(authored_emails))

We then record the class number we used for this author and increment it:

        authors[user] = author_num
        author_num += 1

We then check if we have enough authors and, if so, we break out of the loop to return the dataset. The code is as follows:

        if author_num >= num_authors or author_num >= len(email_addresses):
            break

We then return the dataset's documents and classes, along with our author mapping. The code is as follows:

    return documents, np.array(classes), authors

Outside this function, we can now get a dataset by making the following function call. We are going to use a random state of 14 here (as always in this module), but you can try other values or set it to None to get a random set each time the function is called:

documents, classes, authors = get_enron_corpus(data_folder=enron_data_folder, random_state=14)
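
As a quick sanity check (the exact numbers depend on your copy of the corpus and on the random state), you can look at how many documents and authors were loaded:

print(len(documents), "documents loaded from", len(authors), "authors")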

If you have a look at the dataset, there is still a further preprocessing step we need to undertake. Our e-mails are quite messy, but one of the worst bits (from a data analysis perspective) is that these e-mails contain writing from other authors, in the form of quoted replies. Take the following e-mail, which is documents[100], for instance:

I am disappointed on the timing but I understand. Thanks. Mark

-----Original Message-----

From: Greenberg, Mark

Sent: Friday, September 28, 2001 4:19 PM

To: Haedicke, Mark E.

Subject: Web Site

Mark -

FYI - I have attached below a screen shot of the proposed new look and feel for the site. We have a couple of tweaks to make, but I believe this is a much cleaner look than what we have now.

This document contains another e-mail quoted at the bottom as a reply, a common e-mail pattern. The first part of the e-mail is from Mark Haedicke, while the second is a previous e-mail written to Mark Haedicke by Mark Greenberg. Only the text preceding the first instance of -----Original Message----- can be attributed to the author, and this is the only part we are actually interested in.

Extracting this information is generally not easy. E-mail is a notoriously inconsistently used format: different e-mail clients add their own headers, define replies in different ways, and generally do things however they want. It is really surprising that e-mail works at all in the current environment.

There are some commonly used patterns that we can look for. The quotequail package looks for these and can find the new part of the e-mail, discarding replies and other information.

Tip

You can install quotequail using pip: pip3 install quotequail.

We are going to write a simple function to wrap the quotequail functionality, allowing us to easily call it on all of our documents. First we import quotequail and set up the function definition:

import quotequail
def remove_replies(email_contents):

Next, we use quotequail to unwrap the e-mail, which returns a dictionary containing the different parts of the e-mail. The code is as follows:

    r = quotequail.unwrap(email_contents)

In some cases, r can be None. This happens if the e-mail couldn't be parsed. In this case, we just return the full e-mail contents. This kind of messy solution is often necessary when working with real-world datasets. The code is as follows:

    if r is None:
        return email_contents

The actual part of the e-mail we are interested in is called (by quotequail) the text_top. If this exists, we return this as our interesting part of the e-mail. The code is as follows:

    if 'text_top' in r:
        return r['text_top']

If it doesn't exist, quotequail couldn't find it. It is possible it found other text in the e-mail. If that exists, we return only that text. The code is as follows:

    elif 'text' in r:
        return r['text']

Finally, if we couldn't get a result, we just return the e-mail contents, hoping they offer some benefit to our data analysis:

    return email_contents

We can now preprocess all of our documents by running this function on each of them:

documents = [remove_replies(document) for document in documents]

Our preceding e-mail sample is greatly clarified now and contains only the text written by Mark Haedicke:

I am disappointed on the timing but I understand. Thanks. Mark

Putting it all together

We can use the existing parameter space and classifier from our previous experiments—all we need to do is refit it on our new data. By default, training in scikit-learn is done from scratch—subsequent calls to fit() will discard any previous information.

Note

There is a class of algorithms called online learning that update the training with new samples and don't restart their training each time. We will see online learning in action later in this module, including the next chapter, Chapter 10, Clustering News Articles.
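
If you are working through this section on its own and don't have the pipeline from the previous experiments, the following sketch shows roughly what it could look like. Treat it as an assumption: the character n-gram feature extraction and the parameter grid here are stand-ins, and only the 'classifier' step name and the search over C and kernel match what we inspect later in this chapter.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions
from sklearn.pipeline import Pipeline
# Assumed setup: character n-gram counts as features, with a grid-searched SVM as the classifier
parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
pipeline = Pipeline([('feature_extraction', CountVectorizer(analyzer='char', ngram_range=(3, 3))),
                     ('classifier', GridSearchCV(SVC(), parameters))])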

As before, we can compute our scores by using cross_val_score and print the results. The code is as follows:

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
scores = cross_val_score(pipeline, documents, classes, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))

The result is 0.523, which is a reasonable result for such a messy dataset. Adding more data (such as increasing max_docs_author in the dataset loading) can improve these results.
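
For example, you could reload the corpus with a higher per-author cap. The value of 500 below is only an illustration; loading more e-mails will also make training slower:

documents, classes, authors = get_enron_corpus(data_folder=enron_data_folder,
                                               max_docs_author=500,
                                               random_state=14)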

Evaluation

It is generally not a good idea to base an assessment on a single number. The f-score is more robust than metrics such as accuracy, which can be inflated by tricks that give good scores despite the classifier not being useful. As we said in the previous chapter, a spam classifier could predict everything as being spam and get over 80 percent accuracy, even though that solution is not useful at all. For that reason, it is usually worth going more in-depth on the results.
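
As a tiny illustration of this point (using made-up labels rather than this chapter's data), consider a classifier that labels everything as spam:

from sklearn.metrics import accuracy_score, f1_score
y_true = [1] * 80 + [0] * 20  # 80 percent of the messages really are spam
y_pred = [1] * 100            # predict spam for everything
print(accuracy_score(y_true, y_pred))  # 0.8, despite the classifier doing nothing useful
print(f1_score(y_true, y_pred, pos_label=0))  # 0.0 for the non-spam class

The accuracy is 80 percent, but the f-score for the non-spam class is zero, which is a much better reflection of how useless this classifier really is.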

To start with, we will look at the confusion matrix, as we did in Chapter 8, Beating CAPTCHAs with Neural Networks. Before we can do that, we need predictions for a testing set. The previous code uses cross_val_score, which doesn't actually give us a trained model we can use. So, we will need to refit one. To do that, we need training and testing subsets:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
training_documents, testing_documents, y_train, y_test = train_test_split(documents, classes, random_state=14)

Next, we fit the pipeline to our training documents and create our predictions for the testing set:

pipeline.fit(training_documents, y_train)
y_pred = pipeline.predict(testing_documents)

At this point, you might be wondering what the best combination of parameters actually was. We can extract this quite easily from our grid search object (which is the classifier step of our pipeline):

print(pipeline.named_steps['classifier'].best_params_)

The results give you all of the parameters for the classifier. However, most of the parameters are the defaults that we didn't touch. The ones we did search for were C and kernel, which were set to 1 and linear, respectively.

Now we can create a confusion matrix:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Normalize each row so values are proportions of that (actual) author's e-mails
cm = cm / cm.astype(float).sum(axis=1)[:, np.newaxis]

Next, we get our authors so that we can label the axes correctly. For this purpose, we use the authors dictionary that our Enron dataset loader returned. The code is as follows:

sorted_authors = sorted(authors.keys(), key=lambda x:authors[x])

Finally, we show the confusion matrix using matplotlib. The only change from the last chapter is the labels: we replace the letter labels with the author names from this chapter's experiments:

%matplotlib inline
from matplotlib import pyplot as plt
plt.figure(figsize=(10,10))
plt.imshow(cm, cmap='Blues')
tick_marks = np.arange(len(sorted_authors))
plt.xticks(tick_marks, sorted_authors)
plt.yticks(tick_marks, sorted_authors)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

The results are shown in the following figure:

[Figure: confusion matrix with actual authors on the y-axis and predicted authors on the x-axis]

We can see that authors are predicted correctly in most cases; there is a clear diagonal line with high values. There are some large sources of error, though (darker values are larger): for instance, e-mails from user baughman-d are often predicted as being from reitmeyer-j.
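
If you want to find the largest confusion programmatically rather than by eye, one way (just a convenience check; the pair you find will depend on your data and random state) is to zero out the diagonal and take the largest remaining cell:

errors = cm.copy()
np.fill_diagonal(errors, 0)  # remove the correct predictions, leaving only the errors
actual_idx, predicted_idx = np.unravel_index(np.argmax(errors), errors.shape)
print("{0} is most often misclassified as {1}".format(sorted_authors[actual_idx], sorted_authors[predicted_idx]))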
