Engineering the features

As mentioned earlier, we will use the Text and Score features to train our classifier. The problem with Text is that a classifier cannot work directly with strings; we will have to convert the text into one or more numbers. So, what statistics could be useful to extract from a post? Let's start with the number of HTML links, assuming that good posts have a higher chance of containing links.

We can do this with regular expressions. The following captures all HTML link tags that start with http:// (ignoring the other protocols for now):

import re

link_match = re.compile('<a href="http://.*?".*?>(.*?)</a>',
                        re.MULTILINE | re.DOTALL)
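
To see what this pattern captures, here is a quick check on a made-up snippet (the example string is our own, not taken from the data):

example = ('Go to <a href="http://docs.python.org">the docs</a> or '
           '<pre><a href="http://example.com">this link in code</a></pre>')
print(link_match.findall(example))
# ['the docs', 'this link in code'] -- links inside code are captured as well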

However, we do not want to count links that are part of a code block. If, for example, a post explains the usage of the requests Python module, it would most likely also contain URLs. This means that we have to iterate over all code blocks, count the links in there, and subtract them later on from the total link count. This can be done by another regular expression that matches the <pre> tag, which is used on the StackExchange sites to mark up code:

code_match = re.compile('<pre>(.*?)</pre>',
                        re.MULTILINE | re.DOTALL)

def extract_features_from_body(s):
    # count links inside code blocks so that we can subtract them later
    link_count_in_code = 0
    for match_str in code_match.findall(s):
        link_count_in_code += len(link_match.findall(match_str))

    return len(link_match.findall(s)) - link_count_in_code
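
Applied to a similar hand-crafted snippet (again our own example), the function now ignores the link inside the code block:

body = ('See <a href="http://requests.readthedocs.io">requests</a> for details. '
        '<pre><a href="http://example.com">url</a></pre>')
print(extract_features_from_body(body))  # 1 -- only the link outside <pre> counts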
For production systems, we would not want to parse HTML content with regular expressions. Instead, we should rely on excellent libraries such as BeautifulSoup, which does a marvelous job of robustly handling all the weird things that typically occur in everyday HTML.
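
A minimal sketch of that alternative could look as follows, assuming the bs4 package is installed; the helper name count_links_outside_code is our own:

from bs4 import BeautifulSoup

def count_links_outside_code(html):
    soup = BeautifulSoup(html, 'html.parser')
    # remove all code blocks first so that their links are never counted
    for pre in soup.find_all('pre'):
        pre.decompose()
    return len([a for a in soup.find_all('a')
                if a.get('href', '').startswith('http://')])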

With this in place, we can generate one feature per answer and store it in meta, as sketched below.
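
For illustration, storing the feature could look like the following; meta and all_answers come from the earlier data-loading code, while fetch_body() is a hypothetical stand-in for however the raw HTML of an answer is retrieved:

for aid in all_answers:
    # fetch_body() is hypothetical -- replace it with your actual data access
    meta[aid]['LinkCount'] = extract_features_from_body(fetch_body(aid))

But before we train the classifier, let's have a look at what we will train it with. We can get a first impression from the frequency distribution of our new feature by plotting the percentage of how often each value occurs in the data: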

import numpy as np
import matplotlib.pyplot as plt

X = np.asarray([[meta[aid]['LinkCount']] for aid in all_answers])

plt.figure(figsize=(5, 4), dpi=300)
plt.title('LinkCount')
plt.xlabel('Value')
plt.ylabel('Occurrence')

# one integer-wide bin per value; density=True replaces the normed
# parameter that was removed in recent matplotlib versions
n, bins, patches = plt.hist(X, density=True,
                            bins=range(min(X.ravel()), max(X.ravel()) + 2),
                            alpha=0.75)

plt.grid(True)
plt.show()

Refer to the following graph:

[Figure: histogram of the LinkCount feature, with Value on the x axis and Occurrence on the y axis; almost all of the mass sits at zero links]

With the majority of posts having no link at all, we now know that this feature alone will not make a good classifier. Let's try it out anyway to get a first estimate of where we are.
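
One quick way to get that estimate is to cross-validate a simple classifier on this single feature. The following is only a sketch under assumptions not stated in this section: scikit-learn as the library, k-nearest neighbors as the model, and a label vector Y marking the good answers, which is presumed to have been built alongside meta:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Y (one 0/1 label per answer) is assumed to exist; it is not defined here
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, Y, cv=5)
print('Mean accuracy: %.3f' % scores.mean())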
