Designing more features

In addition to using the number of hyperlinks as a proxy for a post's quality, the number of lines of code is possibly another good one. At the very least, it indicates that the post's author is interested in answering the question. We can find the code embedded in <pre>...</pre> tags. Once we have extracted it, we count the number of normal words in the post while ignoring the code:

import re

# we will use regular expressions to remove HTML tags and to
# extract source code blocks and hyperlinks
code_match = re.compile('<pre>(.*?)</pre>',
                        re.MULTILINE | re.DOTALL)
link_match = re.compile('<a href="http://.*?".*?>(.*?)</a>',
                        re.MULTILINE | re.DOTALL)
tag_match = re.compile('<[^>]*>', re.MULTILINE | re.DOTALL)
whitespace_match = re.compile(r'\s+', re.MULTILINE | re.DOTALL)

def extract_features_from_body(s):
    num_code_lines = 0
    link_count_in_code = 0

    # remove source code and count how many lines of code the post has
    code_free_s = s
    for match_str in code_match.findall(s):
        num_code_lines += match_str.count('\n')

        # Sometimes source code contains links, which we don't want to
        # count
        link_count_in_code += len(link_match.findall(match_str))

    code_free_s = code_match.sub(' ', code_free_s)

    links = link_match.findall(s)
    link_count = len(links) - link_count_in_code

    html_free_s = tag_match.sub(' ', code_free_s)

    # remove links whose anchor text is a bare URL before counting words
    text = html_free_s
    for link in links:
        if link.lower().startswith('http://'):
            text = text.replace(link, ' ')

    # after collapsing whitespace, the number of spaces approximates
    # the number of words
    text = whitespace_match.sub(' ', text)
    num_text_tokens = text.count(' ')

    return num_text_tokens, num_code_lines, link_count
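
As a quick sanity check, we can push a small, made-up post body through the function (the HTML snippet is purely illustrative):

sample = ('<p>How do I read a file?</p>'
          '<pre>with open("data.txt") as f:\n'
          '    print(f.read())\n</pre>'
          '<a href="http://docs.python.org/">docs</a>')

print(extract_features_from_body(sample))
# (num_text_tokens, num_code_lines, link_count) -> (8, 2, 1)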

Looking at the value distributions of these features, we notice that at least the number of words in a post shows higher variability:

Since we have multiple features, we standardize their values:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X, Y, and the cross-validation splitter cv are defined as before
scores = []
for train, test in cv.split(X, Y):
    clf = make_pipeline(StandardScaler(), KNeighborsClassifier())
    clf.fit(X[train], Y[train])
    scores.append(clf.score(X[test], Y[test]))

print("Mean(scores)=%.5f Stddev(scores)=%.5f" %
      (np.mean(scores), np.std(scores)))

Training on the bigger feature space improves accuracy quite a bit:

Mean(scores)=0.60070 Stddev(scores)=0.00759

Still, this means that we would misclassify roughly four out of ten posts. At least we are going in the right direction: more features lead to higher accuracy. So let's extend the feature space with even more features:

  • AvgSentLen: This measures the average number of words in a sentence. Maybe there is a pattern that particularly good posts don't overload the reader's brain with overly long sentences
  • AvgWordLen: Similar to AvgSentLen, this feature measures the average number of characters in the words of a post
  • NumAllCaps: This measures the number of words that are written in uppercase, which is considered poor style
  • NumExclams: This measures the number of exclamation marks

We will use NLTK to conveniently determine sentence and word boundaries, calculate the features, and immediately attach them to the meta dictionary that already holds the other features:

import nltk

# the tokenizers require NLTK's punkt models; if they are not
# installed yet, run nltk.download('punkt') once

def add_sentence_features(m):
    for pid, text in fetch_posts(fn_sample):
        if not text:
            for feat in ['AvgSentLen', 'AvgWordLen',
                         'NumAllCaps', 'NumExclams']:
                m[pid][feat] = 0
        else:
            sent_lens = [len(nltk.word_tokenize(sent))
                         for sent in nltk.sent_tokenize(text)]
            m[pid]['AvgSentLen'] = np.mean(sent_lens)
            text_tokens = nltk.word_tokenize(text)
            m[pid]['AvgWordLen'] = np.mean([len(w) for w in text_tokens])
            m[pid]['NumAllCaps'] = np.sum([word.isupper()
                                           for word in text_tokens])
            m[pid]['NumExclams'] = text.count('!')

add_sentence_features(meta)

The following charts show the value distributions for average sentence and word lengths, as well as the number of uppercase words and exclamation marks:

With these four additional features, we now have seven features representing the individual posts. Let's see how we progress:

Mean(scores)=0.60225 Stddev(scores)=0.00729  

Now, that's interesting. We added four more features and didn't get anything in return. How can that be?

To understand this, we have to remind ourselves how kNN works. Our 5NN classifier determines the class of a new post by calculating the seven aforementioned features—LinkCount, NumTextTokens, NumCodeLines, AvgSentLen, AvgWordLen, NumAllCaps, and NumExclams—and then finds the five nearest other posts. The new post's class is then the majority of the classes of those nearest posts. The nearest posts are determined by calculating the Euclidean distance (as we did not specify it, the classifier was initialized with the default p=2, which is the parameter in the Minkowski distance). That means that all seven features are treated similarly. 
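
To see what "treated similarly" means in numbers, here is a minimal sketch with made-up, already standardized feature values: under the Minkowski distance with p=2, an uninformative feature such as NumExclams shifts the distance just as strongly as an informative one:

import numpy as np

# hypothetical, already standardized feature vectors, in the order
# LinkCount, NumTextTokens, NumCodeLines, AvgSentLen,
# AvgWordLen, NumAllCaps, NumExclams
new_post   = np.array([1.2, -0.3, 0.8, 0.1, -0.5, 0.0,  0.9])
other_post = np.array([1.1, -0.2, 0.7, 0.2, -0.4, 0.1, -1.1])

# Minkowski distance with the default p=2 is the Euclidean distance
dist = np.sqrt(np.sum((new_post - other_post) ** 2))
print(dist)  # ~2.01, almost entirely due to the NumExclams difference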
