Preselecting and processing attributes

We can certainly drop attributes that we think will not help the classifier distinguish between good and not-so-good answers. But we have to be cautious here. Although some features do not directly impact the classification, we still need to keep them:

  • The PostTypeId attribute, for example, is necessary to distinguish between questions and answers. It will not be picked to serve as a feature, but we will need it to filter the data.
  • CreationDate could be interesting to determine the time span between posting the question and posting the individual answers. In this chapter, however, we will ignore it.
  • Score is important as an indicator of the community's evaluation.
  • ViewCount, in contrast, is most likely of no use for our task. Even if it could help the classifier distinguish between good and bad answers, we would not have this information at the time an answer is submitted. So we ignore it.
  • The Body attribute obviously contains the most important information. As it is encoded HTML, we will have to decode it to plain text.
  • OwnerUserId is only useful if we take user-dependent features into account, which we won't.
  • The Title attribute is also ignored here, although it could add some more information about the question.
  • CommentCount is also ignored. Similar to ViewCount, it could help the classifier with posts that have been out there for a while (more comments = more ambiguous post?). It will, however, not help the classifier at the time an answer is posted.
  • AcceptedAnswerId is similar to Score in that it is an indicator of a post's quality. This is, however, a signal that may get stale over time. Imagine a user posts a question, receives a couple of answers, marks one of them as accepted, and forgets about it. Years later, many more users will have read the question and its answers, some of which didn't exist when the asker accepted one. So it might turn out that the highest-scored answer is not the accepted one. Since we already have the score, we will ignore the acceptance information.
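Decoding the HTML-encoded Body attribute to plain text can be sketched with the standard library's HTMLParser. This is only an illustration of the decoding step; the helper name body_to_text is an assumption, not part of the chapter's code:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def body_to_text(body_html):
    # Decode the HTML-encoded Body attribute to plain text.
    parser = TextExtractor()
    parser.feed(body_html)
    return "".join(parser.parts)


print(body_to_text("<p>Use <code>len(s)</code> for the length.</p>"))
```

HTMLParser also resolves character references such as `&lt;` by default, so entities inside the body come out as plain characters.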

Suffice it to say that, in order to speed up processing, we will use the lxml module to parse the XML file and then output two files. In one file, we will store a dictionary that maps a post's Id value to its other data (except Text) in JSON format, so that we can read it easily and keep it in memory in the meta dictionary. For example, the score of a post would reside at meta[post_id]['Score']. We will do the same for the new features that we will create throughout this chapter.
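The parse-and-split step can be sketched as follows. The chapter uses lxml; for a self-contained illustration, the stdlib xml.etree.ElementTree is used here instead, and the one-row sample, filenames, and the selection of meta fields are made up:

```python
import io
import json
import xml.etree.ElementTree as ET

# A hypothetical one-row sample in the Posts.xml attribute format.
xml_data = ('<posts>'
            '<row Id="1" PostTypeId="2" Score="3" '
            'Body="&lt;p&gt;An answer.&lt;/p&gt;"/>'
            '</posts>')

meta = {}
with open("sample.tsv", "w") as posts_file:
    for _, elem in ET.iterparse(io.StringIO(xml_data)):
        if elem.tag != "row":
            continue
        post_id = int(elem.get("Id"))
        # Everything except the text goes into the meta dictionary ...
        meta[post_id] = {"PostTypeId": int(elem.get("PostTypeId")),
                         "Score": int(elem.get("Score"))}
        # ... while Id and Body go into the tab-separated posts file.
        posts_file.write("%d\t%s\n" % (post_id, elem.get("Body")))

with open("sample-meta.json", "w") as f:
    json.dump(meta, f)
```

In a real run, the Body text would first be decoded to plain text and any tabs or newlines escaped before writing, so that each post stays on one line of the tab-separated file.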

We will then store the actual posts in another tab-separated file, where the first column is Id and the second one is Text, which we can easily read with the following method:

def fetch_posts(fn):
    for line in open(fn, "r"):
        post_id, text = line.split("\t")
        yield int(post_id), text.strip()
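A quick round trip shows the generator in action. The sketch below writes a tiny tab-separated file with made-up contents and repeats the function definition so that it runs on its own:

```python
# Repeat the generator from above so this sketch is self-contained.
def fetch_posts(fn):
    for line in open(fn, "r"):
        post_id, text = line.split("\t")
        yield int(post_id), text.strip()


# Write a tiny tab-separated sample file (contents are made up).
with open("tiny-sample.tsv", "w") as f:
    f.write("1\tWhat is a list comprehension?\n")
    f.write("2\tIt is a concise way to build lists.\n")

for post_id, text in fetch_posts("tiny-sample.tsv"):
    print(post_id, text)
```

Because fetch_posts is a generator, only one line is held in memory at a time, which matters once the real posts file grows large.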

We name the two files as follows:

>>> import os
>>> fn_sample = os.path.join('data', "sample.tsv")
>>> fn_sample_meta = os.path.join('data', "sample-meta.json")
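Reading the metadata back is then straightforward. One detail worth noting: JSON object keys are always strings, so integer post ids must be converted on load. A sketch with made-up values:

```python
import json
import os

# Hypothetical sketch: write a tiny metadata file, then read it back the way
# the meta dictionary is used in this chapter (the values are made up).
os.makedirs("data", exist_ok=True)
fn_sample_meta = os.path.join("data", "sample-meta.json")
with open(fn_sample_meta, "w") as f:
    json.dump({17: {"Score": 5}}, f)

with open(fn_sample_meta) as f:
    # JSON turns integer keys into strings, so convert them back on load.
    meta = {int(post_id): data for post_id, data in json.load(f).items()}

print(meta[17]["Score"])  # 5
```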

For the sake of brevity, please check the Jupyter notebook for the code.
