Defining what a good answer is

Before we can train a classifier to distinguish between good and bad answers, we have to create the training data. So far, we only have raw, unlabeled data; we still need to define the labels.

Of course, we could simply take the best and worst-scoring answer per question as positive and negative examples. However, what do we do with questions that have only good answers, say, one with two and the other with four points? Should we really take the answer with two points as a negative example just because it happened to be the one with the lower score? Or let's say that we have only two negative answers, one with a score of -2 and the other with -4. Clearly, we cannot take the answer with -2 as a positive example.

We will therefore look for questions that have at least one answer with a score higher than 0 and at least one answer with a negative score, and throw away those that don't fit this criterion. If we took all the remaining data, we would have to wait quite some time at every step, so we filter down further to 10,000 questions. From each of those, we will pick the highest-scoring answer as the positive example and the lowest-scoring answer as the negative one, which results in 20,000 answers for our training set.
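The selection rule just described can be sketched as follows. This is only a minimal illustration, not the notebook's actual code: the helper name select_extreme_answers is hypothetical, and it assumes that meta maps post IDs to dictionaries with 'ParentId' and 'Score' entries, with questions marked by a ParentId of -1. How the 10,000 questions are chosen is not specified here, so the sketch simply takes the first ones in ID order.

```python
from collections import defaultdict


def select_extreme_answers(meta, max_questions=10000):
    """Return the IDs of the best and worst answer for each usable question.

    Hypothetical helper: assumes meta[post_id] has 'ParentId' and 'Score',
    and that questions carry ParentId == -1.
    """
    # Group answer IDs by the question they belong to.
    answers_by_question = defaultdict(list)
    for post_id, features in meta.items():
        if features['ParentId'] != -1:
            answers_by_question[features['ParentId']].append(post_id)

    selected = []
    kept = 0
    # Assumption: keep the first max_questions usable questions in ID order.
    for question_id in sorted(answers_by_question):
        answer_ids = answers_by_question[question_id]
        scores = [meta[aid]['Score'] for aid in answer_ids]
        # Keep only questions with at least one positive and one negative answer.
        if max(scores) > 0 and min(scores) < 0:
            best = max(answer_ids, key=lambda aid: meta[aid]['Score'])
            worst = min(answer_ids, key=lambda aid: meta[aid]['Score'])
            selected.extend([best, worst])
            kept += 1
            if kept == max_questions:
                break
    return selected
```

A question whose answers score 2 and 4, or -2 and -4, contributes nothing under this rule, which is exactly the behavior motivated by the two counterexamples above.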

As mentioned earlier, throughout this chapter (and in the Jupyter notebook), we will maintain a meta dictionary, which maps the answer IDs to the features, of which score is one (we will design more features along the way). Therefore, we can create our labels as follows:

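>>> import numpy as np
>>> # answers are the posts whose ParentId is not -1
>>> all_answers = [a for a, v in meta.items() if v['ParentId'] != -1]
>>> # an answer is labeled positive (True) if its score is above 0
>>> Y = np.asarray([meta[aid]['Score'] > 0 for aid in all_answers])
>>> print(np.unique(Y, return_counts=True))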
(array([False, True], dtype=bool), array([10000, 10000], dtype=int64))