Applying logistic regression to our post-classification problem

Admittedly, the example in the previous section was created to show the beauty of logistic regression. How does it perform on the real noisy data?

Comparing it to our best nearest-neighbor classifier (k=40) as a baseline, we see that switching models does not change the situation a whole lot:

Method           mean(scores)   stddev(scores)
LogReg C=0.001   0.6369         0.0097
LogReg C=0.01    0.6390         0.0109
LogReg C=0.1     0.6382         0.0097
LogReg C=1.00    0.6380         0.0099
LogReg C=10.00   0.6380         0.0097
40NN             0.6425         0.0104
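
A minimal sketch of how such a comparison could be run with scikit-learn; X and y are assumed to be the feature matrix and post labels built in the earlier sections:

    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    # X, y are assumed: the feature matrix and labels from earlier sections.
    candidates = [('LogReg C=%g' % C, LogisticRegression(C=C))
                  for C in (0.001, 0.01, 0.1, 1.0, 10.0)]
    candidates.append(('40NN', KNeighborsClassifier(n_neighbors=40)))

    for name, clf in candidates:
        # 10-fold cross-validation; report mean and standard deviation
        # of the accuracy, as in the table above.
        scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
        print('%-16s %.4f  %.4f' % (name, scores.mean(), scores.std()))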

We have shown the accuracy for different values of the regularization parameter C. With it, we can control the model complexity, similar to the parameter k for the nearest-neighbor method. Smaller values of C penalize model complexity more heavily.
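
For reference, scikit-learn's LogisticRegression with its default L2 penalty minimizes the following objective (with labels y_i in {-1, 1}); C multiplies the data term, so a smaller C gives the complexity penalty (the first term) more relative weight:

\min_{w,b}\ \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (w^\top x_i + b)}\right)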

A quick look at the bias-variance chart for one of our best candidates, C=0.01, shows that our model has high bias: the test and train error curves approach each other closely, but both stay at unacceptably high values. This indicates that logistic regression with the current feature space is underfitting and cannot learn a model that captures the data correctly.
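A chart like this can be produced with scikit-learn's learning_curve helper, which trains the model on increasing fractions of the data; again, X and y are assumed from earlier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    # Cross-validated train/test accuracy for growing training set sizes.
    sizes, train_scores, test_scores = learning_curve(
        LogisticRegression(C=0.01), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=10, scoring='accuracy')

    for n, tr, te in zip(sizes, train_scores.mean(axis=1),
                         test_scores.mean(axis=1)):
        # High bias shows up as both curves converging at a low accuracy.
        print('%5d instances: train=%.3f test=%.3f' % (n, tr, te))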

So, what now? We switched the model and tuned it as much as we could with our current state of knowledge, but we still have no acceptable classifier. The only thing we gained by switching is that we now have a model that scales with the data, since it doesn't need to store all the instances.

More and more, it seems that either the data is too noisy for this task or that our set of features is still not discriminative enough to separate the classes properly.
