Using dummy estimators to compare results

This recipe is about creating fake estimators; this isn't the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build.

Getting ready

In this recipe, we'll perform the following tasks:

  1. Create some random data.
  2. Fit the various dummy estimators.

We'll perform these two steps for regression data and classification data.

How to do it...

First, we'll create the random data:

>>> from sklearn.datasets import make_regression, make_classification
# classification is for later

>>> X, y = make_regression()

>>> from sklearn import dummy

>>> dumdum = dummy.DummyRegressor()

>>> dumdum.fit(X, y)

DummyRegressor(constant=None, strategy='mean')

By default, the estimator predicts by taking the mean of the training targets and returning that mean for every sample:

>>> dumdum.predict(X)[:5]

array([ 2.23297907,  2.23297907,  2.23297907,  2.23297907, 
        2.23297907])
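
As a quick sanity check on the same X and y, the prediction really is just the mean of y:

>>> y.mean()
2.23297906733...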

There are two other strategies we can try. We can predict a supplied constant (refer to constant=None in the preceding output), or we can predict the median value.

Supplying a constant will only be considered if strategy is "constant".
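
For instance, on the same data, passing constant=10 while keeping the default mean strategy should leave the predictions unchanged, since the constant is only read when strategy is "constant":

>>> dumdum = dummy.DummyRegressor(constant=10).fit(X, y)

>>> dumdum.predict(X)[:2]

array([ 2.23297907,  2.23297907])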

Let's have a look:

>>> predictors = [("mean", None),
                  ("median", None),
                  ("constant", 10)]

>>> for strategy, constant in predictors:
       dumdum = dummy.DummyRegressor(strategy=strategy, 
                constant=constant)
       dumdum.fit(X, y)
       print "strategy: {}".format(strategy), ",".join(map(str, 
             dumdum.predict(X)[:5]))

strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733
strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248
strategy: constant 10.0,10.0,10.0,10.0,10.0

We actually have four options for classifiers. These strategies are similar to those in the continuous case, just slanted toward classification problems:

>>> predictors = [("constant", 0),
                  ("stratified", None),
                  ("uniform", None),
                  ("most_frequent", None)]

We'll also need to create some classification data:

>>> X, y = make_classification()


>>> for strategy, constant in predictors:
       dumdum = dummy.DummyClassifier(strategy=strategy, 
                constant=constant)
       dumdum.fit(X, y)
       print "strategy: {}".format(strategy), ",".join(map(str, 
             dumdum.predict(X)[:5]))

strategy: constant 0,0,0,0,0
strategy: stratified 1,0,0,1,0
strategy: uniform 0,0,0,1,1
strategy: most_frequent 1,1,1,1,1
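
Note that most_frequent always predicts the majority class of the training set. With make_classification's default balanced weights, the counts are close to even, so which class wins depends on the random draw; you can check the counts with numpy (exact numbers will vary per run):

>>> import numpy as np

>>> np.bincount(y)    # counts of class 0 and class 1, e.g., array([49, 51])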

How it works...

It's always good to test your models against the simplest possible models, and that's exactly what the dummy estimators give you. For example, imagine a fraud model where only 5 percent of the dataset is fraudulent. We could build a seemingly accurate model just by never predicting fraud.

We can create this model by using the most_frequent strategy, with the following commands. This also gives us a good example of why class imbalance causes problems:

>>> X, y = make_classification(20000, weights=[.95, .05])

>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')

>>> dumdum.fit(X, y)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

>>> from sklearn.metrics import accuracy_score

>>> print accuracy_score(y, dumdum.predict(X))

0.94575

We were actually correct very often, but that's not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn't worth our time.
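
As a hypothetical illustration (LogisticRegression here is just a stand-in for whatever model you would actually build), the number to beat on this imbalanced data is the roughly 0.95 dummy baseline, not 0.5:

>>> from sklearn.linear_model import LogisticRegression

>>> clf = LogisticRegression().fit(X, y)

>>> print accuracy_score(y, clf.predict(X))

Whatever this prints for your random draw, the model is only earning its keep if it clearly beats the dummy's score.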
