This recipe is about creating fake estimators; this isn't the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build.
In this recipe, we'll perform the following tasks:
1. Create some random data.
2. Fit the various dummy estimators.
We'll perform these two steps for both regression data and classification data.
First, we'll create the random data:
>>> from sklearn.datasets import make_regression, make_classification  # classification is for later
>>> X, y = make_regression()
>>> from sklearn import dummy
>>> dumdum = dummy.DummyRegressor()
>>> dumdum.fit(X, y)
DummyRegressor(constant=None, strategy='mean')
By default, the estimator predicts the mean of the training target values:
>>> dumdum.predict(X)[:5] array([ 2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907])
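If you want to convince yourself that this constant really is the training mean, a quick check with NumPy confirms it:

>>> import numpy as np
>>> np.allclose(dumdum.predict(X), y.mean())  # every prediction equals the training mean
True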
There are two other strategies we can try. We can predict a supplied constant (refer to constant=None in the preceding output). We can also predict the median value.
The supplied constant will only be used if strategy is set to "constant".
Let's have a look:
>>> predictors = [("mean", None), ("median", None), ("constant", 10)] >>> for strategy, constant in predictors: dumdum = dummy.DummyRegressor(strategy=strategy, constant=constant) >>> dumdum.fit(X, y) >>> print "strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])) strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733 strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248strategy: constant 10.0,10.0,10.0,10.0,10.0
For classifiers, we actually have four options. These strategies are similar to the continuous case, just slanted toward classification problems: constant always predicts the supplied class, most_frequent always predicts the most common class in the training labels, stratified samples predictions from the training class distribution, and uniform picks classes uniformly at random:
>>> predictors = [("constant", 0), ("stratified", None), ("uniform", None), ("most_frequent", None)]
We'll also need to create some classification data:
>>> X, y = make_classification()
>>> for strategy, constant in predictors:
...     dumdum = dummy.DummyClassifier(strategy=strategy, constant=constant)
...     dumdum.fit(X, y)
...     print("strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])))

strategy: constant 0,0,0,0,0
strategy: stratified 1,0,0,1,0
strategy: uniform 0,0,0,1,1
strategy: most_frequent 1,1,1,1,1
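Again as a quick sanity check, the most_frequent strategy should always return the majority class from the training labels; a minimal sketch with NumPy:

>>> import numpy as np
>>> dumdum = dummy.DummyClassifier(strategy='most_frequent').fit(X, y)
>>> np.all(dumdum.predict(X) == np.bincount(y).argmax())  # always the majority class
True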
It's always good to test your models against the simplest possible models, and that's exactly what the dummy estimators give you. For example, imagine a fraud model in which only 5 percent of the dataset is fraudulent. We can probably build a model that looks quite accurate just by never predicting any fraud.
We can create this model by using the most_frequent strategy with the following command. It also gives a good example of why class imbalance causes problems:
>>> X, y = make_classification(20000, weights=[.95, .05])
>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')
>>> dumdum.fit(X, y)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
>>> from sklearn.metrics import accuracy_score
>>> print(accuracy_score(y, dumdum.predict(X)))
0.94575
We were correct nearly 95 percent of the time, but that's not the point. The point is that this is our baseline. If we cannot create a fraud model that is more accurate than this, then it isn't worth our time.
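One way to make this concrete is to score the same dummy model with a metric that cares about the minority class. A quick sketch, assuming the fraud cases are the ones labeled 1, shows that its recall on fraud is zero:

>>> from sklearn.metrics import recall_score
>>> print(recall_score(y, dumdum.predict(X)))  # recall on the fraud class
0.0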