Persisting models with joblib

In this recipe, we'll show how to keep a fitted model around for later use. For example, you might want to use a model to predict an outcome and automatically make a decision.

Getting ready

In this recipe, we will perform the following tasks:

  1. Fit the model that we will persist.
  2. Import joblib and save the model.

How to do it...

To persist models with joblib, the following code can be used:

>>> from sklearn import datasets, tree

>>> X, y = datasets.make_classification()
>>> dt = tree.DecisionTreeClassifier()
>>> dt.fit(X, y)

DecisionTreeClassifier(compute_importances=None, criterion='gini', 
                       max_depth=None, max_features=None, 
                       max_leaf_nodes=None, min_density=None, 
                       min_samples_leaf=1, min_samples_split=2, 
                       random_state=None, splitter='best')

>>> import joblib  # older scikit-learn releases exposed this as sklearn.externals.joblib (removed in 0.23)

>>> joblib.dump(dt, "dtree.clf")

['dtree.clf',
 'dtree.clf_01.npy',
 'dtree.clf_02.npy',
 'dtree.clf_03.npy',
 'dtree.clf_04.npy']

How it works...

The preceding code works by saving the state of the object so that it can later be reloaded into a scikit-learn object. It's important to note that the state of a model varies in complexity depending on the model type.
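The recipe saves the model but does not show the reload step. A minimal sketch of the round trip follows, assuming the standalone joblib package is installed. The temporary directory and the fixed random_state values are illustrative choices, not part of the original recipe.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets, tree

# Fit the same kind of model as in the recipe (random_state fixed only
# so that the sketch is reproducible).
X, y = datasets.make_classification(random_state=0)
dt = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Save to a throwaway directory, then load the model back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "dtree.clf")
    joblib.dump(dt, path)
    restored = joblib.load(path)

# The restored object is a fitted classifier that predicts like the original.
same_predictions = np.array_equal(dt.predict(X), restored.predict(X))
print(same_predictions)
```

Only load files that you trust: joblib relies on pickle, so loading a file can execute arbitrary code.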

For simplicity's sake, consider that all we need to save is a way to predict the outcome for given inputs. For regression, that is easy: store the coefficients, and a little matrix algebra does the rest. However, for models such as a random forest, which can contain many trees of varying complexity, there is much more state to save.
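To make the regression case concrete, the following sketch shows that the predictions of a fitted linear regression can be rebuilt from its coefficients and intercept alone. The dataset and the names X_reg, y_reg, and lr are illustrative assumptions, not part of the original recipe.

```python
import numpy as np
from sklearn import datasets, linear_model

# Illustrative regression problem (sizes chosen arbitrarily).
X_reg, y_reg = datasets.make_regression(n_samples=100, n_features=5,
                                        random_state=0)
lr = linear_model.LinearRegression().fit(X_reg, y_reg)

# The whole predictive state is one coefficient per feature plus an intercept.
n_parameters = lr.coef_.size + 1

# Rebuild the predictions by hand with a matrix-vector product.
manual = X_reg @ lr.coef_ + lr.intercept_
matches = np.allclose(manual, lr.predict(X_reg))
print(n_parameters, matches)
```

A random forest has no such compact form, because every tree carries its own set of split nodes and leaf values.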

There's more...

We can compare the size of a persisted decision tree with that of a random forest:

>>> from sklearn import ensemble

>>> rf = ensemble.RandomForestClassifier()
>>> rf.fit(X, y)

RandomForestClassifier(bootstrap=True, compute_importances=None, 
                       criterion='gini', max_depth=None, 
                       max_features='auto', max_leaf_nodes=None, 
                       min_density=None, min_samples_leaf=1, 
                       min_samples_split=2, n_estimators=10, 
                       n_jobs=1, oob_score=False, random_state=None, 
                       verbose=0)

I'm going to omit the full output, but in total, 52 files were written on my machine:

>>> joblib.dump(rf, "rf.clf")
['rf.clf',
 'rf.clf_01.npy',
 'rf.clf_02.npy',
 'rf.clf_03.npy',
 'rf.clf_04.npy',
 'rf.clf_05.npy',
 'rf.clf_06.npy',…]
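Counting output files depends on the joblib version, so a more direct comparison is the number of bytes written. The sketch below sums the sizes of whatever files joblib.dump reports. The helper dumped_size is hypothetical, and n_estimators=10 is set explicitly to match the forest shown above.

```python
import os
import tempfile

import joblib
from sklearn import datasets, ensemble, tree

X, y = datasets.make_classification(random_state=0)
dt = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
rf = ensemble.RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)


def dumped_size(model, path):
    """Hypothetical helper: bytes written by joblib.dump for one model."""
    filenames = joblib.dump(model, path)  # returns the list of files written
    return sum(os.path.getsize(name) for name in filenames)


with tempfile.TemporaryDirectory() as tmpdir:
    dt_size = dumped_size(dt, os.path.join(tmpdir, "dtree.clf"))
    rf_size = dumped_size(rf, os.path.join(tmpdir, "rf.clf"))

# Node counts give a rough idea of how much state each model carries.
dt_nodes = dt.tree_.node_count
rf_nodes = sum(est.tree_.node_count for est in rf.estimators_)

print("decision tree:", dt_size, "bytes,", dt_nodes, "nodes")
print("random forest:", rf_size, "bytes,", rf_nodes, "nodes")
```

The exact numbers vary with the data, the library versions, and the random seed, so none are quoted here.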