In this recipe, we're going to show how you can keep your model around for later use. For example, you might want to use a model to predict an outcome and automatically make a decision.
In this recipe, we will perform the following tasks:
To persist models with joblib, the following code can be used:
>>> from sklearn import datasets, tree
>>> X, y = datasets.make_classification()
>>> dt = tree.DecisionTreeClassifier()
>>> dt.fit(X, y)
DecisionTreeClassifier(compute_importances=None, criterion='gini',
                       max_depth=None, max_features=None,
                       max_leaf_nodes=None, min_density=None,
                       min_samples_leaf=1, min_samples_split=2,
                       random_state=None, splitter='best')
>>> from sklearn.externals import joblib
>>> joblib.dump(dt, "dtree.clf")
['dtree.clf',
 'dtree.clf_01.npy',
 'dtree.clf_02.npy',
 'dtree.clf_03.npy',
 'dtree.clf_04.npy']
The preceding code works by saving the state of the object so that it can later be reloaded into a scikit-learn object. It's important to note that the state of a model will have varying levels of complexity, depending on the model type.
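As a minimal sketch of the reload step, the following round trip dumps a fitted tree and loads it back. It assumes a newer environment in which joblib is installed as a standalone package and imported directly, rather than through `sklearn.externals` as in the listing above. The temporary directory and the `random_state` values are illustrative choices, not part of the original recipe.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib package is available
from sklearn import datasets, tree

# Fixed seeds are an illustrative choice so the run is repeatable.
X, y = datasets.make_classification(random_state=0)
dt = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dtree.clf")
    joblib.dump(dt, path)          # write the fitted state to disk
    restored = joblib.load(path)   # rebuild an equivalent estimator

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == dt.predict(X)).all())
print(same_predictions)
```

A persisted model is generally only safe to reload with the same library versions that wrote it, so it is worth recording those versions alongside the file.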
For simplicity's sake, consider that all we need to save is a way to predict the outcome for given inputs. For regression, that is easy: a little matrix algebra and we're done. However, for models such as random forest, where we could have many trees, and those trees could be of various complexity levels, persisting the model is more involved.
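To illustrate the "little matrix algebra" remark, here is a sketch that assumes an ordinary least squares model (`LinearRegression`) fitted on synthetic data. The dataset shape and the variable names are hypothetical. The point is that the coefficient vector and intercept are all the state required to reproduce predictions.

```python
import numpy as np
from sklearn import datasets, linear_model

# Synthetic regression data; sizes and seed are illustrative only.
X_reg, y_reg = datasets.make_regression(n_samples=100, n_features=5,
                                        random_state=0)
lr = linear_model.LinearRegression().fit(X_reg, y_reg)

# Everything needed for prediction: one coefficient vector and one intercept.
coef, intercept = lr.coef_, lr.intercept_

# Reproduce the model's predictions with plain matrix algebra.
manual_predictions = X_reg @ coef + intercept
matches = bool(np.allclose(manual_predictions, lr.predict(X_reg)))
print(coef.shape, matches)
```

A forest has no such compact form: each tree carries its own node arrays, which is why its persisted state is so much larger.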
We can check the size of the decision tree versus the random forest:
>>> from sklearn import ensemble
>>> rf = ensemble.RandomForestClassifier()
>>> rf.fit(X, y)
RandomForestClassifier(bootstrap=True, compute_importances=None,
                       criterion='gini', max_depth=None,
                       max_features='auto', max_leaf_nodes=None,
                       min_density=None, min_samples_leaf=1,
                       min_samples_split=2, n_estimators=10, n_jobs=1,
                       oob_score=False, random_state=None, verbose=0)
I'm going to omit the output, but in total, there were 52 files written on my machine:
>>> joblib.dump(rf, "rf.clf")
['rf.clf',
 'rf.clf_01.npy',
 'rf.clf_02.npy',
 'rf.clf_03.npy',
 'rf.clf_04.npy',
 'rf.clf_05.npy',
 'rf.clf_06.npy',…]
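The number of files written depends on the joblib version, so another way to compare the two models is by serialized size. The sketch below uses the standard library's `pickle` as a stand-in for joblib, which is a swapped-in technique and not what the recipe itself does. `n_estimators=10` is set explicitly to match the forest shown above, and the seeds are illustrative.

```python
import pickle

from sklearn import datasets, ensemble, tree

X, y = datasets.make_classification(random_state=0)
dt = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
# n_estimators=10 mirrors the forest in the recipe's output.
rf = ensemble.RandomForestClassifier(n_estimators=10,
                                     random_state=0).fit(X, y)

# Size in bytes of each model's serialized state.
dt_bytes = len(pickle.dumps(dt))
rf_bytes = len(pickle.dumps(rf))

# Total number of tree nodes stored across the whole forest.
forest_nodes = sum(est.tree_.node_count for est in rf.estimators_)

print("decision tree:", dt_bytes, "bytes,", dt.tree_.node_count, "nodes")
print("random forest:", rf_bytes, "bytes,", forest_nodes, "nodes")
```

The exact byte counts vary with the data and library versions, but the forest stores one full tree structure per estimator, so its state grows with `n_estimators`.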