Scikit-learn is a versatile machine learning library for Python, and we use it extensively in this book. All the recipes were written against scikit-learn version 0.15.2. From a Python prompt, you can read the __version__ attribute to check which version is installed:
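A minimal sketch of that check, assuming scikit-learn can be imported in the current environment. The printed value depends on your installation and will differ from 0.15.2 on newer setups.

```python
import sklearn

# __version__ is a plain string such as '0.15.2'
installed_version = sklearn.__version__
print(installed_version)
```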
In this recipe, we will demonstrate some of the capabilities of scikit-learn and learn how its API is organized so that we can follow it seamlessly in our future recipes.
Scikit-learn provides us with several inbuilt datasets. Let's see how to access these datasets and use them:
#Recipe_3a.py
from sklearn.datasets import (load_iris, load_boston, make_classification,
                              make_circles, make_moons)

# Iris dataset
data = load_iris()
x = data['data']
y = data['target']
y_labels = data['target_names']
x_labels = data['feature_names']

print
print x.shape
print y.shape
print x_labels
print y_labels

# Boston dataset
data = load_boston()
x = data['data']
y = data['target']
x_labels = data['feature_names']

print
print x.shape
print y.shape
print x_labels

# make some classification dataset
x,y = make_classification(n_samples=50,n_features=5, n_classes=2)

print
print x.shape
print y.shape

print x[1,:]
print y[1]

# Some non linear dataset
x,y = make_circles()
import numpy as np
import matplotlib.pyplot as plt
plt.close('all')
plt.figure(1)
plt.scatter(x[:,0],x[:,1],c=y)

x,y = make_moons()
plt.figure(2)
plt.scatter(x[:,0],x[:,1],c=y)

plt.show()
Let's now see how we can invoke some of the machine learning functionality in scikit-learn:
#Recipe_3b.py
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Data Preprocessing routines
x = np.asmatrix([[1,2],[2,4]])
poly = PolynomialFeatures(degree = 2)
poly.fit(x)
x_poly = poly.transform(x)

print "Original x variable shape",x.shape
print x
print
print "Transformed x variables",x_poly.shape
print x_poly

#alternatively
x_poly = poly.fit_transform(x)

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']

estimator = DecisionTreeClassifier()
estimator.fit(x,y)
predicted_y = estimator.predict(x)
predicted_y_prob = estimator.predict_proba(x)
predicted_y_lprob = estimator.predict_log_proba(x)

from sklearn.pipeline import Pipeline

# The polynomial degree is set with the degree parameter
poly = PolynomialFeatures(degree=3)
tree_estimator = DecisionTreeClassifier()

steps = [('poly',poly),('tree',tree_estimator)]
estimator = Pipeline(steps=steps)
estimator.fit(x,y)
predicted_y = estimator.predict(x)
Let's start by importing, from the datasets module of scikit-learn, the functions that give us access to the inbuilt datasets:
from sklearn.datasets import load_iris,load_boston,make_classification
The first dataset that we will look at is the iris dataset. Refer to https://en.wikipedia.org/wiki/Iris_flower_data_set for more information.
Introduced by Sir Ronald Fisher, this is a classic dataset for classification problems:
data = load_iris()
x = data['data']
y = data['target']
y_labels = data['target_names']
x_labels = data['feature_names']
The load_iris function, when invoked, returns a dictionary-like object. The predictor x, response variable y, response variable names, and feature names can be extracted by querying this object with the appropriate keys.
Let's proceed to print them and see their values:
print
print x.shape
print y.shape
print x_labels
print y_labels
As you can see, our predictors have 150 instances and four attributes. Our response variable has 150 entries, one class label for each row in our predictor set. We then print the attribute names (sepal and petal length and width) and, finally, the class labels. We will use this dataset extensively in most of our future recipes.
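As a small sketch that builds on the keys described above, the following counts how many instances belong to each class and maps the first few integer labels back to their names. The use of np.bincount is our own choice for illustration, not part of the recipe.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
x, y = data['data'], data['target']

# One row per flower, one column per measurement
print(x.shape)

# Number of instances in each of the three classes
class_counts = np.bincount(y)
print(class_counts)

# Map the first few integer labels back to species names
first_names = [str(data['target_names'][label]) for label in y[:3]]
print(first_names)
```

The three classes are balanced, with 50 instances each, and the rows are stored grouped by class, which is why the first labels all map to the same species.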
Let's proceed to inspect another inbuilt dataset, the Boston housing dataset, which is used for regression problems:
# Boston dataset
data = load_boston()
x = data['data']
y = data['target']
x_labels = data['feature_names']
The data is loaded in much the same way as iris, and the various components of the data, including the predictors and response variable, are queried using the respective keys from the dictionary. Let's print these variables in order to inspect them:
As you can see, our predictor set x has 506 instances and 13 attributes. Our response variable has 506 entries. Finally, we will also print out the names of our attributes.
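Note that load_boston was removed from scikit-learn in release 1.2, so the Boston listing runs only on older versions such as the one used in this book. As a sketch of the same access pattern on a current installation, the following uses load_diabetes, another inbuilt regression dataset. It is a stand-in chosen for illustration, not the dataset discussed in this recipe.

```python
from sklearn.datasets import load_diabetes

# Stand-in regression dataset (assumption: used here only because
# load_boston is unavailable in recent scikit-learn releases)
data = load_diabetes()
x = data['data']
y = data['target']

print(x.shape)  # predictors: instances by attributes
print(y.shape)  # one continuous response value per instance
```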
Scikit-learn also provides us with functions that will help us produce a random classification dataset with some desired properties:
# make some classification dataset
x,y = make_classification(n_samples=50,n_features=5, n_classes=2)
The make_classification function generates a random classification dataset. In our example, we generated a dataset with 50 instances (set by the n_samples parameter), five attributes (the n_features parameter), and two classes (the n_classes parameter). Let's inspect the output of this function:
print x.shape
print y.shape

print x[1,:]
print y[1]
As you can see, our predictor x has 50 instances with five features. Our response variable has 50 entries, with a class label for each of the instances.
We also print the second record in our predictor set, x. You can see that it is a vector of dimension 5, corresponding to the five features that we requested. Finally, we print the matching entry of the response variable, y. In our run, the class label for the second row of our predictors is 1; because the data is generated randomly, your values may differ.
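Because the generated data changes from run to run, it can help to fix the random seed. The sketch below assumes the random_state parameter of make_classification; the seed value 7 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state fixes the random number generator so that repeated
# calls return the same dataset
x, y = make_classification(n_samples=50, n_features=5, n_classes=2,
                           random_state=7)
x_again, y_again = make_classification(n_samples=50, n_features=5,
                                       n_classes=2, random_state=7)

print(x.shape)
print(y.shape)
print(np.unique(y))                # the distinct class labels
print(np.array_equal(x, x_again))  # same seed, same data
```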
Scikit-learn also provides us with functions that generate data with nonlinear relationships:
# Some non linear dataset
x,y = make_circles()
import numpy as np
import matplotlib.pyplot as plt
plt.close('all')
plt.figure(1)
plt.scatter(x[:,0],x[:,1],c=y)
You should be familiar with pyplot now from the previous recipe. Let's see our plot first to understand the nonlinear relationship:
As you can see, our dataset forms two concentric circles. Our x is a dataset with two variables, and the variable y is the class label. As shown by the concentric circles, the relationship between our predictor variables is nonlinear.
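One way to see the nonlinearity without a plot is to look at each point's distance from the origin. The sketch below assumes the default arguments of make_circles (no noise, with the inner circle scaled by the factor parameter); the variable names are ours.

```python
import numpy as np
from sklearn.datasets import make_circles

x, y = make_circles()

# Distance of every point from the origin
radius = np.sqrt(x[:, 0] ** 2 + x[:, 1] ** 2)

# With no noise, each class sits at a single radius
outer_radius = radius[y == 0].mean()
inner_radius = radius[y == 1].mean()
print(outer_radius)
print(inner_radius)
```

No straight line in the two original variables separates the classes, but a single threshold on the derived radius does. This is the kind of structure that a linear model on x alone cannot capture.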
Another interesting function that produces a nonlinear relationship is make_moons from scikit-learn:
x,y = make_moons()
plt.figure(2)
plt.scatter(x[:,0],x[:,1],c=y)
Let's look at its plot in order to understand the nonlinear relationship:
The crescent-shaped plot shows that the attributes in our predictor set x are nonlinearly related to each other.
Let's switch gears to understand the API structure of scikit-learn. One of the major advantages of using scikit-learn is its clean API structure. The data modeling classes derive from the BaseEstimator class and follow a common convention: every estimator implements the fit function, preprocessing classes additionally implement transform, and predictive models implement predict. We will see some examples to learn more about this.
Let's start with the preprocessing module in scikit-learn:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
We will use the PolynomialFeatures class in order to demonstrate the ease of using scikit-learn's API. Refer to the following link for polynomials:
https://en.wikipedia.org/wiki/Polynomial
With a set of predictor variables, we may want to add some more variables to our predictor set in order to see if our model accuracy can be improved. We can use polynomial combinations of the existing features as new features. The PolynomialFeatures class helps us do this:
# Data Preprocessing routines
x = np.asmatrix([[1,2],[2,4]])
We will first create a dataset. In this case, our dataset has two instances and two attributes:
poly = PolynomialFeatures(degree = 2)
We will proceed to instantiate our PolynomialFeatures class with the required degree of polynomials. In this case, it will be a second degree:
poly.fit(x)
x_poly = poly.transform(x)
Then, there are two functions, fit and transform. The fit function is used to do the necessary calculations for the transformation. In this case, fit has very little to learn from the data, since the transformation depends only on the number of input features, but we will see some more examples of how fit is used later in this recipe.
The transform function takes the input and, based on the calculations performed by fit, transforms the given input:
#alternatively
x_poly = poly.fit_transform(x)
Alternatively, fit and transform can be called in one shot using fit_transform. Let's look at the value and shape of our original and transformed x variable:
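As a small sketch of what to expect, the following repeats the transformation and prints both shapes and the transformed values. It uses a plain NumPy array of floats instead of np.asmatrix, because recent scikit-learn releases reject np.matrix input.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0, 2.0], [2.0, 4.0]])
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)

print(x.shape)
print(x_poly.shape)
# Columns: 1, a, b, a^2, a*b, b^2 for input columns a and b
print(x_poly)
```

The two original attributes become six: a constant term, the two attributes themselves, and the three second-degree products. The first row (1, 2) is therefore expanded to 1, 1, 2, 1, 2, 4.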
Any class that implements a machine learning method in scikit-learn derives from BaseEstimator. See the following link for BaseEstimator:
http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html
By convention, an implementing class provides the fit method, along with transform for preprocessing classes or predict for models. This way the API is kept very clean.
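To make the convention concrete, here is a minimal sketch of a custom preprocessing class. MeanCenterer is a hypothetical name introduced only for this illustration. It also assumes the TransformerMixin helper class from sklearn.base, which supplies fit_transform on top of fit and transform.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that subtracts the column means."""

    def fit(self, x, y=None):
        # Learn the statistics that transform will need later
        self.mean_ = np.asarray(x, dtype=float).mean(axis=0)
        return self  # returning self allows chained calls

    def transform(self, x):
        # Apply the statistics learned in fit
        return np.asarray(x, dtype=float) - self.mean_


x = np.array([[1.0, 2.0], [3.0, 6.0]])
centerer = MeanCenterer()
x_centered = centerer.fit(x).transform(x)
print(x_centered)

# fit_transform is supplied by TransformerMixin
x_centered_again = centerer.fit_transform(x)
print(x_centered_again)
```

Here fit does real work: it stores the column means, and transform reuses them, so the same centering can later be applied to new data.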
Let's see another example. Here, we import a class called DecisionTreeClassifier from the tree module. DecisionTreeClassifier implements the decision tree algorithm:
from sklearn.tree import DecisionTreeClassifier
Let's put this class into action:
from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']

estimator = DecisionTreeClassifier()
estimator.fit(x,y)
predicted_y = estimator.predict(x)
predicted_y_prob = estimator.predict_proba(x)
predicted_y_lprob = estimator.predict_log_proba(x)
Let's use the iris dataset to see how the tree algorithm can be used. We load the iris dataset into the x and y variables and then instantiate DecisionTreeClassifier. We build the model by invoking the fit function and passing our x predictors and y response variable. With the tree model built, we are ready to make predictions. We use the predict function to predict the class labels for the given input. As you can see, we follow the same pattern as with PolynomialFeatures: a call to fit, followed here by predict instead of transform. There are two other methods: predict_proba, which gives the predicted probability of each class for every instance, and predict_log_proba, which provides the logarithm of these probabilities.
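The relationship between predict and predict_proba can be checked with a short sketch. The random_state argument and the classes_ attribute used below are our additions for illustration and do not appear in the recipe.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
x, y = data['data'], data['target']

estimator = DecisionTreeClassifier(random_state=0)
estimator.fit(x, y)

predicted_y = estimator.predict(x)
predicted_y_prob = estimator.predict_proba(x)

# One row per instance, one column per class
print(predicted_y_prob.shape)

# Each row is a probability distribution over the classes
rows_sum_to_one = np.allclose(predicted_y_prob.sum(axis=1), 1.0)
print(rows_sum_to_one)

# The predicted label is the class with the highest probability
most_likely = estimator.classes_[np.argmax(predicted_y_prob, axis=1)]
labels_match = np.array_equal(most_likely, predicted_y)
print(labels_match)
```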
Let's now see another interesting utility called pipelining. Various machine learning methods can be chained together using a pipeline:
from sklearn.pipeline import Pipeline

# The polynomial degree is set with the degree parameter
poly = PolynomialFeatures(degree=3)
tree_estimator = DecisionTreeClassifier()
Let's start by instantiating the data processing routine, PolynomialFeatures, and the model, DecisionTreeClassifier:
steps = [('poly',poly),('tree',tree_estimator)]
We will define a list of tuples to indicate the order of our chaining. We want to run the polynomial feature generation, followed by our decision tree:
estimator = Pipeline(steps=steps)
estimator.fit(x,y)
predicted_y = estimator.predict(x)
We can now instantiate our Pipeline object with the list declared using the steps variable. Now, we can proceed to do business as usual by calling the fit and predict methods.
We can invoke the named_steps attribute in order to inspect the models in the various stages of our pipeline:
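A minimal sketch of such an inspection, rebuilding the pipeline from this recipe on the iris data. The step names 'poly' and 'tree' come from the list of tuples; the random_state argument and the variable names are ours.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
x, y = data['data'], data['target']

steps = [('poly', PolynomialFeatures(degree=3)),
         ('tree', DecisionTreeClassifier(random_state=0))]
estimator = Pipeline(steps=steps)
estimator.fit(x, y)

# named_steps maps each step name to the object used in that step
poly_step = estimator.named_steps['poly']
tree_step = estimator.named_steps['tree']
print(type(poly_step).__name__)
print(type(tree_step).__name__)

# Number of columns the tree saw after the polynomial expansion
n_expanded = poly_step.transform(x).shape[1]
print(n_expanded)
```

With four input attributes and degree 3, the polynomial step expands each row to 35 columns, and the tree is trained on those rather than on the original four.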
There are a lot more dataset creation functions available in scikit-learn. Refer to the following link:
http://scikit-learn.org/stable/datasets/
Earlier, we mentioned that datasets can be generated with desired properties. This also applies to the nonlinear datasets created using make_circles and make_moons. For instance, the data can be corrupted slightly by adding noise to the points. Refer to the following links for a list of options that are available in order to introduce such nuances in the data:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html
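As a sketch of two such options, the following assumes the noise parameter of both functions and the factor parameter of make_circles. The specific values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles, make_moons

# noise is the standard deviation of Gaussian noise added to the points
x_clean, y_clean = make_moons(n_samples=100, random_state=0)
x_noisy, y_noisy = make_moons(n_samples=100, noise=0.1, random_state=0)

print(x_noisy.shape)
points_unchanged = np.allclose(x_clean, x_noisy)
print(points_unchanged)  # False: the points were perturbed

# factor sets the size of the inner circle relative to the outer one
x_c, y_c = make_circles(n_samples=100, noise=0.05, factor=0.5,
                        random_state=0)
print(x_c.shape)
```

Plotting x_noisy or x_c with plt.scatter, as done earlier in this recipe, shows the same shapes with blurred edges, which makes for a more realistic test of a classifier.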