Machine learning with scikit-learn

Scikit-learn is a versatile machine learning library in Python, and we will use it extensively throughout this book. We used scikit-learn version 0.15.2 for all the recipes in this book. From the Python prompt, you can inspect the __version__ attribute to check which version is installed:
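For example, a minimal check (assuming scikit-learn is importable as sklearn):

import sklearn
print sklearn.__version__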


Getting ready

In this recipe, we will demonstrate some of the capabilities of scikit-learn and look at how its API is organized, so that we can follow it seamlessly in our future recipes.

How to do it…

Scikit-learn provides us with several inbuilt datasets. Let's see how to access and use them:

#Recipe_3a.py
from sklearn.datasets import load_iris, load_boston, make_classification, \
    make_circles, make_moons


# Iris dataset
data = load_iris()
x = data['data']
y = data['target']
y_labels = data['target_names']
x_labels = data['feature_names']

print
print x.shape
print y.shape
print x_labels
print y_labels

# Boston dataset
data = load_boston()
x = data['data']
y = data['target']
x_labels = data['feature_names']

print
print x.shape
print y.shape
print x_labels


# make some classification dataset
x,y = make_classification(n_samples=50,n_features=5, n_classes=2)

print
print x.shape
print y.shape

print x[1,:]
print y[1]

# Some non linear dataset
x,y = make_circles()
import numpy as np
import matplotlib.pyplot as plt
plt.close('all')
plt.figure(1)
plt.scatter(x[:,0],x[:,1],c=y)

x,y = make_moons()
import numpy as np
import matplotlib.pyplot as plt
plt.figure(2)
plt.scatter(x[:,0],x[:,1],c=y)

plt.show()

Let's proceed to see how we can invoke some machine learning functionality in scikit-learn:

#Recipe_3b.py
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Data Preprocessing routines
x = np.asmatrix([[1,2],[2,4]])
poly = PolynomialFeatures(degree = 2)
poly.fit(x)
x_poly = poly.transform(x)

print "Original x variable shape",x.shape
print x
print
print "Transformed x variables",x_poly.shape
print x_poly


#alternatively 
x_poly = poly.fit_transform(x)


from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']

estimator = DecisionTreeClassifier()
estimator.fit(x,y)
predicted_y = estimator.predict(x)
predicted_y_prob = estimator.predict_proba(x)
predicted_y_lprob = estimator.predict_log_proba(x)


from sklearn.pipeline import Pipeline

poly = PolynomialFeatures(degree=3)
tree_estimator = DecisionTreeClassifier()

steps = [('poly',poly),('tree',tree_estimator)]
estimator = Pipeline(steps=steps)
estimator.fit(x,y)
predicted_y = estimator.predict(x)

How it works…

Let's import the module from the scikit-learn library that contains the various functions for loading the inbuilt datasets:

from sklearn.datasets import load_iris,load_boston,make_classification

The first dataset that we will look at is the iris dataset. Refer to https://en.wikipedia.org/wiki/Iris_flower_data_set for more information.

Introduced by Sir Ronald Fisher, this is a classic dataset for a classification problem:

data = load_iris()
x = data['data']
y = data['target']
y_labels = data['target_names']
x_labels = data['feature_names']

The load_iris function, when invoked, returns a dictionary-like object. The predictor x, the response variable y, the response variable names, and the feature names can be extracted by querying this object with the appropriate keys.

Let's proceed to print them and see their values:

print
print x.shape
print y.shape
print x_labels
print y_labels
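The output should look something like the following (shown here for reference; the exact formatting of the feature names can vary slightly between scikit-learn versions):

(150, 4)
(150,)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']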

As you can see, our predictors have 150 instances and four attributes. Our response variable has 150 entries, one class label for each row in our predictor set. We then print out the attribute names, the sepal and petal lengths and widths, and finally, the class labels. We will use this dataset extensively in most of our future recipes.

Let's proceed to inspect another inbuilt dataset, the Boston housing dataset, which is used for regression problems:

# Boston dataset
data = load_boston()
x = data['data']
y = data['target']
x_labels = data['feature_names']

The data is loaded in much the same way as the iris dataset, and the various components of the data, including the predictors and the response variable, are queried using the respective keys of the dictionary-like object. Let's print these variables in order to inspect them:
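For reference, the print statements from the recipe should produce output along these lines (the feature names are the standard abbreviations stored in the dataset):

print x.shape    # (506, 13)
print y.shape    # (506,)
print x_labels   # ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']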


As you can see, our predictor set x has 506 instances and 13 attributes. Our response variable has 506 entries. Finally, we will also print out the names of our attributes.

Scikit-learn also provides us with functions that will help us produce a random classification dataset with some desired properties:

# make some classification dataset
x,y = make_classification(n_samples=50,n_features=5, n_classes=2)

The make_classification function generates a random classification dataset. In our example, we generated a dataset with 50 instances (set by the n_samples parameter), five attributes (n_features), and two classes (n_classes). Let's inspect the output of this function:

print x.shape
print y.shape

print x[1,:]
print y[1]

As you can see, our predictor x has 50 instances with five features, and our response variable has 50 entries, with a class label for each of the prediction instances.

We then print out the second record in our predictor set, x. You can see that we have a vector of dimension 5, corresponding to the five features that we requested. Finally, we also print the response variable, y; in this run, the class label for the second row happens to be 1 (the dataset is generated randomly, so your values may differ).

Scikit-learn also provides us with functions that can generate data with nonlinear relationships:

# Some non linear dataset
x,y = make_circles()
import numpy as np
import matplotlib.pyplot as plt
plt.close('all')
plt.figure(1)
plt.scatter(x[:,0],x[:,1],c=y)

You should be familiar with pyplot now from the previous recipe. Let's see our plot first to understand the nonlinear relationship:


As you can see, make_circles has produced two concentric circles. Our x is a dataset with two variables, and y holds the class labels. As shown by the concentric circles, the relationship between the predictor variables and the class labels is nonlinear.

Another interesting function to produce a nonlinear relationship is make_moons from scikit-learn:

x,y = make_moons()
import numpy as np
import matplotlib.pyplot as plt
plt.figure(2)
plt.scatter(x[:,0],x[:,1],c=y)

Let's look at its plot in order to understand the nonlinear relationship:


The crescent-shaped plot shows that the attributes in our predictor set x are nonlinearly related to each other.

Let's switch gears to understand the API structure of scikit-learn. One of the major advantages of using scikit-learn is its clean API structure. All the data modeling classes derive from the BaseEstimator class and follow the same convention: every class implements a fit method, and, depending on its role, a transform method (for data transformers) or a predict method (for models). We will see some examples to learn more about this.

Let's start with the preprocessing module in scikit-learn:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

We will use the PolynomialFeatures class in order to demonstrate the ease of using scikit-learn's API. Refer to the following link for more about polynomials:

https://en.wikipedia.org/wiki/Polynomial

With a set of predictor variables, we may want to add some more variables to our predictor set in order to see if our model accuracy can be improved. We can use polynomial combinations of the existing features as new features. The PolynomialFeatures class helps us do this:

# Data Preprocessing routines
x = np.asmatrix([[1,2],[2,4]])

We will first create a dataset. In this case, our dataset has two instances and two attributes:

poly = PolynomialFeatures(degree = 2)

We will proceed to instantiate our PolynomialFeatures class with the required degree of polynomials. In this case, it will be a second degree:

poly.fit(x)
x_poly = poly.transform(x)

Then, there are two methods, fit and transform. The fit method performs the calculations needed for the transformation. In this case, fit has little to do, as the polynomial expansion does not depend on the actual data values, but we will see more examples of how fit is used later in this recipe.

The transform function takes the input and, based on the calculations performed by fit, transforms the given input:

#alternatively 
x_poly = poly.fit_transform(x)

Alternatively, in this case, fit and transform can be called in one shot. Let's look at the value and shape of our original and transformed x variable:
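For this tiny example, the output can be verified by hand. With degree 2 and the default bias column, each row [a, b] expands to [1, a, b, a*a, a*b, b*b], so the printed output should look roughly like this:

Original x variable shape (2, 2)
[[1 2]
 [2 4]]

Transformed x variables (2, 6)
[[  1.   1.   2.   1.   2.   4.]
 [  1.   2.   4.   4.   8.  16.]]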


Any class that implements a machine learning method in scikit-learn has to derive from BaseEstimator. See the following link for BaseEstimator:

http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html

BaseEstimator provides the common plumbing, such as the get_params and set_params methods, and the derived classes follow the convention of providing fit together with transform or predict. This way, the API is kept very clean.
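As a small illustration (not part of the original recipe), both classes used in this recipe are BaseEstimator subclasses and therefore expose their constructor parameters through the inherited get_params method:

from sklearn.base import BaseEstimator
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

# Both classes derive from BaseEstimator
print isinstance(PolynomialFeatures(), BaseEstimator)      # True
print isinstance(DecisionTreeClassifier(), BaseEstimator)  # True
# get_params comes for free from the base class
print PolynomialFeatures(degree=2).get_params()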

Let's see another example. Here, we imported a class called DecisionTreeClassifier from the module tree. DecisionTreeClassifier implements the decision tree algorithm:

from sklearn.tree import DecisionTreeClassifier

Let's put this class into action:

from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']

estimator = DecisionTreeClassifier()
estimator.fit(x,y)
predicted_y = estimator.predict(x)
predicted_y_prob = estimator.predict_proba(x)
predicted_y_lprob = estimator.predict_log_proba(x)

Let's use the iris dataset to see how the tree algorithm can be used. We will load the iris dataset into the x and y variables and then instantiate DecisionTreeClassifier. We will proceed to build the model by invoking the fit function, passing our x predictors and y response variable. This builds the tree model. Now, we are ready to do some predictions with our model. We will use the predict function in order to predict the class labels for the given input. As you can see, we leveraged the same fit convention as in PolynomialFeatures, with predict playing the role that transform played there. There are two other methods: predict_proba, which gives the probability of each class for a prediction, and predict_log_proba, which provides the logarithm of those probabilities.
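For the iris data, a quick sanity check of the shapes of these outputs (not taken from the book's screenshots) would look like this:

print predicted_y.shape        # (150,)   one predicted label per instance
print predicted_y_prob.shape   # (150, 3) one probability per class, per instance
print predicted_y_lprob.shape  # (150, 3) logarithm of the class probabilities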

Let's now see another interesting utility called pipelining. Various machine learning methods can be chained together using pipelining:

from sklearn.pipeline import Pipeline

poly = PolynomialFeatures(degree=3)
tree_estimator = DecisionTreeClassifier()

Let's start by instantiating the data processing routines, PolynomialFeatures and DecisionTreeClassifier:

steps = [('poly',poly),('tree',tree_estimator)]

We will define a list of tuples to indicate the order of our chaining. We want to run the polynomial feature generation, followed by our decision tree:

estimator = Pipeline(steps=steps)
estimator.fit(x,y)
predicted_y = estimator.predict(x)

We can now instantiate our Pipeline object with the list declared using the steps variable. Now, we can proceed to do business as usual by calling the fit and predict methods.

We can invoke the named_steps attribute in order to inspect the models in the various stages of our pipeline:
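For example (the step names 'poly' and 'tree' are simply the names we chose when defining the steps list):

print estimator.named_steps['poly']
print estimator.named_steps['tree']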


There's more…

There are a lot more dataset creation functions available in scikit-learn. Refer to the following link:

http://scikit-learn.org/stable/datasets/

While creating nonlinear datasets using make_circles and make_moons, we mentioned that a number of desired properties can be controlled when generating the data. For example, the points can be perturbed slightly by adding Gaussian noise. Refer to the following links for a list of options that are available in order to introduce such nuances in the data:

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html
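As a quick sketch (the noise value of 0.1 here is an arbitrary choice for illustration), Gaussian noise can be added to the generated points through the noise parameter:

x,y = make_moons(n_samples=100, noise=0.1)
plt.figure(3)
plt.scatter(x[:,0],x[:,1],c=y)
plt.show()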

See also

  • Plotting recipe in Chapter 2, Working with Python Environments