Chapter 12
In This Chapter
Understanding how Scikit-learn works with classes
Using sparse matrices and the hashing trick
Testing performances and memory consumption
Saving time with multicore algorithms
If you’ve gone through the previous chapters, by this point you’ve dealt with all the basic data loading and manipulation methods offered by Python. Now it’s time to start using some more complex instruments for data wrangling (or munging) and for machine learning. The final step of most data science projects is to build a data tool able to automatically summarize, predict, and recommend directly from your data.
Before taking that final step, you still have to massage your data by enforcing transformations that are even more radical. That’s the data wrangling or data munging part, where sophisticated transformations are followed by visual and statistical explorations, and then again by further transformations. In the following sections, you learn how to handle huge streams of text, explore the basic characteristics of a dataset, optimize the speed of your experiments, compress data and create new synthetic features, generate new groups and classifications, and detect unexpected or exceptional cases that may cause your project to go wrong.
From here onward, you use the Scikit-learn package more and more (which means knowing more about it; the full documentation appears at http://scikit-learn.org/stable/documentation.html). The Scikit-learn package, in fact, offers a single repository containing almost all the tools that you need to be a data scientist and for your data science project to be successful. In this chapter, you discover important characteristics of Scikit-learn, structured in modules, classes, and functions, as well as some advanced Python time savers for improving performance with big unstructured data and highly time-consuming computational operations.
Sometimes the best way to discover how to use something is to spend time playing with it. The more complex a tool, the more important play becomes. Given the complex math tasks you perform using Scikit-learn, playing becomes especially important. The following sections use the idea of playing with Scikit-learn to help you discover important concepts in using Scikit-learn to perform amazing feats of data science work.
Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package appropriately. Scikit-learn is the package for machine learning and data science experimentation favored by most data scientists. It contains a wide range of well-established learning algorithms, error functions, and testing procedures.
At its core, Scikit-learn features some base classes on which all the algorithms are built. Apart from BaseEstimator, the class from which all other classes inherit, there are four class types covering all the basic machine-learning functionalities:
- Classifying (ClassifierMixin)
- Regressing (RegressorMixin)
- Grouping by clusters (ClusterMixin)
- Transforming data (TransformerMixin)
Even though each base class has specific methods and attributes, the core functionalities for data processing and machine learning are guaranteed by one or more series of methods and attributes called interfaces. The interfaces provide a uniform Application Programming Interface (API) to enforce similarity of methods and attributes between all the different algorithms present in the package. There are four Scikit-learn object-based interfaces:
- estimator: For fitting parameters, learning them from data according to the algorithm
- predictor: For generating predictions from the fitted parameters
- transformer: For transforming data, applying the fitted parameters
- model: For reporting goodness of fit or other score measures

The package groups the algorithms built on base classes and one or more object interfaces into modules, each module displaying a specialization in a particular type of machine-learning solution. For example, the linear_model module is for linear modeling, and metrics is for score and loss measures.
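All this uniformity means that you can handle any learner in the same way. Here's a minimal sketch (assuming a current Scikit-learn installation): because every algorithm inherits from BaseEstimator, helpers such as get_params and clone from the sklearn.base module work on all of them.

from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
print(isinstance(estimator, BaseEstimator))  # True: inherits the base class
print(estimator.get_params())  # the hyper-parameters it was built with
fresh_copy = clone(estimator)  # an unfitted copy with the same hyper-parameters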
In order to find a specific algorithm in Scikit-learn, you must first find the module containing the same kind of algorithm that interests you, and then select it from the list of contents of the module. The algorithm is typically a class itself, whose methods and attributes are already known because they’re common to other algorithms in Scikit-learn.
Figuring out ways to use data science to obtain constructive results is important. For example, you can apply the estimator interface to a classification or a regression problem. The interface works through the fit(X, y) method, where X is the two-dimensional array of predictors (the set of observations to learn from) and y is the target outcome (a one-dimensional array).
Applying fit relates the information in X to y so that, given new information with the same characteristics as X, you can correctly guess y. In the process, the fit method estimates some parameters internally. Using fit makes it possible to distinguish between parameters, which are learned from the data, and hyper-parameters, which you instead fix when you instantiate the learner.
Instantiation involves assigning a Scikit-learn class to a Python variable. In addition to hyper-parameters, you can also fix other working parameters, such as requiring normalization or setting a random seed to reproduce the same results for each call, given the same input data.
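As a small illustration, here's a sketch using SGDRegressor, a Scikit-learn learner whose training involves randomness (the parameter names are standard for this class):

from sklearn.linear_model import SGDRegressor

# Hyper-parameters and working parameters are all fixed at instantiation;
# random_state pins the random seed so every call reproduces the same results
learner = SGDRegressor(penalty='l2', alpha=0.0001, random_state=42)
print(learner.get_params()['random_state'])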
Here is an example with linear regression, a very basic and common machine-learning algorithm. You must load some data to use this example, and Scikit-learn provides some useful example datasets. The Boston dataset, for instance, contains predictor variables that the example code can match against house prices, which helps build a predictor that can figure out the value of a house given some of its characteristics.
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print(X.shape, y.shape)
(506, 13) (506,)
The output specifies that both arrays have the same number of rows and that X has 13 features. The shape attribute reports the dimensions of each array.
Now, after importing the LinearRegression class, you can instantiate a variable called hypothesis and set a parameter asking the algorithm to normalize the variables (centering each variable and rescaling it so that all the variables operate on a similar scale) before estimating the parameters to learn. (Note that recent Scikit-learn versions removed the normalize parameter in favor of explicit preprocessing, for example with StandardScaler.)
from sklearn.linear_model import LinearRegression
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X, y)
print(hypothesis.coef_)
[ -1.07170557e-01 4.63952195e-02 2.08602395e-02
2.68856140e+00 -1.77957587e+01 3.80475246e+00
7.51061703e-04 -1.47575880e+00 3.05655038e-01
-1.23293463e-02 -9.53463555e-01 9.39251272e-03
-5.25466633e-01]
After fitting, hypothesis holds the learned parameters, and you can inspect them using the coef_ attribute, which is typical of all the linear models (where the model output is a summation of variables weighted by coefficients). You can also call this fitting activity training (as in, “training a machine learning algorithm”).
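Alongside coef_, fitted linear models also expose the constant term of that weighted summation. A quick check, reusing the hypothesis object fitted above:

print(hypothesis.intercept_)  # the constant term of the linear model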
Apart from the estimator class, the predictor and the model object classes are also important. The predictor interface generates results for new observations using the predict method (and, for classifiers, the predict_proba method, which reports the probability of each possible result), as in this script:
import numpy as np
new_observation = np.array(
    [1, 0, 1, 0, 0.5, 7, 59, 6, 3, 200, 20, 350, 4],
    dtype=float)
# Recent Scikit-learn versions expect a 2D array even for one observation
print(hypothesis.predict(new_observation.reshape(1, -1)))
[ 25.8972784]
The model interface provides information about the quality of the fit using the score method, as shown here:
hypothesis.score(X,y)
0.74060774286494291
In this case, score returns the coefficient of determination, R^2, of the prediction. R^2 is a measure whose best possible value is 1; it compares the predictor to a baseline that always predicts the simple mean of y, so values near 1 show that the predictor is working well, and values near 0 show that it's doing no better than the mean. Different learning algorithms may use different scoring functions. Please consult the online documentation of each algorithm or ask for help on the Python console:
help(LinearRegression)
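For regressors, score computes the same R^2 that the sklearn.metrics module derives from predictions, so you can cross-check the value by hand, reusing the fitted hypothesis:

from sklearn.metrics import r2_score
predictions = hypothesis.predict(X)
print(r2_score(y, predictions))  # matches hypothesis.score(X, y)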
The transformer interface applies transformations derived from the fitting phase to other data arrays. LinearRegression doesn't have a transform method, but most preprocessing algorithms do. For example, MinMaxScaler, from the Scikit-learn preprocessing module, can rescale values into a specific range of minimum and maximum values, learning the transformation formula from an example array.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X)
# Again, a single observation must be reshaped into a 2D array
print(scaler.transform(new_observation.reshape(1, -1)))
[[ 0.01116872 0. 0.01979472 0.
0.23662551 0.65893849 0.57775489 0.44288845
0.08695652 0.02480916 0.78723404 0.88173887
0.06263797]]
In this case, the code applies the minimum and maximum values learned from X to the new_observation variable and returns the transformed values.
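To see exactly what transform applies, you can redo the computation by hand (a minimal sketch, reusing the fitted scaler): MinMaxScaler stores each column's minimum and maximum, learned during fit, in the data_min_ and data_max_ attributes.

import numpy as np
# With feature_range=(0, 1), the learned transformation is just
# (x - min) / (max - min), computed column by column
manual = (new_observation - scaler.data_min_) / (
    scaler.data_max_ - scaler.data_min_)
print(np.allclose(manual, scaler.transform(new_observation.reshape(1, -1))))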
Scikit-learn provides you with most of the data structures and functionality you need to complete your data science project. There are even classes for the trickiest and most advanced problems.
For instance, when dealing with text, one of the most useful solutions provided by the Scikit-learn package is the hashing trick. Chapter 7 shows how to work with text by using the bag of words model (in the “Using the Bag of Words Model and Beyond” section) and how to weight words with TF-IDF. All these powerful transformations can operate properly only if all your text is known and available in the memory of your computer.
A more serious data science challenge is to analyze online-generated text flows, such as from social networks or large online text repositories. This scenario poses quite a challenge when trying to turn the text into a data matrix suitable for analysis. When working through such problems, knowing the hashing trick can give you quite a few advantages:
Hash functions can transform any input into an output whose characteristics are predictable. Usually they return an output bounded within a specific interval, whether it spans from negative to positive numbers or covers only positive numbers. You can imagine them as enforcing a standard on your data: No matter what values you provide, they always return a specific data product.
Their most useful characteristic is that, given a certain input, they always provide the same numeric output value, so they're called deterministic functions. For example, input a word like dog and the hashing function always returns the same number.

In a certain sense, hash functions are like a secret code, transforming everything into numbers. Unlike secret codes, however, you can't convert the hashed code back to its original value. In addition, in some rare cases, different inputs generate the same hashed result (a situation called a hash collision).
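Here's a minimal sketch of why collisions become unavoidable once you bound the output: Squeezing more distinct words than there are available positions forces at least two words to share a position. (The exact buckets vary between sessions, because current Python versions randomize string hashes.)

words = ('the quick brown fox jumps over a lazy dog near '
         'the old river bank at dawn with great speed today').split()
# 19 distinct words must share 10 buckets, so collisions are guaranteed
buckets = [abs(hash(word)) % 10 for word in words]
print(len(set(words)), 'distinct words fall into',
      len(set(buckets)), 'distinct buckets')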
There are many hash functions, with MD5 (often used to check file integrity, because you can hash entire files) and SHA (used in cryptography) being among the most popular. Python possesses a built-in hash function, named hash, that it uses to compare data objects before storing them in dictionaries. For instance, you can test how Python hashes its name:

hash('Python')
-539294296

Note that current Python versions randomize string hashes for security, so your output will differ from the one shown here between sessions unless you fix the PYTHONHASHSEED environment variable.
A hash function can also return an index in a specific positive range, which is how the Scikit-learn hashing classes use it. You can obtain something similar from the built-in hash by employing standard division and taking its remainder:
abs(hash('Python')) % 1000
296
When you take the remainder of the absolute value of the hash result, you get a number that never equals or exceeds the value you used for the division.
To see how this works, pretend that you want to transform a text string from the Internet into a numeric vector (a feature vector) so that you can use it for starting a machine-learning project. A good strategy for managing this data science task is to employ one-hot encoding, which produces a bag of words. Here are the steps for one-hot encoding a string (“Python for data science”) into a vector:

1. Assign a number to each word, for instance, Python=0, for=1, data=2, science=3.
2. Initialize a vector whose length equals the number of words you assigned a code to.
3. Place a 1 in the position corresponding to each word's number.
The resulting feature vector is expressed as the sequence [1,1,1,1] and made of exactly four elements. You have started the machine-learning process, telling the program to expect sequences of four text features, when suddenly a new phrase arrives and you must vectorize the following text as well: “Python for machine learning”. Now you have two new words, machine and learning, to work with. The following steps help you create the new vectors:

1. Extend your encoding with the new words, for instance, machine=4 and learning=5, so the vectors now contain six elements.
2. The first string, “Python for data science”, becomes [1,1,1,1,0,0].
3. The new string, “Python for machine learning”, becomes [1,1,0,0,1,1].

One-hot encoding is quite optimal because it creates efficient and ordered feature vectors. Unfortunately, it fails and becomes difficult to handle when your project experiences a lot of variability in its inputs. This is a common situation in data science projects working with text or other symbolic features, where data flowing from the Internet or other online environments can suddenly create or add to your initial data. Using hash functions is a smarter way to handle unpredictability in your inputs.
In Python, you can define a simple hashing trick by creating a function and checking the results using the two test strings:
def hashing_trick(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1
    return feature_vector
Now you can test both strings.
hashing_trick(input_string='Python for data science',
vector_size=20)
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0]
hashing_trick(input_string='Python for machine learning',
vector_size=20)
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,
0]
When viewing the feature vectors, you should notice that:

- Words shared by the two strings (Python and for) land in the same positions of both vectors (here, positions 16 and 18), while the differing words occupy other positions.
- If you reduce the vector_size function parameter (for example, to vector_size=10), many words overlap in the same positions in the list representing the feature vector. To keep the overlap to a minimum, you must create hash function boundaries that are greater than the number of elements you plan to index later.

The feature vectors in this example are made mostly of zero entries, representing a waste of memory when compared to the more memory-efficient one-hot encoding. One of the ways in which you can solve this problem is to rely on sparse matrices, as described in the next section.
Sparse matrices are the answer when dealing with data that has few nonzero values, that is, when most of the matrix cells contain zeroes. Sparse matrices store just the coordinates of the nonzero cells and their values, instead of storing the information for every cell in the matrix. When an application requests data from an empty cell, the sparse matrix returns a zero value after looking for the coordinates and not finding them. Here's an example vector:
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0]
The following code turns it into a sparse matrix.
from scipy.sparse import csc_matrix
print(csc_matrix([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 1, 0, 1, 0]))
(0, 0) 1
(0, 5) 1
(0, 16) 1
(0, 18) 1
Notice that the data representation consists of coordinates (expressed as a tuple of row and column indexes) and the cell value.
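You can also build the same vector directly from its coordinates, mirroring how the data is stored. A minimal sketch using SciPy's coordinate-based constructor:

from scipy.sparse import csc_matrix
values = [1, 1, 1, 1]
rows = [0, 0, 0, 0]    # every entry sits on the single row 0
cols = [0, 5, 16, 18]  # column positions of the nonzero cells
print(csc_matrix((values, (rows, cols)), shape=(1, 20)))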
The SciPy package offers a large variety of sparse matrix structures, each one storing the data in a different way and each one performing differently. (Some are good with slicing; others are better for computations.) Usually the csc_matrix (a matrix compressed by columns; its row-compressed sibling is csr_matrix) is a good choice because most Scikit-learn algorithms accept it as input and it's optimal for matrix operations.
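Converting between formats is cheap, so you can pick the layout that suits the operation at hand. A small sketch, reusing the example vector:

sparse_vector = csc_matrix([1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
                            0, 0, 0, 0, 0, 0, 1, 0, 1, 0])
row_major = sparse_vector.tocsr()  # row-compressed copy, better for row slicing
print(row_major.nnz)               # 4 stored (nonzero) elements
print(sparse_vector.toarray())     # back to a dense NumPy array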
As a data scientist, you don't have to worry about programming your own version of the hashing trick unless you want some special implementation of the idea. Scikit-learn offers HashingVectorizer, a class that rapidly transforms any collection of text into a sparse data matrix using the hashing trick. Here's an example script that replicates the previous example:
import sklearn.feature_extraction.text as txt
one_hot_encoder = txt.CountVectorizer()
one_hot_encoded = one_hot_encoder.fit_transform(
    ['Python for data science',
     'Python for machine learning'])
one_hot_encoded
<2x6 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>
As soon as new text arrives, CountVectorizer shows its limitation: The vocabulary it learned during fitting is frozen, so any word it has never seen simply drops out of the encoding, and a completely new phrase becomes an all-zero row:

one_hot_encoder.transform(['New text has arrived'])
<1x6 sparse matrix of type '<class 'numpy.int64'>'
with 0 stored elements in Compressed Sparse Row format>
With HashingVectorizer, there is always a place for new words in the data matrix. At worst, a word settles in an already occupied position, causing a word collision.
sklearn_hashing_trick = txt.HashingVectorizer(
    n_features=20, binary=True, norm=None)
text_vector = sklearn_hashing_trick.transform(
    ['Python for data science',
     'Python for machine learning'])
text_vector
<2x20 sparse matrix of type '<class 'numpy.float64'>'
with 8 stored elements in Compressed Sparse Row format>

sklearn_hashing_trick.transform(['New text has arrived'])
<1x20 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
As the book introduces more and more complex themes, such as Scikit-learn machine-learning classes and SciPy sparse matrices, you may start to wonder how all this processing might influence application speed. The increased processing requirements affect both running time and available memory.
Managing the best use of machine resources is indeed an art, the art of optimization, and it requires time to master. However, you can start becoming proficient in it immediately by taking accurate speed measurements and realizing what your problems really are. Profiling the time that operations require, and measuring how much memory adding more data or performing a transformation on your data consumes, helps you spot the bottlenecks in your code and start looking for alternative solutions.
As described in Chapter 11, IPython is the perfect environment for experimenting, tweaking, and improving your code. Working on blocks of code, recording the results and outputs, and writing additional notes and comments will help your data science solutions take shape in a controlled and reproducible way.
While working through the hashing trick example in the “Performing the Hashing Trick” section, earlier in this chapter, you compare two alternatives for encoding textual information into a data matrix, each addressing different needs:

- CountVectorizer: Optimally encodes text into a data matrix but cannot address subsequent novelties in text.
- HashingVectorizer: Provides flexibility in situations where the application is likely to receive new text, but produces a less compact, less interpretable encoding than CountVectorizer.

Although their advantages are quite clear in terms of how they handle the data, you may wonder what impact using one or the other has on your data processing in terms of speed and memory feasibility.
Concerning speed, IPython offers an easy, out-of-the-box solution: the line magic %timeit and the cell magic %%timeit:

- %timeit: Calculates the best performance time for a single instruction.
- %%timeit: Calculates the best performance time for all the instructions in a cell, apart from the one placed on the same cell line as the cell magic (which could therefore be an initialization instruction).

Both magic commands report the best performance in r trials repeated for n loops. Unless you set those numbers explicitly with the -r and -n parameters, IPython chooses them automatically in order to provide a fast answer.
Here is an example that tests whether it is faster to build a list of 10**6 ordinal values by using a list comprehension or by appending the values in a for loop:
%timeit l = [k for k in range(10**6)]
10 loops, best of 3: 94.8 ms per loop
You can verify the list comprehension result by increasing both the number of loops and the repetitions of the test:

%timeit -n 20 -r 5 l = [k for k in range(10**6)]
20 loops, best of 5: 95.6 ms per loop
Because the for loop requires an entire cell, the example uses the cell magic, %%timeit. Notice that the first line, which assigns the value of 10**6 to a variable, is not considered in the timing.
%%timeit limit = 10**6
l = list()
for k in range(limit):
    l.append(k)

10 loops, best of 3: 176 ms per loop
The results show that the list comprehension takes roughly half the time of the for loop. You can then repeat the test using different text encoding strategies:
import sklearn.feature_extraction.text as txt
sklearn_hashing_trick = txt.HashingVectorizer(
    n_features=20, binary=True, norm=None)
encoder = txt.CountVectorizer()
texts = ['Python for data science',
         'Python for machine learning']
After performing initial loading of the classes and instantiating them, you can test the two solutions:
%timeit encoded = encoder.fit_transform(texts)
1000 loops, best of 3: 1.27 ms per loop
%timeit hashing = sklearn_hashing_trick.transform(texts)
10000 loops, best of 3: 158 µs per loop
The hashing trick is noticeably faster than the one-hot encoder. The difference arises because CountVectorizer must build and maintain the vocabulary that keeps track of how words are encoded, bookkeeping that the hashing trick skips entirely (at the cost of possible collisions).
IPython is the best environment for benchmarking the speed of your data science solution code. If you'd like to track performance on the command line or in a script running from an IDE, you can import the timeit module and use its timeit function, providing the command to measure as a string.

If your command needs variables, classes, or functions that aren't available in base Python (such as the Scikit-learn classes), you provide them through a second string parameter, the setup code, in which Python imports all the necessary objects from the main environment, as shown in the following example:
import timeit
cumulative_time = timeit.timeit(
    "hashing = sklearn_hashing_trick.transform(texts)",
    "from __main__ import sklearn_hashing_trick, texts",
    number=10000)
print(cumulative_time / 10000.0)
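If you also want an idea of the spread between runs, the same module offers timeit.repeat, which returns one cumulative timing per repetition so that you can keep the best one, much as %timeit does. A small sketch, continuing the snippet above:

best_time = min(timeit.repeat(
    "hashing = sklearn_hashing_trick.transform(texts)",
    "from __main__ import sklearn_hashing_trick, texts",
    repeat=3, number=10000)) / 10000.0
print(best_time)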
As you've seen when testing your application code for performance (speed) characteristics, you can obtain analogous information about memory usage. Keeping track of memory consumption can warn you about possible problems in the way data is processed or transmitted to the learning algorithms. The memory_profiler package implements the required functionality. This package is not provided as a default Python or IPython package, so it requires installation. Use the following commands to install the package and its psutil dependency from the command line:

python -m pip install psutil
python -m pip install memory_profiler
Use the following command for each IPython session you want to monitor:
%load_ext memory_profiler
After performing these tasks, you can easily track how much memory a command consumes:
hashing = sklearn_hashing_trick.transform(texts)
%memit dense_hashing = hashing.toarray()
peak memory: 68.79 MiB, increment: 0.14 MiB
Obtaining a complete overview of memory consumption is possible by saving an IPython cell to disk and then profiling it with the line magic %mprun applied to an externally imported function. (The line magic works only on functions defined in external Python scripts.) Profiling produces a detailed report, command by command, as shown in the following example:
%%writefile example_code.py
import sklearn.feature_extraction.text as txt
def comparison_test():
    sklearn_hashing_trick = txt.HashingVectorizer(
        n_features=20, binary=True, norm=None)
    one_hot_encoder = txt.CountVectorizer()
    texts = ['Python for data science',
             'Python for machine learning']
    one_hot_encoded = one_hot_encoder.fit_transform(
        texts)
    hashing = sklearn_hashing_trick.transform(texts)
from example_code import comparison_test
%mprun -f comparison_test comparison_test()
Line # Mem usage Increment Line Contents
========================================
2 68.5 MiB 0.0 MiB def comparison_test():
3 68.5 MiB 0.0 MiB HashingVectorizer(…)
4 68.5 MiB 0.0 MiB CountVectorizer(…)
5 68.5 MiB 0.0 MiB texts = […]
6 68.7 MiB 0.2 MiB one_hot_encoder.fit_t(…)
7 68.7 MiB 0.0 MiB sklearn_hashing_trick.(…)
The resulting report details the memory usage from every line in the function, pointing out the major increments.
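If you work outside IPython, the memory_profiler package can produce the same per-line report from a plain script. A minimal sketch, assuming the package installed as shown earlier: Decorate the function with profile and run the script normally.

from memory_profiler import profile
import sklearn.feature_extraction.text as txt

@profile
def build_vectors():
    # Every line of this function appears in the printed memory report
    vectorizer = txt.HashingVectorizer(
        n_features=20, binary=True, norm=None)
    return vectorizer.transform(['Python for data science'])

if __name__ == '__main__':
    build_vectors()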
Most computers today are multicore (two or more processors in a single package), some with multiple physical CPUs. One of the most important limitations of Python is that it uses a single core by default. (It was created in a time when single cores were the norm.)
Data science projects require quite a lot of computations. In particular, a part of the scientific aspect of data science relies on repeated tests and experiments on different data matrices. Don't forget that working with huge data quantities means that most time-consuming transformations repeat observation after observation (for example, identical and independent operations on different parts of a matrix).
Using more CPU cores accelerates a computation by a factor that almost matches the number of cores. For example, having four cores would mean working at best four times faster. You don’t receive a full fourfold increase because there is overhead when starting a parallel process — new running Python instances have to be set up with the right in-memory information and launched; consequently, the improvement will be less than potentially achievable but still significant. Knowing how to use more than one CPU is therefore an advanced but incredibly useful skill for increasing the number of analyses completed, and for speeding up your operations both when setting up and when using your data products.
To perform multicore parallelism with Python, you integrate the Scikit-learn package with the joblib package for time-consuming operations, such as replicating models for validating results or for looking for the best hyper-parameters. In particular, Scikit-learn allows multiprocessing when:

- Cross-validating: Testing the results of a machine-learning hypothesis using different training and testing sets of data
- Grid-searching: Systematically trying different combinations of hyper-parameters and testing the consequent results
- Multilabel prediction: Running an algorithm multiple times when there are several target outcomes to predict at the same time
- Ensemble methods: Fitting collections of models, such as RandomForest-based modeling
parameter to a number of cores more than 1 or by setting the value to –1, which means you want to use all the available CPU instances.
It's a good idea to use IPython when you run a demonstration of how multiprocessing can save time during data science projects, because it provides the %timeit magic command for timing execution. You start by loading a multiclass dataset, a complex machine-learning algorithm (the Support Vector Classifier, or SVC), and a cross-validation procedure for estimating reliable resulting scores from all the procedures. You find details about all these tools later in the book. The most important thing to know is that the procedure is computationally hefty: Cross-validating with cv=20 trains and tests 20 SVC models, and %timeit repeats the whole procedure several times.
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data,digits.target
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score
%timeit single_core_learning = cross_val_score(SVC(), X,
y, cv=20, n_jobs=1)
Out [1] : 1 loops, best of 3: 17.9 s per loop
After this test, you need to activate the multicore parallelism and time the results using the following commands:
%timeit multi_core_learning = cross_val_score(SVC(), X, y,
cv=20, n_jobs=-1)
Out [2] : 1 loops, best of 3: 11.7 s per loop
The example machine demonstrates a positive advantage from multicore processing, despite using a small dataset on which Python spends most of the time starting consoles and running a part of the code in each one. This overhead, a few seconds, is significant, given that the total execution extends for only a handful of seconds. Just imagine what would happen if you worked with larger sets of data: Your execution time could easily be cut to a half or a third.
Although the code works fine with IPython, putting it down in a script and asking Python to run it in a console or using an IDE may cause errors because of the internal operations of a multicore task. The solution, as mentioned before, is to put all the code under an if statement that checks whether the program was run directly rather than imported by another Python process. Here's an example script:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

if __name__ == '__main__':
    digits = load_digits()
    X, y = digits.data, digits.target
    multi_core_learning = cross_val_score(SVC(), X, y,
        cv=20, n_jobs=-1)