Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 12

Stretching Python’s Capabilities

In This Chapter

Understanding how Scikit-learn works with classes

Using sparse matrices and the hashing trick

Testing performances and memory consumption

Saving time with multicore algorithms

If you’ve gone through the previous chapters, by this point you’ve dealt with all the basic data loading and manipulation methods offered by Python. Now it’s time to start using some more complex instruments for data wrangling (or munging) and for machine learning. The final step of most data science projects is to build a data tool able to automatically summarize, predict, and recommend directly from your data.

Before taking that final step, you still have to massage your data by enforcing transformations that are even more radical. That’s the data wrangling or data munging part, where sophisticated transformations are followed by visual and statistical explorations, and then again by further transformations. In the following sections, you learn how to handle huge streams of text, explore the basic characteristics of a dataset, optimize the speed of your experiments, compress data and create new synthetic features, generate new groups and classifications, and detect unexpected or exceptional cases that may cause your project to go wrong.

From here onward, you use the Scikit-learn package more and more (which means knowing more about it — the full documentation appears at http://scikit-learn.org/stable/documentation.html). The Scikit-learn package, in fact, offers a single repository containing almost all the tools that you need to be a data scientist and for your data science project to be successful. In this chapter, you discover important characteristics of Scikit-learn, structured in modules, classes, and functions, and some advanced Python time savers for improving performance with big unstructured data and highly time-consuming computational operations.

You don’t have to type the source code for this chapter in by hand. In fact, it’s a lot easier if you use the downloadable source (see the Introduction for download instructions). The source code for this chapter appears in the P4DS4D; 12; Stretching Pythons Capabilities.ipynb source code file.

Playing with Scikit-learn

Sometimes the best way to discover how to use something is to spend time playing with it. The more complex a tool, the more important play becomes. Given the complex math tasks you perform using Scikit-learn, playing becomes especially important. The following sections use the idea of playing with Scikit-learn to help you discover important concepts in using Scikit-learn to perform amazing feats of data science work.

Understanding classes in Scikit-learn

Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package appropriately. Scikit-learn is the package for machine learning and data science experimentation favored by most data scientists. It contains a wide range of well-established learning algorithms, error functions, and testing procedures.

At its core, Scikit-learn features some base classes on which all the algorithms are built. Apart from BaseEstimator, the class from which all other classes inherit, there are four class types covering all the basic machine-learning functionalities:

Classifying
Regressing
Grouping by clusters
Transforming data

Even though each base class has specific methods and attributes, the core functionalities for data processing and machine learning are guaranteed by one or more series of methods and attributes called interfaces. The interfaces provide a uniform Application Programming Interface (API) to enforce similarity of methods and attributes between all the different algorithms present in the package. There are four Scikit-learn object-based interfaces:

estimator: For fitting parameters, learning them from data, according to the algorithm
predictor: For generating predictions from the fitted parameters
transformer: For transforming data, implementing the fitted parameters
model: For reporting goodness of fit or other score measures

The package groups the algorithms built on base classes and one or more object interfaces into modules, each module displaying a specialization in a particular type of machine-learning solution. For example, the linear_model module is for linear modeling, and metrics is for score and loss measure.

In order to find a specific algorithm in Scikit-learn, you must first find the module containing the same kind of algorithm that interests you, and then select it from the list of contents of the module. The algorithm is typically a class itself, whose methods and attributes are already known because they’re common to other algorithms in Scikit-learn.

Getting accustomed to the Scikit-learn class approach may take some time. However, the API is the same for all the tools available in the package, so learning one class necessarily tells you about all the other classes. The best approach is to learn one class completely and then apply what you know to other classes.

Defining applications for data science

Figuring out ways to use data science to obtain constructive results is important. For example, you can apply the estimator interface to a

Classification problem: Guessing that a new observation is from a certain group
Regression problem: Guessing the value of a new observation

It works with the method fit(X,y) where X is the bidimensional array of predictors (the set of observations to learn) and y is the target outcome (another array, unidimensional).

By applying fit, the information in X is related to y, so that, knowing some new information with the same characteristics of X, we can guess correctly y. In the process, some parameters are estimated internally by the fit method. Using fit makes it possible to distinguish between parameters, which are learned, and hyper-parameters, which instead are fixed by you when you instantiate the learner.

Instantiation involves assigning a Scikit-learn class to a Python variable. In addition to hyper-parameters, you can also fix other working parameters, such as requiring normalization or setting a random seed to reproduce the same results for each call, given the same input data.

Here is an example with linear regression, a very basic and common machine-learning algorithm. You must upload some data to use this example, and Scikit-learn provides some useful examples. The Boston dataset, for instance, contains predictor variables that the example code can match against house prices, which helps build a predictor that can figure out the value of a house given some characteristics of it.

from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data,boston.target
print X.shape, y.shape

(506L, 13L) (506L,)

The output specifies that both arrays have the same row number and that the X has 13 features. The shape method performs the analysis on the arrays and reports their dimensions.

The rows of X have to be of the same length as y. You also have to take care that X and y correspond, because learning from data happens when the algorithm matches the rows of X with the corresponding element of y. If you shuffle the two arrays, no learning is possible.

The characteristics of X, expressed as X’s columns, are called variables (a more statistical term) or features (a term more related to machine learning).

Now, after importing the LinearRegression class, we can instantiate a variable called hypothesis and set a parameter indicating the algorithm to standardize (that is to set mean zero and unit standard deviation for all the variables, a statistical operation for having all the variables at a similar level) before estimating the parameters to learn.

from sklearn.linear_model import LinearRegression
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X,y)
print hypothesis.coef_

[ -1.07170557e-01 4.63952195e-02 2.08602395e-02
   2.68856140e+00 -1.77957587e+01 3.80475246e+00
   7.51061703e-04 -1.47575880e+00 3.05655038e-01
  -1.23293463e-02 -9.53463555e-01 9.39251272e-03
  -5.25466633e-01]

After fitting, hypothesis holds the learned parameters, and you can visualize them using the coef_ method, which is typical of all the linear models (where the model output is a summation of variables weighted by coefficients). You can also call this fitting activity training (as in, “training a machine learning algorithm”).

A hypothesis is a way to describe a learning algorithm trained with data. The hypothesis defines a possible representation of y given X that you test for validity. Therefore, it’s a hypothesis in both scientific and machine-learning language.

Apart from the estimator class, the predictor and the model object classes are also important. The predictor class, which predicts the probability of a certain result, obtains the result of new observations using the predict and predict_proba methods, as in this script:

import numpy as np
new_observation = np.array(
[1,0,1,0,0.5,7,59,6,3,200,20,350,4],
dtype=float)
print hypothesis.predict(new_observation)

25.8972783977

Make sure that new observations have the same feature number and order as in the training X; otherwise, the prediction will be incorrect.

The class model provides information about the quality of the fit using the score method, as shown here:

hypothesis.score(X,y)

0.74060774286494291

In this case, score returns the coefficient of determination R^2 of the prediction. R^2 is a measure ranging from 0 to 1, comparing our predictor to a simple mean. Higher values show that the predictor is working well. Different learning algorithms may use different scoring functions. Please consult the online documentation of each algorithm or ask for help on the Python console:

help(LinearRegression)

The transform class applies transformations derived from the fitting phase to other data arrays. LinearRegression doesn’t have a transform method, but most preprocessing algorithms do. For example, MinMaxScaler, from the Scikit-learn preprocessing module, can transform values in a specific range of minimum and maximum values, learning the transformation formula from an example array.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X)
print scaler.transform(new_observation)

[ 0.01116872 0. 0.01979472 0.
  0.23662551 0.65893849 0.57775489 0.44288845
  0.08695652 0.02480916 0.78723404 0.88173887
  0.06263797]

In this case, the code applies the min and max values learned from X to the new_observation variable, and returns a transformation.

Performing the Hashing Trick

Scikit-learn provides you with most of the data structures and functionality you need to complete your data science project. There are even classes for the trickiest and most advanced problems.

For instance, when dealing with text, one of the most useful solutions provided by the Scikit-learn package is the hashing trick. You discover how to work with text by using the bag of words model (as shown in the “Using the Bag of Words Model and Beyond” section of Chapter 7) and weighting them with the TF-IDF. All these powerful transformations can operate properly only if all your text is known and available in the memory of your computer.

A more serious data science challenge is to analyze online-generated text flows, such as from social networks or large online text repositories. This scenario poses quite a challenge when trying to turn the text into a data matrix suitable for analysis. When working through such problems, knowing the hashing trick can give you quite a few advantages:

Handling large data matrices based on text on the fly
Fixing unexpected values or variables in your textual data
Building scalable algorithms for large collections of documents

Using hash functions

Hash functions can transform any input into an output whose characteristics are predictable. Usually they return a value where the output is bound at a specific interval — whose extremities range from negative to positive numbers or just span through positive numbers. You can imagine them as enforcing a standard on your data — no matter what values you provide, they always return a specific data product.

Their most useful hash function characteristic is that, given a certain input, they always provide the same numeric output value. Consequently, they’re called deterministic functions. For example, input a word like dog and the hashing function always returns the same number.

In a certain sense, hash functions are like a secret code, transforming everything into numbers. Unlike secret codes, however, you can’t convert the hashed code to its original value. In addition, in some rare cases, different words generate the same hashed result (also called a hash collision).

Demonstrating the hashing trick

There are many hash functions, with MD5 (often used to check file integrity, because you can hash entire files) and SHA (used in cryptography) being the most popular. Python possesses a built-in hash function named hash that you can use to compare data objects before storing them in dictionaries. For instance, you can test how Python hashes its name:

hash('Python')
-539294296

The Python session on your computer may return a different value than the one shown on the preceding line. Don’t worry — the built-in hash functions aren’t always consistent across computers. When you need consistent output, rely on the Scikit-learn hash functions instead because the output is consistent across machines.

A Scikit-learn hash function can also return an index in a specific positive range. You can obtain something similar using a built-in hash by employing standard division and its remainder:

abs(hash('Python')) % 1000
296

When you ask for the remainder of the absolute number of the result from the hash function, you get a number that never exceeds the value you used for the division.

To see how this works, pretend that you want to transform a text string from the Internet into a numeric vector (a feature vector) so that you can use it for starting a machine-learning project. A good strategy for managing this data science task is to employ one-hot-encoding, which produces a bag of words. Here are the steps for one-hot-encoding a string (“Python for data science”) into a vector.

Assign a number to each word, for instance, Python=0 for=1 data=2 science=3.
Initialize the vector, counting the number of unique words that you assigned a code in Step 1.
Use the codes assigned in Step 1 as indexes for populating the vector with values, assigning a 1 where there is a coincidence with a word existing in the phrase.

The resulting feature vector is expressed as the sequence [1,1,1,1] and made of exactly four elements. You have started the machine-learning process, telling the program to expect sequences of four text features, when suddenly a new phrase arrives and you must vectorize the following text as well: “Python for machine learning”. Now you have two new words — “machine learning” — to work with. The following steps help you create the new vectors:

Assign these new codes: machine=4 learning=5.
Enlarge the previous vector to include the new words: [1,1,1,1,0,0].
Compute the vector for the new string: [1,1,0,0,1,1].

One-hot-encoding is quite optimal because it creates efficient and ordered feature vectors. Unfortunately, one-hot-encoding fails and becomes difficult to handle when your project experiences a lot of variability with regard to its inputs. This is a common situation in data science projects working with text or other symbolic features where flow from the Internet or other online environments can suddenly create or add to your initial data. Using hash functions is a smarter way to handle unpredictability in your inputs:

Define a range for the hash function outputs. All your feature vectors will use that range. The example uses a range of values from 0 to 24.
Compute an index for each word in your string using the hash function.
Assign a unit value to vector’s positions according to word indexes.

In Python, you can define a simple hashing trick by creating a function and checking the results using the two test strings:

def hashing_trick(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1
    return feature_vector

Now you can test both strings.

hashing_trick(input_string='Python for data science',
vector_size=20)
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0]
hashing_trick(input_string='Python for machine learning',
vector_size=20)
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,
0]

When viewing the feature vectors, you should notice that:

You don’t know where each word is located. When it’s important to be able to reverse the process of assigning words to indexes, you must store the relationship between words and their hashed value separately (for example, you can use a dictionary where the keys are the hashed values and the values are the words).
For small values of the vector_size function parameter (for example, vector_size=10), many words overlap in the same positions in the list representing the feature vector. To keep the overlap to a minimum, you must create hash function boundaries that are greater than the number of elements you plan to index later.

The feature vectors in this example are made mostly of zero entries, representing a waste of memory when compared to the more memory-efficient one-hot-encoding. One of the ways in which you can solve this problem is to rely on sparse matrices, as described in the next section.

Working with deterministic selection

Sparse matrices are the answer when dealing with data that has few values, that is, when most of the matrix values are zeroes. Sparse matrices store just the coordinates of the cells and their values, instead of storing the information for all the cells in the matrix. When an application requests data from an empty cell, the sparse matrix will return a zero value after looking for the coordinates and not finding them. Here’s an example vector:

[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0]

The following code turns it into a sparse matrix.

from scipy.sparse import csc_matrix
print csc_matrix([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 1, 0, 1, 0])

    (0, 0)     1
    (0, 5)     1
    (0, 16)    1
    (0, 18)    1

Notice that the data representation is in coordinates (expressed in a tuple of row and column index) and the cell value.

The package SciPy offers a large variety of sparse matrix structures — each one storing the data in a different way and each one performing in a different way. (Some are good with slicing; some others are better for computations.) Usually the csc_matrix (a compressed matrix based on rows) is a good choice because most Scikit-learn algorithms accept it as input and it’s optimal for matrix operations.

As a data scientist, you don’t have to worry about programming your own version of the hashing trick unless you would like some special implementation of the idea. Scikit-learn offers HashingVectorizer, a class that rapidly transforms any collection of text into a sparse data matrix using the hashing trick. Here’s an example script that replicates the previous example:

import sklearn.feature_extraction.text as txt
one_hot_enconder = txt.CountVectorizer()
one_hot_enconded = one_hot_enconder.fit_transform(
['Python for data science',
'Python for machine learning'])

<2x6 sparse matrix of type '<type 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>

As soon as new text arrives, CountVectorizer stops working:

one_hot_enconded.transform(['New text has arrived'])
AttributeError: transform not found

Using HashingVectorizer, there is always a place for new words in the data matrix. At worst, a word settles in an already occupied position, causing a word collision.

sklearn_hashing_trick = txt.HashingVectorizer(
     n_features=20, binary=True, norm=None)
text_vector = sklearn_hashing_trick.transform(
     ['Python for data science',
      'Python for machine learning'])
text_vector
<2x20 sparse matrix of type '<type 'numpy.float64'>'
with 8 stored elements in Compressed Sparse Row format>

sklearn_hashing_trick.transform(['New text has arrived'])
<1x20 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>

HashingVectorizer is the perfect function to use when your data can’t fit into memory and its features aren’t fixed. In the other cases, consider using the more intuitive CountVectorizer.

Considering Timing and Performance

As the book introduces more and more complex themes, such as Scikit-learn machine-learning classes and SciPy sparse matrices, you may start to wonder how all this processing might influence application speed. The increased processing requirements affect both running time and available memory.

Managing the best use of machine resources is indeed an art, the art of optimization, and it requires time to master. However, you can start immediately becoming proficient in it by doing some accurate speed measurement and realizing what your problems really are. Profiling the time that operations require, measuring how much memory adding more data takes, or performing a transformation on your data can help you to spot the bottlenecks in your code and start looking for alternative solutions.

As described in Chapter 11, IPython is the perfect environment for experimenting, tweaking, and improving your code. Working on blocks of code, recording the results and outputs, and writing additional notes and comments will help your data science solutions take shape in a controlled and reproducible way.

Benchmarking with timeit

While working through the hashing trick example in the “Performing the Hashing Trick” section, earlier in this chapter, we compare two alternatives for encoding textual information into a data matrix that can address different needs:

CountVectorizer: Optimally encodes text into a data matrix but cannot address subsequent novelties in text.
HashingVectorizer: Provides flexibility in situations when it is likely that the application will receive new data, but is less optimal than techniques based on hashing functions.

Although their advantages are quite clear in terms of how they handle the data, you may wonder what impact using one or the other has on your data processing in terms of speed and memory feasibility.

Concerning speed, IPython offers an easy, out-of-the-box solution, the line magic %timeit and the cell magic %%timeit:

%timeit: Calculates the best performance time for an instruction.
%%timeit: Calculates the best time performance for all the instructions in a cell, apart from the one placed on the same cell line as the cell magic (which could therefore be an initialization instruction).

Both magic commands report the best performance in r trials repeated for n loops. When you add the –r and –n parameters, IPython chooses the number automatically in order to provide a fast answer.

Here is an example of testing whether it is faster to assign a list 10**6 ordinal values by using list comprehension or by appending the values in a for loop:

%timeit l = [k for k in range(10**6)]

10 loops, best of 3: 94.8 ms per loop

The result for the list comprehension can be tested by incrementing both the sample performance and repetitions of the test:

%timeit –n 20 –r 5 l = [k for k in range(10**6)]

20 loops, best of 5: 95.6 ms per loop

Since the for loop requires an entire cell, the example uses the cell magic, %%timeit, call. Notice that the first line that assigns the value of 10**6 to a variable is not considered in the performance.

%%timeit limit = 10**6
l = list()
for k in range(limit):
l.append(k)

10 loops, best of 3: 176 ms per loop

The results show that list comprehension is about 50 percent faster than using a for loop. You can then repeat the test using different text encoding strategies:

import sklearn.feature_extraction.text as txt
sklearn_hashing_trick = txt.HashingVectorizer(
n_features=20, binary=True, norm=None)
enconder = txt.CountVectorizer()
texts = ['Python for data science',
'Python for machine learning']

After performing initial loading of the classes and instantiating them, you can test the two solutions:

%timeit enconded = enconder.fit_transform(texts)

1000 loops, best of 3: 1.27 ms per loop

%timeit hashing = sklearn_hashing_trick.transform(texts)

10000 loops, best of 3: 158 µs per loop

The hashing trick is faster than one hot encoder, and it’s possible to explain the difference by noting that the latter is an optimized algorithm that keeps track of how the words are encoded, something that the hashing trick doesn’t do.

IPython is the best environment to benchmark the speed of your data science solution code. If you’d like to track performance on the command line or in a script running from an IDE, you can import the timeit class and use the timeit function for tracking performance of the command by providing the input parameter as a string.

If your command needs variables, classes, or functions that aren’t available in the base Python (such as the Scikit-learn classes), you can provide them as a second input parameter. You formulate a string in which Python imports all the necessary objects from the main environment, as shown in the following example:

import timeit
cumulative_time = timeit.timeit(
     "hashing = sklearn_hashing_trick.transform(texts)",
     "from __main__ import sklearn_hashing_trick,texts",
     number=10000)
print cumulative_time / 10000.0

Using the preferred installer program (pip)

Python provides a huge number of packages that you can install. Many of these packages come as separate, downloadable modules. Some of them have an executable suitable for a platform such as Windows, which means you can easily install the package. However, many other packages rely on pip, which is a feature that you can access directly from the command line when using later versions of Python, including both 2.7.9 and 3.4.

When working with older versions of Python, you must first install pip by installing a package such as distribute (https://pypi.python.org/pypi/distribute). When working on some Linux or Mac systems, you can also rely on sudo to get the job done by typing sudo apt-get install python3-pip and pressing Enter. You may find that neither of these techniques works for you, so try the instructions found at https://pip.pypa.io/en/latest/installing.html as well.

To use pip, you open a command line or terminal. This book uses IPython as its environment. When you want to install a new feature, you type ipython to start a copy of IPython, –m to load a module, pip to start pip, install to tell pip what action to take, and the name of the package you want to install. For example, to install psutil later in the chapter, you type ipython –m pip install psutil and press Enter.

Working with the memory profiler

As you’ve seen when testing your application code for performance (speed) characteristics, you can obtain analogous information about memory usage. Keeping track of memory consumption could tell you about possible problems in the way data is processed or transmitted to the learning algorithms. The memory_profiler package implements the required functionality. This package is not provided as a default Python or IPython package and it requires installation. Use the following commands to install the package and its dependencies from the command line:

ipython –m pip install psutil
ipython –m pip install memory_profiler

Use the following command for each IPython session you want to monitor:

%load_ext memory_profiler

After performing these tasks, you can easily track how much memory a command consumes:

hashing = sklearn_hashing_trick.transform(texts)
%memit dense_hashing = hashing.toarray()
peak memory: 68.79 MiB, increment: 0.14 MiB

Obtaining a complete overview of memory consumption is possible by saving an IPython cell to disk and then profiling it using the line magic %mprun on an externally imported function. (The line magic works only by operating with external Python scripts.) Profiling produces a detailed report, command by command, as shown in the following example:

%%writefile example_code.py
import sklearn.feature_extraction.text as txt
def comparison_test():
    sklearn_hashing_trick = txt.HashingVectorizer(
        n_features=20, binary=True, norm=None)
    one_hot_enconder = txt.CountVectorizer()
    texts = ['Python for data science',
             'Python for machine learning']
    one_hot_enconded = one_hot_enconder.fit_transform(
        texts)
    hashing = sklearn_hashing_trick.transform(texts)

from example_code import comparison_test
%mprun -f comparison_test comparison_test()

Line  #  Mem usage  Increment  Line  Contents
========================================
      2   68.5 MiB    0.0 MiB  def   comparison_test():
      3   68.5 MiB    0.0 MiB        HashingVectorizer(…)
      4   68.5 MiB    0.0 MiB        CountVectorizer(…)
      5   68.5 MiB    0.0 MiB        texts = […]
      6   68.7 MiB    0.2 MiB        one_hot_enconder.fit_t(…)
      7   68.7 MiB    0.0 MiB        sklearn_hashing_trick.(…)

The resulting report details the memory usage from every line in the function, pointing out the major increments.

Running in Parallel

Most computers today are multicore (two or more processors in a single package), some with multiple physical CPUs. One of the most important limitations of Python is that it uses a single core by default. (It was created in a time when single cores were the norm.)

Data science projects require quite a lot of computations. In particular, a part of the scientific aspect of data science relies on repeated tests and experiments on different data matrices. Don’t forget that working with huge data quantities means that most time-consuming transformations repeat observation after observation (for example, identical and not related operations on different parts of a matrix).

Using more CPU cores accelerates a computation by a factor that almost matches the number of cores. For example, having four cores would mean working at best four times faster. You don’t receive a full fourfold increase because there is overhead when starting a parallel process — new running Python instances have to be set up with the right in-memory information and launched; consequently, the improvement will be less than potentially achievable but still significant. Knowing how to use more than one CPU is therefore an advanced but incredibly useful skill for increasing the number of analyses completed, and for speeding up your operations both when setting up and when using your data products.

Multiprocessing works by replicating the same code and memory content in various new Python instances (the workers), calculating the result for each of them, and returning the pooled results to the main original console. If your original instance already occupies much of the available RAM memory, it won’t be possible to create new instances, and your machine may run out of memory.

Performing multicore parallelism

To perform multicore parallelism with Python, you integrate the Scikit-learn package with the joblib package for time-consuming operations, such as replicating models for validating results or for looking for the best hyper-parameters. In particular, Scikit-learn allows multiprocessing when

Cross-validating: Testing the results of a machine-learning hypothesis using different training and testing data
Grid-searching: Systematically changing the hyper-parameters of a machine-learning hypothesis and testing the consequent results
Multilabel prediction: Running an algorithm multiple times against multiple targets when there are many different target outcomes to predict at the same time
Ensemble machine-learning methods: Modeling a large host of classifiers, each one independent from the other, such as when using RandomForest-based modeling

You don’t have to do anything special to take advantage of parallel computations — you can activate parallelism by setting the n_jobs parameter to a number of cores more than 1 or by setting the value to –1, which means you want to use all the available CPU instances.

If you aren’t running your code from the console or from an IPython Notebook, it is extremely important that you separate your code from any package import or global variable assignment in your script by using the if __name__=='__main__': command at the beginning of any code that executes multicore parallelism. The if statement checks whether the program is directly run or is called by an already-running Python console, avoiding any confusion or error by the multiparallel process (such as recursively calling the parallelism).

Demonstrating multiprocessing

It’s a good idea to use IPython when you run a demonstration of how multiprocessing can really save you time during data science projects. Using IPython provides the advantage of using the %timeit magic command for timing execution. You start by loading a multiclass dataset, a complex machine-learning algorithm (the Support Vector Classifier, or SVC), and a cross-validation procedure for estimating reliable resulting scores from all the procedures. You find details about all these tools later in the book. The most important thing to know is that the procedures become quite large because the SVC produces 10 models, which it repeats 10 times each using cross-validation, for a total of 100 models.

from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data,digits.target
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score
%timeit single_core_learning = cross_val_score(SVC(), X,
y, cv=20, n_jobs=1)

Out [1] : 1 loops, best of 3: 17.9 s per loop

After this test, you need to activate the multicore parallelism and time the results using the following commands:

%timeit multi_core_learning = cross_val_score(SVC(), X, y,
cv=20, n_jobs=-1)
Out [2] : 1 loops, best of 3: 11.7 s per loop

The example machine demonstrates a positive advantage using multicore processing, despite using a small dataset where Python spends most of the time starting consoles and running a part of the code in each one. This overhead, a few seconds, is still significant given that the total execution extends for a handful of seconds. Just imagine what would happen if you worked with larger sets of data — your execution time could be easily cut by two or three times.

Although the code works fine with IPython, putting it down in a script and asking Python to run it in a console or using an IDE may cause errors because of the internal operations of a multicore task. The solution, as mentioned before, is to put all the code under an if statement, which checks whether the program started directly and wasn’t called afterward. Here’s an example script:

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score
if __name__ == '__main__':
       digits = load_digits()
       X, y = digits.data,digits.target
       multi_core_learning = cross_val_score(SVC(), X, y,
           cv=20, n_jobs=-1)

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 12: Stretching Python’s Capabilities

Create new playlist

Sign In

Sign Up