Parallel grid search

The computational cost of a grid search grows exponentially with the number of parameters we want to tune, since each additional parameter multiplies the number of combinations by the number of values it can take. We could reduce the response time if we computed each combination in parallel instead of sequentially, as we have done so far. In our previous example, we had four different values for gamma and three different values for C, which gives 12 parameter combinations. Additionally, each combination had to be trained and evaluated three times (in a three-fold cross-validation), adding up to 36 trainings and evaluations. Since these tasks are independent, we could run all 36 of them in parallel.
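Just to make the task count explicit, the arithmetic can be reproduced in a couple of lines (the figures below simply restate the grid of the previous section):

>>> # 4 gamma values x 3 C values = 12 parameter combinations,
>>> # each one trained and evaluated on 3 folds
>>> n_combinations = 4 * 3
>>> n_folds = 3
>>> print "Independent tasks:", n_combinations * n_folds
Independent tasks: 36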

Most modern computers have multiple cores that can be used to run tasks in parallel. We also have a very useful tool within IPython, called IPython parallel, that allows us to run independent tasks in parallel, each task on a different core of our machine. Let's do that with our text classifier example.

We will first declare a function that persists each of the K folds of the cross-validation to a different file. These files will later be loaded by the process that evaluates the corresponding fold. To do that, we will use the joblib library.

>>> from sklearn.externals import joblib
>>> from sklearn.cross_validation import KFold
>>> import os
>>> def persist_cv_splits(X, y, K=3, name='data',
                          suffix="_cv_%03d.pkl"):
>>>     """Dump the K folds to the filesystem."""
>>> 
>>>     cv_split_filenames = []
>>> 
>>>     # create the K-fold cross-validation iterator
>>>     cv = KFold(len(X), K, shuffle=True, random_state=0)
>>> 
>>>     # iterate over the K folds
>>>     for i, (train, test) in enumerate(cv):
>>>         # build and dump the training/testing partitions of this fold
>>>         cv_fold = ([X[k] for k in train], y[train],
                       [X[k] for k in test], y[test])
>>>         cv_split_filename = name + suffix % i
>>>         cv_split_filename = os.path.abspath(cv_split_filename)
>>>         joblib.dump(cv_fold, cv_split_filename)
>>>         cv_split_filenames.append(cv_split_filename)
>>> 
>>>     return cv_split_filenames
>>> cv_filenames = persist_cv_splits(X_train, y_train, name='news')
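A quick look at the returned list shows where each fold has been persisted (the exact directory will depend on where you run the notebook):

>>> # one absolute path per fold, following the name + suffix pattern,
>>> # for example .../news_cv_000.pkl, .../news_cv_001.pkl, .../news_cv_002.pkl
>>> print cv_filenames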

The following function loads a particular fold and fits the classifier with the specified parameter set, returning the testing score. This function will be called by each of the parallel tasks.

>>> def compute_evaluation(cv_split_filename, clf, params):
>>> 
>>>     # all module imports should be executed in the worker namespace
>>>     from sklearn.externals import joblib
>>> 
>>>     # load the fold's training and testing partitions from the filesystem
>>>     X_train, y_train, X_test, y_test = joblib.load(
>>>         cv_split_filename, mmap_mode='c')
>>> 
>>>     clf.set_params(**params)
>>>     clf.fit(X_train, y_train)
>>>     test_score = clf.score(X_test, y_test)
>>>     return test_score
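
Before submitting anything to the engines, the function can be tested locally on a single fold. Here, clf is assumed to be the pipeline built in the previous section, and the parameter values are only an example; any combination from the grid would do:

>>> # local sanity check on the first fold (example parameter values)
>>> compute_evaluation(cv_filenames[0], clf,
        {'svc__gamma': 0.1, 'svc__C': 10.0})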

Finally, the following function executes the grid search in parallel tasks. For each parameter combination (returned by the IterGrid iterator), it iterates over K folds and creates a task to compute the evaluation. It returns the parameter combinations alongside the tasks list.

>>> from sklearn.grid_search import IterGrid
>>> 
>>> def parallel_grid_search(lb_view, clf, cv_split_filenames, param_grid):
>>>     all_tasks = []
>>>     all_parameters = list(IterGrid(param_grid))
>>> 
>>>     # iterate over parameter combinations
>>>     for i, params in enumerate(all_parameters):
>>>         task_for_params = []
>>>         # iterate over the K folds
>>>         for j, cv_split_filename in enumerate(cv_split_filenames):
>>>             t = lb_view.apply(
>>>                 compute_evaluation, cv_split_filename, clf, params)
>>>             task_for_params.append(t)
>>> 
>>>         all_tasks.append(task_for_params)
>>> 
>>>     return all_parameters, all_tasks

Now we use IPython parallel to get the client and a load balanced view. We must first create a local cluster of N engines (one for each core of your machine) using the Cluster tab in the IPython Notebook. Then we create the client and the view and execute our parallel_grid_search function.

>>> from sklearn.svm import SVC
>>> from IPython.parallel import Client
>>>
>>> client = Client()
>>> lb_view = client.load_balanced_view()
>>>
>>> all_parameters, all_tasks = parallel_grid_search(
    lb_view, clf, cv_filenames, parameters)
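
If the tasks do not seem to start, it is worth checking that the engines are actually running; the ids attribute of the client lists the identifiers of the connected engines, so its length tells you how many workers are available:

>>> # number of engines connected to the cluster
>>> len(client.ids)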

IPython parallel will start running the tasks in parallel across the engines. We can use the following function to monitor the progress of the whole task group.

>>> import numpy as np
>>> def print_progress(tasks):
>>>     # fraction of tasks that have already finished
>>>     progress = np.mean([task.ready() for task_group in tasks
                            for task in task_group])
>>>     print "Tasks completed: {0}%".format(100 * progress)

We can call this function at any time to check how the computation is going; once all the tasks have finished, it will report 100 percent completion:

>>> print_progress(all_tasks)
Tasks completed: 100.0%

Finally, we can define a function that computes the mean score of the completed tasks for each parameter combination and returns the best-performing combinations.

>>> def find_bests(all_parameters, all_tasks, n_top=5):
>>>     """Compute the mean score of the completed tasks"""
>>>     mean_scores = []
>>> 
>>>     for param, task_group in zip(all_parameters, all_tasks):
>>>         scores = [t.get() for t in task_group if t.ready()]
>>>         if len(scores) == 0:
>>>             continue
>>>         mean_scores.append((np.mean(scores), param))
>>> 
>>>     return sorted(mean_scores, reverse=True)[:n_top]
>>> print find_bests(all_parameters, all_tasks)

[(0.81733333333333336, {'svc__gamma': 0.10000000000000001, 'svc__C': 10.0}), (0.78733333333333333, {'svc__gamma': 1.0, 'svc__C': 10.0}), (0.76000000000000012, {'svc__gamma': 1.0, 'svc__C': 1.0}), (0.30099999999999999, {'svc__gamma': 0.01, 'svc__C': 10.0}), (0.19933333333333333, {'svc__gamma': 0.10000000000000001, 'svc__C': 1.0})]

You can observe that we obtained the same results as in the previous section, but in roughly half the time if you used two cores, or about a quarter of the time if you used four cores (the speedup will not be exactly linear because of the overhead of dispatching the tasks and loading the folds on each engine).
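
If you want to measure the speedup on your own machine, a simple sketch is to time the parallel run, blocking on every task before reading the clock (the wait method of the task objects blocks until the corresponding result is ready):

>>> import time
>>> start = time.time()
>>> all_parameters, all_tasks = parallel_grid_search(
        lb_view, clf, cv_filenames, parameters)
>>> # block until every task in every group has finished
>>> for task_group in all_tasks:
>>>     for task in task_group:
>>>         task.wait()
>>> print "Wall time: {0:.1f} seconds".format(time.time() - start)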
