Jug is a generic framework, but it's ideally suited for medium-scale data analysis. As you develop your analysis pipeline, it's good to have intermediate results automatically saved. If you have already computed the preprocessing step before and are only changing the features you compute, you do not want to recompute the preprocessing step. If you have already computed the features but want to try combining a few new ones into the mix, you also do not want to recompute all your other features.
Jug is also specifically optimized to work with NumPy arrays. Whenever your tasks return or receive NumPy arrays, you automatically take advantage of this optimization. Jug is thus one more piece of the Python numeric ecosystem where everything works together.
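To see what this buys you in practice, consider the following minimal jugfile sketch (the file name, function names, and toy computation are made up for illustration). Each call to a decorated function becomes a task, and every task result, NumPy arrays included, is stored on disk, so a second jug run reuses the stored results instead of recomputing them:

# toy_jugfile.py -- a made-up sketch of how jug memoizes task results
from jug import TaskGenerator
import numpy as np

@TaskGenerator
def preprocess(i):
    # stand-in for an expensive preprocessing step; after the first
    # `jug execute` run, its result is saved on disk by jug
    return np.arange(i * 1000)

@TaskGenerator
def summarize(data):
    # stand-in for a feature-computation step; invalidating or adding
    # tasks here does not force the cached preprocess() results to be redone
    return data.mean()

summaries = [summarize(preprocess(i)) for i in range(1, 11)]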
We will now look back at Chapter 12, Computer Vision. In that chapter, we learned how to compute features on images. Remember that the basic pipeline consisted of the following steps:
- Loading image files
- Computing features
- Combining these features
- Normalizing the features
- Creating a classifier
We are going to redo this exercise, but this time using jug. The advantage of this version is that it is now possible to add a new feature or classifier without having to recompute the whole pipeline:
- We start with a few imports as follows:
from jug import TaskGenerator
import mahotas as mh
from glob import glob
- Now, we define the first task generators and feature computation functions:
@TaskGenerator
def compute_texture(im):
    from features import texture
    imc = mh.imread(im)
    return texture(mh.colors.rgb2gray(imc))

@TaskGenerator
def chist_file(fname):
    from features import chist
    im = mh.imread(fname)
    return chist(im)
The features module we import is the one from Chapter 12, Computer Vision.
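As a reminder, texture computes Haralick texture features on a grey-scale image and chist computes a log-transformed joint color histogram. The following is only a sketch of the features.py interface assumed here; refer to Chapter 12, Computer Vision for the full discussion:

# features.py -- sketch of the interface used above (see Chapter 12)
import numpy as np
import mahotas as mh

def texture(im):
    '''Haralick texture features of a grey-scale image, flattened to 1D'''
    im = im.astype(np.uint8)
    return mh.features.haralick(im).ravel()

def chist(im):
    '''Joint color histogram over 64 bins (4 levels per RGB channel), log-scaled'''
    im = im // 64                       # quantize each channel to 4 levels (0..3)
    r, g, b = im.transpose((2, 0, 1))   # split into channels
    pixels = 1 * r + 4 * g + 16 * b     # combine channels into a value in 0..63
    hist = np.bincount(pixels.ravel(), minlength=64)
    return np.log1p(hist.astype(float)) # log transform keeps the scale manageable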
- We can use TaskGenerator on any function. This is true even for functions we did not write, such as np.array and np.hstack, as the following code shows:
import numpy as np

to_array = TaskGenerator(np.array)
hstack = TaskGenerator(np.hstack)

haralicks = []
chists = []
labels = []

# Change this variable to point to
# the location of the dataset on disk
basedir = '../SimpleImageDataset/'

# Use glob to get all the images
images = glob('{}/*.jpg'.format(basedir))

for fname in sorted(images):
    haralicks.append(compute_texture(fname))
    chists.append(chist_file(fname))
    # The class is encoded in the filename as xxxx00.jpg
    labels.append(fname[:-len('00.jpg')])

haralicks = to_array(haralicks)
chists = to_array(chists)
labels = to_array(labels)
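Note that none of these calls computes any features when the script is read. Calling a function decorated with TaskGenerator returns a jug Task object, a placeholder that jug execute will run later; that is also why we wrapped np.array and np.hstack, since the lists above contain Task objects rather than arrays. The following two lines are only an illustration of this and are not part of the pipeline:

# Calling a decorated function builds a task; nothing is computed yet
t = compute_texture(images[0])
print(type(t))  # a jug Task object, not a NumPy array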
- One small inconvenience of using jug is that anything we want to run, including writing the final results to a file, must be wrapped in a task function, as we will do with print_results below. This is a small price to pay for the extra convenience of using jug. The next task trains a classifier and returns its cross-validated accuracy:
@TaskGenerator
def accuracy(features, labels):
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn import model_selection

    clf = Pipeline([('preproc', StandardScaler()),
                    ('classifier', LogisticRegression())])
    cv = model_selection.LeaveOneOut()
    scores = model_selection.cross_val_score(
        clf, features, labels, cv=cv)
    return scores.mean()
- Note that we import sklearn only inside this function. This is a small optimization: that way, sklearn is loaded only when it is really needed. We now compute the accuracy for each feature set on its own, as well as for their combination:
scores_base = accuracy(haralicks, labels)
scores_chist = accuracy(chists, labels)

combined = hstack([chists, haralicks])
scores_combined = accuracy(combined, labels)
- Finally, we write and call a function to print out all the results. It expects its argument to be a list of pairs containing the name of a feature set and the corresponding result:
@TaskGenerator
def print_results(scores):
    with open('results.image.txt', 'w') as output:
        for k, v in scores:
            output.write('Accuracy [{}]: {:.1%}\n'.format(
                k, v.mean()))

print_results([
    ('base', scores_base),
    ('chists', scores_chist),
    ('combined', scores_combined),
])
- That's it. Now, in the shell, run the following command to execute this pipeline with jug:
$ jug execute image-classification.py
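Because jug keeps track of which tasks are finished, you can inspect progress or add more workers without changing any code. For example (both are standard jug subcommands; running several execute processes at once works because they coordinate through jug's on-disk store and its locks):

$ jug status image-classification.py     # show finished, running, and waiting tasks
$ jug execute image-classification.py &  # launch extra worker processes in
$ jug execute image-classification.py &  # parallel; they share the remaining work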