Using jug for data analysis

Jug is a generic framework, but it's ideally suited for medium-scale data analysis. As you develop your analysis pipeline, it's good to have intermediate results automatically saved. If you have already computed the preprocessing step before and are only changing the features you compute, you do not want to recompute the preprocessing step. If you have already computed the features but want to try combining a few new ones into the mix, you also do not want to recompute all your other features.

Jug is also specifically optimized to work with NumPy arrays. Whenever your tasks return or receive NumPy arrays, you are taking advantage of this optimization. In this way, jug is another piece of the NumPy-based Python ecosystem in which everything works together.
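To make this concrete, here is a minimal, self-contained sketch of how jug is used (the function names and values are illustrative only, not part of the pipeline below): ordinary functions are decorated with TaskGenerator, calling them builds tasks, and jug execute runs and memoizes them:

from jug import TaskGenerator
import numpy as np

@TaskGenerator
def simulate(n):
    # Returns a NumPy array; jug saves it efficiently in its backend
    return np.random.default_rng(0).normal(size=n)

@TaskGenerator
def summarize(data):
    return float(data.mean())

result = summarize(simulate(10000))
# Running `jug execute` on this file computes both tasks once; running it
# again reuses the saved results instead of recomputing them.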

We will now look back at Chapter 12, Computer Vision. In that chapter, we learned how to compute features on images. Remember that the basic pipeline consisted of the following steps:

  • Loading image files
  • Computing features
  • Combining these features
  • Normalizing the features
  • Creating a classifier

We are going to redo this exercise, but this time with the use of jug. The advantage of this version is that it's now possible to add a new feature or classifier without having to recompute the entire pipeline:

  1. We start with a few imports as follows:
from jug import TaskGenerator
import mahotas as mh
from glob import glob
  2. Now, we define the first task generators and feature computation functions:
@TaskGenerator 
def compute_texture(im): 
    from features import texture 
    imc = mh.imread(im) 
    return texture(mh.colors.rgb2gray(imc)) 
 
@TaskGenerator 
def chist_file(fname): 
    from features import chist 
    im = mh.imread(fname) 
    return chist(im) 

The features module we import is the one from Chapter 12, Computer Vision.

We write functions that take the filename as input instead of the image array. Using the full images would also work, of course, but passing filenames is a small optimization. A filename is a short string, so it is cheap to write to the backend and fast to hash when needed. It also ensures that the images are only loaded by the processes that need them.
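For contrast, a task that takes the already-loaded image array would look like the following sketch (not used in the pipeline): jug would then have to hash the full array contents for every task instead of a short filename string:

@TaskGenerator
def compute_texture_from_array(imc):
    # Same computation as compute_texture, but the large image array
    # is now the task input rather than the filename
    from features import texture
    return texture(mh.colors.rgb2gray(imc))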
  3. We can use TaskGenerator on any function. This is true even for functions we did not write, such as np.array or np.hstack, as shown in the following code:
import numpy as np 
to_array = TaskGenerator(np.array) 
hstack = TaskGenerator(np.hstack) 
haralicks = [] 
chists = [] 
labels = [] 
 
# Change this variable to point to 
# the location of the dataset on disk 
basedir = '../SimpleImageDataset/' 
# Use glob to get all the images 
images = glob('{}/*.jpg'.format(basedir)) 
for fname in sorted(images): 
    haralicks.append(compute_texture(fname)) 
    chists.append(chist_file(fname)) 
    # The class is encoded in the filename as xxxx00.jpg 
    labels.append(fname[:-len('00.jpg')]) 
 
haralicks = to_array(haralicks) 
chists = to_array(chists) 
labels = to_array(labels) 
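Note that nothing has been computed yet: haralicks, chists, and labels are jug Task objects that merely record the work to be done. Once the pipeline has been run with jug execute (shown at the end of this section), the stored results can be retrieved, for example from within jug shell, using jug's value helper, roughly as follows:

from jug import value

# value() resolves a Task (or a structure containing Tasks) to its result
features = value(haralicks)
print(features.shape)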
  4. One small inconvenience of using jug is that we must always wrap our computations in functions so that their results can be saved to the backend, as in the preceding examples. This is a small price to pay for the extra convenience of using jug:
@TaskGenerator 
def accuracy(features, labels): 
    from sklearn.linear_model import LogisticRegression 
    from sklearn.pipeline import Pipeline 
    from sklearn.preprocessing import StandardScaler 
    from sklearn import model_selection 
 
    clf = Pipeline([('preproc', StandardScaler()),
                    ('classifier', LogisticRegression())])
    cv = model_selection.LeaveOneOut() 
    scores = model_selection.cross_val_score( 
        clf, features, labels, cv=cv) 
    return scores.mean() 
  5. Note that we only import sklearn inside this function. This is a small optimization: sklearn is then only imported when it's really needed:
scores_base = accuracy(haralicks, labels) 
scores_chist = accuracy(chists, labels) 
combined = hstack([chists, haralicks]) 
scores_combined = accuracy(combined, labels) 
  6. Finally, we write and call a function to print out all results. It expects its argument to be a list of pairs with the name of the algorithm and the results:
@TaskGenerator 
def print_results(scores): 
    with open('results.image.txt', 'w') as output: 
        for k,v in scores: 
        output.write('Accuracy [{}]: {:.1%}\n'.format( 
                k, v.mean())) 
 
print_results([ 
        ('base', scores_base), 
        ('chists', scores_chist), 
        ('combined', scores_combined), 
        ]) 
  7. That's it. Now, on the shell, run the following command to execute this pipeline with jug:
$ jug execute image-classification.py
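Since all coordination happens through jug's on-disk backend, you can also check on progress and, if you like, start several workers on the same jugfile at the same time; for example, something along these lines:

$ jug status image-classification.py
$ jug execute image-classification.py &
$ jug execute image-classification.py &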