Preprocessing and cleaning the data

It is more convenient to split the two dimensions into two vectors, each holding 743 data points. The first vector, x, will contain the hours, and the other, y, will contain the web hits in that particular hour. This splitting is done using NumPy's index notation, by which we can select the columns individually:

x = data[:,0]
y = data[:,1]

There are many more ways in which data can be selected from a NumPy array. Check out https://docs.scipy.org/doc/numpy/user/quickstart.html for more details on indexing, slicing, and iterating.
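As a minimal sketch of this column selection, the following uses a tiny made-up array in place of the real data. The name toy and its values are illustrative assumptions, not taken from the web traffic file:

```python
import numpy as np

# Hypothetical stand-in for `data`: column 0 = hour, column 1 = hits
toy = np.array([[1.0, 2272.0],
                [2.0, np.nan],
                [3.0, 1386.0]])

hours = toy[:, 0]  # every row, first column
hits = toy[:, 1]   # every row, second column

print(hours.shape, hits.shape)
```

Each slice is a one-dimensional array with one entry per row of the original two-dimensional array.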

One caveat is that y still contains some invalid values, such as nan. The question is what we can do with them. Let's check how many hours contain invalid data by running the following code:

>>> np.sum(np.isnan(y)) 
8

As you can see, we are missing only 8 out of 743 entries, so we can afford to remove them. Remember that we can index a NumPy array with another array. The np.isnan(y) expression returns an array of Booleans indicating, for each entry, whether it is nan (not a number). Using ~, we logically negate that array so that we choose only those elements from x and y where y contains valid numbers:

>>> x = x[~np.isnan(y)]
>>> y = y[~np.isnan(y)]
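Note that x has to be filtered before y is overwritten, because the mask is computed from y. A minimal sketch of the same idea, which computes the mask once and reuses it, might look as follows. The toy vectors and the names valid, x_clean, and y_clean are assumptions made for illustration:

```python
import numpy as np

# Hypothetical toy vectors standing in for x (hours) and y (hits)
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([10.0, np.nan, 30.0, np.nan])

valid = ~np.isnan(y_toy)  # True where y_toy holds a real number
x_clean = x_toy[valid]    # keep only the hours with valid hit counts
y_clean = y_toy[valid]    # keep only the valid hit counts

print(valid)    # [ True False  True False]
print(x_clean)  # [1. 3.]
print(y_clean)  # [10. 30.]
```

Storing the mask in its own variable makes the order of the two assignments irrelevant.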

To get a first impression of our data, let's plot it in a scatter plot using matplotlib. Matplotlib contains the pyplot module, which tries to mimic MATLAB's interface and is convenient and easy to use, as you can see in the following code:

import numpy as np
import matplotlib.pyplot as plt

def plot_web_traffic(x, y, models=None): 
    '''
    Plot the web traffic (y) over time (x).
    If models is given, it is expected to be a list of fitted models, 
    which will be plotted as well (used later in this chapter).
    '''
    plt.figure(figsize=(12,6)) # width and height of the plot in inches 
    plt.scatter(x, y, s=10)
    plt.title("Web traffic over the last month")

    plt.xlabel("Time") 
    plt.ylabel("Hits/hour") 
    plt.xticks([w*7*24 for w in range(5)],
               ['week %i' %(w+1) for w in range(5)])
    if models:
        colors = ['g', 'k', 'b', 'm', 'r'] 
        linestyles = ['-', '-.', '--', ':', '-']

        mx = np.linspace(0, x[-1], 1000)
        for model, style, color in zip(models, linestyles, colors): 
            plt.plot(mx, model(mx), linestyle=style, linewidth=2, c=color)

        plt.legend(["d=%i" % m.order for m in models], loc="upper left") 
    plt.autoscale(tight=True)
    plt.grid()

The main command here is plt.scatter(x, y, s=10), which plots the web traffic in y over the individual hours in x. With s=10, we set the marker size (the marker area, in points squared). Then we dress up the chart a bit (title, axis labels, weekly tick marks, grid, and so on), and finally we provide the option to overlay fitted models on it.
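The weekly tick marks passed to plt.xticks are built by converting week indices into hours. As a small sketch of just that calculation (the names tick_positions and tick_labels are introduced here for illustration only):

```python
# One tick at the start of each week: week index * 7 days * 24 hours
tick_positions = [w * 7 * 24 for w in range(5)]
tick_labels = ['week %i' % (w + 1) for w in range(5)]

print(tick_positions)  # [0, 168, 336, 504, 672]
print(tick_labels)     # ['week 1', 'week 2', 'week 3', 'week 4', 'week 5']
```

Since the data covers 743 hours, the last tick at hour 672 marks the start of the partial fifth week.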

You can find more tutorials on plotting at http://matplotlib.org/users/pyplot_tutorial.html.

You can run this function with the following:

>>> plot_web_traffic(x, y)

If you are in a Jupyter notebook session, first run the following command in one of the notebook's cells:

>>> %matplotlib inline

Jupyter will then automatically show the generated graphs inline when you call the function:

>>> plot_web_traffic(x, y)

If you are in a normal command shell, you will have to save the graph to disk and then display it later with an image viewer:

>>> plt.savefig("web_traffic.png")
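On a machine without a display, saving may also require a non-interactive backend. The following is a minimal sketch under that assumption. It uses matplotlib's Agg backend, a few made-up points instead of the real traffic data, and a hypothetical output file name in the system's temporary directory:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))
plt.scatter([1, 2, 3], [10, 20, 15], s=10)  # made-up points for illustration

# Hypothetical output location; any writable path would do
out_path = os.path.join(tempfile.gettempdir(), "web_traffic_demo.png")
fig.savefig(out_path)
plt.close(fig)  # release the figure's memory once it is saved
```

The backend is selected before pyplot is imported so that no interactive window is ever requested.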

In the resulting chart, we can see that while the traffic stayed more or less the same during the first weeks, the last week shows a steep increase.
