Regression model evaluation

We learned how to quantify the error in classification; now we'll discuss quantifying the error for continuous problems. For example, we might be trying to predict an age rather than a gender.

Getting ready

As with classification, we'll fake some data, then plot the results. We'll start simple, then build up the complexity. The data will be a simulated linear model:

m = 2
b = 1

y = lambda x: m*x+b

Also, let's get our modules loaded:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn import metrics

How to do it...

We will be performing the following actions:

  1. Use the linear model to generate y_true.
  2. Add some error to y_true to generate y_pred.
  3. Plot the differences.
  4. Walk through various metrics and plot some of them.

Let's take care of steps 1 and 2 at the same time and just have a function do the work for us. This will be almost the same thing we just saw, but we'll add the ability to specify an error (or bias if a constant):

>>> def data(x, m=2, b=1, e=None, s=10):
...     """
...     Args:
...         x: The x values
...         m: Slope
...         b: Intercept
...         e: Error; if True, add random normal error,
...            if a constant, add it as a bias
...         s: Standard deviation of the random error
...     """
...     if e is None:
...         e_i = 0
...     elif e is True:
...         e_i = np.random.normal(0, s, len(x))
...     else:
...         e_i = e
...     return x * m + b + e_i
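To sanity-check the error handling, here is the function again alongside a couple of calls (the sample array is purely illustrative):

```python
import numpy as np

def data(x, m=2, b=1, e=None, s=10):
    """Simulate y = m*x + b with optional error e."""
    if e is None:
        e_i = 0                               # no error: points fall exactly on the line
    elif e is True:
        e_i = np.random.normal(0, s, len(x))  # Gaussian noise with standard deviation s
    else:
        e_i = e                               # constant bias
    return x * m + b + e_i

x = np.array([0.0, 1.0, 2.0])
print(data(x))        # exact line: [1. 3. 5.]
print(data(x, e=5))   # constant bias of 5: [6. 8. 10.]
```

With e=True, the output is the same line plus zero-mean noise.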

Now that we have the function, let's define y_pred and y_true. We'll do it in a convenient way using partial:

>>> from functools import partial

>>> N = 100
>>> xs = np.sort(np.random.rand(N)*100)

>>> y_pred_gen = partial(data, x=xs, e=True)
>>> y_true_gen = partial(data, x=xs)

>>> y_pred = y_pred_gen()
>>> y_true = y_true_gen()


>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.set_title("Plotting the fit vs the underlying process.")
>>> ax.scatter(xs, y_pred, label=r'$\hat{y}$')
>>> ax.plot(xs, y_true, label=r'$y$')

>>> ax.legend(loc='best')

The output for this code is as follows:

[Figure: "Plotting the fit vs the underlying process." -- the noisy y_pred points scattered around the y_true line]

To confirm the output, let's look at the classical residuals:

>>> e_hat = y_pred - y_true

>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.set_title("Residuals")
>>> ax.hist(e_hat, color='r', alpha=.5, histtype='stepfilled')

The output for the residuals is as follows:

[Figure: "Residuals" -- histogram of the residuals]

The residuals are centered around zero, which is what we'd expect.

How it works...

Now let's move to the metrics.

The first metric is the mean squared error:

MSE(y_true, y_pred) = mean((y_true - y_pred)^2)

You can use the following code to find the value of the mean squared error:

>>> metrics.mean_squared_error(y_true, y_pred)

93.342352628475368

Notice that this metric penalizes large errors more heavily than small ones, since the errors are squared. It's important to remember that all we're doing here is applying what was probably the cost function for the model on the test data.
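Under the hood, the mean squared error is just the average of the squared residuals; here's a quick sketch verifying that against the library (the small arrays are purely illustrative):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.0])

# MSE is the mean of the squared residuals
manual_mse = np.mean((y_true - y_pred) ** 2)
library_mse = metrics.mean_squared_error(y_true, y_pred)

print(manual_mse, library_mse)  # both 0.4375
```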

Another option is the mean absolute deviation. We need to take the absolute value of the differences; if we don't, positive and negative errors cancel and our value will probably be fairly close to zero, the mean of the error distribution:

MAD(y_true, y_pred) = mean(|y_true - y_pred|)
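scikit-learn exposes this metric as metrics.mean_absolute_error. A short sketch showing why the absolute value matters (again with illustrative arrays):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.0])

# Without the absolute value, positive and negative residuals cancel out
signed_mean = np.mean(y_true - y_pred)         # 0.125, misleadingly close to zero
# Taking the absolute value keeps every error positive
manual_mad = np.mean(np.abs(y_true - y_pred))  # 0.625
library_mad = metrics.mean_absolute_error(y_true, y_pred)
```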

The final option is R2 (the coefficient of determination): 1 minus the ratio of the model's squared error to the squared error of the overall mean. As that ratio tends to 0, R2 tends to 1:

>>> metrics.r2_score(y_true, y_pred)

0.9729312117010761

R2 can be deceptive; on its own, it doesn't give the clearest sense of the accuracy of the model.
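The ratio in question can be written out directly, which makes it easy to check metrics.r2_score by hand (the arrays are illustrative):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.0])

# R2 = 1 - (squared error of the model) / (squared error of the mean)
ss_res = np.sum((y_true - y_pred) ** 2)           # 1.75
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # 5.0
manual_r2 = 1 - ss_res / ss_tot                   # 0.65
library_r2 = metrics.r2_score(y_true, y_pred)
```

A model that simply predicts the mean of y_true gets an R2 of 0, which is why a high R2 alone doesn't guarantee a good fit on new data.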
