Regression model evaluation

We learned how to quantify the error in classification; now we'll discuss quantifying the error for continuous problems. For example, we might be trying to predict an age rather than a gender.

Getting ready

As with classification, we'll fake some data, then plot the results. We'll start simple, then build up the complexity. The data will be a simulated linear model:

m = 2
b = 1

y = lambda x: m*x+b

Also, let's get our modules loaded:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn import metrics

How to do it...

We will be performing the following actions:

  1. Use the linear model to generate y_true.
  2. Add some error to y_true to generate y_pred.
  3. Plot the differences.
  4. Walk through various metrics and plot some of them.

Let's take care of steps 1 and 2 at the same time and just have a function do the work for us. This will be almost the same thing we just saw, but we'll add the ability to specify an error (or bias if a constant):

>>> def data(x, m=2, b=1, e=None, s=10):
...     """
...     Args:
...         x: The x values
...         m: Slope
...         b: Intercept
...         e: Error; if True, add random normal error,
...            if a constant, add it as a bias
...         s: Standard deviation of the random error
...     """
...     if e is None:
...         e_i = 0
...     elif e is True:
...         e_i = np.random.normal(0, s, len(x))
...     else:
...         e_i = e
...     return x * m + b + e_i
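To sanity-check the error handling, here is the function again alongside a couple of calls (the sample array is purely illustrative):

```python
import numpy as np

def data(x, m=2, b=1, e=None, s=10):
    """Simulate y = m*x + b with optional error e."""
    if e is None:
        e_i = 0                               # no error: points fall exactly on the line
    elif e is True:
        e_i = np.random.normal(0, s, len(x))  # Gaussian noise with standard deviation s
    else:
        e_i = e                               # constant bias
    return x * m + b + e_i

x = np.array([0.0, 1.0, 2.0])
print(data(x))        # exact line: [1. 3. 5.]
print(data(x, e=5))   # constant bias of 5: [6. 8. 10.]
```

With e=True, the output is the same line plus zero-mean noise.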

Now that we have the function, let's define y_pred and y_true. We'll do it in a convenient way using partial:

>>> from functools import partial

>>> N = 100
>>> xs = np.sort(np.random.rand(N)*100)

>>> y_pred_gen = partial(data, x=xs, e=True)
>>> y_true_gen = partial(data, x=xs)

>>> y_pred = y_pred_gen()
>>> y_true = y_true_gen()


>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.set_title("Plotting the fit vs the underlying process.")
>>> ax.scatter(xs, y_pred, label=r'$\hat{y}$')
>>> ax.plot(xs, y_true, label=r'$y$')

>>> ax.legend(loc='best')

The output for this code is as follows:

[Figure: "Plotting the fit vs the underlying process." -- the noisy y_pred points scattered around the y_true line]

To confirm the output, let's look at the classical residuals:

>>> e_hat = y_pred - y_true

>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.set_title("Residuals")
>>> ax.hist(e_hat, color='r', alpha=.5, histtype='stepfilled')

The output for the residuals is as follows:

[Figure: "Residuals" -- histogram of the residuals]

The residuals are centered around zero, which is what we'd expect.

How it works...

Now let's move to the metrics.

The first metric is the mean squared error:

MSE(y_true, y_pred) = mean((y_true - y_pred)^2)

You can use the following code to find the value of the mean squared error:

>>> metrics.mean_squared_error(y_true, y_pred)

93.342352628475368

Notice that this metric penalizes large errors more heavily than small ones, since the errors are squared. It's important to remember that all we're doing here is applying what was probably the cost function for the model on the test data.
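Under the hood, the mean squared error is just the average of the squared residuals; here's a quick sketch verifying that against the library (the small arrays are purely illustrative):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.0])

# MSE is the mean of the squared residuals
manual_mse = np.mean((y_true - y_pred) ** 2)
library_mse = metrics.mean_squared_error(y_true, y_pred)

print(manual_mse, library_mse)  # both 0.4375
```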

Another option is the mean absolute deviation. We need to take the absolute value of the differences; if we don't, positive and negative errors cancel and our value will probably be fairly close to zero, the mean of the error distribution:

MAD(y_true, y_pred) = mean(|y_true - y_pred|)
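scikit-learn exposes this metric as metrics.mean_absolute_error. A short sketch showing why the absolute value matters (again with illustrative arrays):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.0])

# Without the absolute value, positive and negative residuals cancel out
signed_mean = np.mean(y_true - y_pred)         # 0.125, misleadingly close to zero
# Taking the absolute value keeps every error positive
manual_mad = np.mean(np.abs(y_true - y_pred))  # 0.625
library_mad = metrics.mean_absolute_error(y_true, y_pred)
```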

The final option is R2 (the coefficient of determination): 1 minus the ratio of the model's squared error to the squared error of the overall mean. As that ratio tends to 0, R2 tends to 1:

>>> metrics.r2_score(y_true, y_pred)

0.9729312117010761

R2 can be deceptive; on its own, it doesn't give the clearest sense of the accuracy of the model.
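The ratio in question can be written out directly, which makes it easy to check metrics.r2_score by hand (the arrays are illustrative):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.0])

# R2 = 1 - (squared error of the model) / (squared error of the mean)
ss_res = np.sum((y_true - y_pred) ** 2)           # 1.75
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # 5.0
manual_r2 = 1 - ss_res / ss_tot                   # 0.65
library_r2 = metrics.r2_score(y_true, y_pred)
```

A model that simply predicts the mean of y_true gets an R2 of 0, which is why a high R2 alone doesn't guarantee a good fit on new data.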
