In this recipe, we'll use the Gaussian process for regression. In the linear models section, we saw how representing prior information on the coefficients was possible using Bayesian Ridge Regression.
With a Gaussian process, it's about the variance and not the mean. However, with a Gaussian process, we assume the mean is 0, so it's the covariance function we'll need to specify.
The basic setup is similar to how a prior can be put on the coefficients in a typical regression problem. With a GP, a prior can be put on the functional form of the data, and it's the covariance between the data points that is used to model the data, and therefore, must be fit from the data.
So, let's use some regression data and walkthrough how Gaussian processes work in scikit-learn:
>>> from sklearn.datasets import load_boston >>> boston = load_boston() >>> boston_X = boston.data >>> boston_y = boston.target >>> train_set = np.random.choice([True, False], len(boston_y), p=[.75, .25])
Now that we have the data, we'll create a scikit-learn GaussianProcess
object. By default, it uses a constant regression function and squared exponential correlation, which is one of the more common choices:
>>> from sklearn.gaussian_process import GaussianProcess >>> gp = GaussianProcess() >>> gp.fit(boston_X[train_set], boston_y[train_set]) GaussianProcess(beta0=None, corr=<function squared_exponential at 0x110809488>, normalize=True, nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla', random_start=1, random_state=<mtrand.RandomState object at 0x10b9b58b8>, regr=<function constant at 0x1108090c8>, storage_mode='full', theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
That's a formidable object definition. The following are a couple of things to point out:
beta0
: This is the regression weight. This defaults in a way such that MLE is used for estimation.corr
: This is the correlation function. There are several built-in correlation functions. We'll look at more of them in the following How it works... section.regr
: This is the constant regression function.nugget
: This is the regularization parameter. It defaults to a very small number. You can either pass one value to be used for each data point or a single value that needs to be applied uniformly.normalize
: This defaults to True
, and it will center and scale the features. This would be scale is R.Okay, so now that we fit the object, let's look at it's performance against the test object:
>>> test_preds = gp.predict(boston_X[~train_set])
Let's plot the predicted values versus the actual values; then, because we're doing regression, it's probably a good idea to look at plotted residuals and a histogram of the residuals:
>>> from matplotlib import pyplot as plt >>> f, ax = plt.subplots(figsize=(10, 7), nrows=3) >>> f.tight_layout() >>> ax[0].plot(range(len(test_preds)), test_preds, label='Predicted Values'); >>> ax[0].plot(range(len(test_preds)), boston_y[~train_set], label='Actual Values'); >>> ax[0].set_title("Predicted vs Actuals") >>> ax[0].legend(loc='best') >>> ax[1].plot(range(len(test_preds)), test_preds - boston_y[~train_set]); >>> ax[1].set_title("Plotted Residuals") >>> ax[2].hist(test_preds - boston_y[~train_set]); >>> ax[2].set_title("Histogram of Residuals")
The output is as follows:
Now that we've worked through a very quick example, let's look a little more at what some of the parameters do and how we can tune them based on the model we're trying to fit.
First, let's try to understand what's going on with the corr
function. This function describes the relationship between the different pairs of X. The following five different correlation functions are offered by scikit-learn:
absolute_exponential
squared_exponential
generalized_exponential
cubic
linear
For example, the squared exponential has the following form:
Linear, on the other hand, is just the dot product of the two points in question:
Another parameter of interest is theta0
. This represents the starting point in the estimation of the the parameters.
Once we have an estimation of K and the mean, the process is fully specified due to it being a Gaussian process; emphasis is put on Gaussian, a reason it's so popular for general machine learning work.
Let's use a different regr
function, apply a different theta0
, and look at how the predictions differ:
>>> gp = GaussianProcess(regr='linear', theta0=5e-1) >>> gp.fit(boston_X[train_set], boston_y[train_set]); >>> linear_preds = gp.predict(boston_X[~train_set]) >>> f, ax = plt.subplots(figsize=(7, 5))
Let's have a look at the fit:
>>> f.tight_layout() >>> ax.hist(test_preds - boston_y[~train_set], label='Residuals Original', color='b', alpha=.5); >>> ax.hist(linear_preds - boston_y[~train_set], label='Residuals Linear', color='r', alpha=.5); >>> ax.set_title("Residuals") >>> ax.legend(loc='best')
The following is the output:
Clearly, the second model's predictions are slightly better for the most part. If we want to sum this up, we can look at the MSE of the predictions:
>>> np.power(test_preds - boston_y[~train_set], 2).mean() 26.254844099612455 >>> np.power(linear_preds - boston_y[~train_set], 2).mean() 21.938924337056068
We might want to understand the uncertainty in our estimates. When we make the predictions, if we pass the eval_MSE
argument as True
, we'll get MSE and the predicted values. From a mechanics standpoint, a tuple of predictions and MSE is returned:
>>> test_preds, MSE = gp.predict(boston_X[~train_set], eval_MSE=True) >>> MSE[:5] array([ 11.95314572, 8.48397825, 6.0287539 , 29.20844347, 0.36427829])
So, now that we have errors in the estimates (unfortunately), let's plot the first few to get an indication of accuracy:
>>> f, ax = plt.subplots(figsize=(7, 5)) >>> n = 20 >>> rng = range(n) >>> ax.scatter(rng, test_preds[:n]) >>> ax.errorbar(rng, test_preds[:n], yerr=1.96*MSE[:n]) >>> ax.set_title("Predictions with Error Bars") >>> ax.set_xlim((-1, 21));
The following is the output:
As you can see, there's quite a bit of variance in the estimates for a lot of these points. On the other hand, our overall error wasn't too bad.