Chapter 8: Support Vector Regression

Support vector regression (SVR) can be an excellent option when the assumptions of linear regression models do not hold, such as when the relationship between our features and our target is too complicated to be described by a linear combination of weights. Even better, SVR allows us to model that complexity without having to expand the feature space.

Support vector machines identify the hyperplane that maximizes the margin between two classes. The support vectors are the data points closest to the margin that support it, if you will. This turns out to be as useful for regression modeling as it is for classification. SVR, by contrast, finds the hyperplane whose surrounding margin contains the greatest number of data points. We will discuss how that works in the first section of this chapter.

Rather than minimizing the sum of the squared residuals, as ordinary least squares regression does, SVR minimizes the coefficients within an acceptable error range. Like ridge and lasso regression, this can reduce model variance and the risk of overfitting. SVR works best when we are working with a small- to medium-sized dataset.

The algorithm is also quite flexible, allowing us to specify the acceptable error range, use kernels to model nonlinear relationships, and adjust hyperparameters to get the best bias-variance tradeoff possible. We will demonstrate that in this chapter.

In this chapter, we will cover the following topics:

  • Key concepts of SVR
  • SVR with a linear model
  • Using kernels for nonlinear SVR

Technical requirements

In this chapter, we will be working with the scikit-learn and matplotlib libraries, along with pandas, NumPy, and SciPy. You can use pip to install these packages.
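
A single command from your terminal or command prompt should be enough; the package names below are the standard PyPI names:

pip install pandas numpy scipy scikit-learn matplotlib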

Key concepts of SVR

We will start this section by discussing how support vector machines are used for classification. We will not go into much detail here, leaving a detailed discussion of support vector classification to Chapter 13, Support Vector Machine Classification. But starting with support vector machines for classification will lead nicely to an explanation of SVR.

As I discussed at the beginning of this chapter, support vector machines find the hyperplane that maximizes the margin between classes. When there are only two features present, that hyperplane is just a line. Consider the following example plot:

Figure 8.1 – Support vector machine classification based on two features

The two classes in this diagram, represented by red circles and blue squares, are linearly separable using the two features, x1 and x2. The bold line is the decision boundary. It is the line that is furthest from the closest data points of each class; that distance defines the maximum margin. Those closest points are known as the support vectors.

Since the data in the preceding plot is linearly separable, we can use what is known as hard margin classification without problems; that is, we can be strict about all the observations for each class being on the correct side of the decision boundary. But what if our data points look like what’s shown in the following plot?

Figure 8.2 – Support vector machine classification with soft margins

These data points are not linearly separable. In this case, we can choose soft margin classification and ignore the outlier red circles.

We will discuss support vector classification in much greater detail in Chapter 13, Support Vector Machine Classification, but this illustrates some of the key support vector machine concepts. These concepts can be applied well to models involving a continuous target. This is called support vector regression or SVR.

When building an SVR model, we decide on the acceptable amount of prediction error, ε. Errors within ε of our prediction, ŷ, are not penalized; this band around the prediction is sometimes referred to as the epsilon-insensitive tube. SVR minimizes the coefficients subject to all of the data points falling within that tube. This is illustrated in the following plot:

Figure 8.3 – SVR with an acceptable error range

Stated more precisely, SVR minimizes the square of the coefficients, subject to the constraint that the error does not exceed a given amount, ε.

It minimizes ½‖w‖², subject to the constraint that |yᵢ − ŷᵢ| ≤ ε, where w is the vector of weights (or coefficients), yᵢ − ŷᵢ is the actual target value minus the predicted value, and ε is the acceptable amount of error.

Of course, it is not reasonable to expect all the data points to fall within the desired range. But we can still seek to minimize that deviation. Let’s denote the distance of the wayward points from the margin as ξ. This gives us a new objective function.

We minimize ½‖w‖² + C Σξᵢ, with the constraint that |yᵢ − ŷᵢ| ≤ ε + ξᵢ (and ξᵢ ≥ 0), where C is a hyperparameter indicating how tolerant the model should be of errors outside the margin. A value of 0 for C removes the penalty on those errors entirely, while larger values of C make the model increasingly intolerant of points falling outside the tube; as C grows very large, we approach the hard constraint of the original formulation. This is illustrated in the following plot:

Figure 8.4 – SVR with data points outside the acceptable range

Here, we can see several advantages of SVR. Sometimes it matters more that our errors never exceed a certain amount than that we pick the model with the lowest average error; being off by a little often, but by a lot rarely, may be preferable to being spot on most of the time but occasionally way off. Since this approach also minimizes our weights, it has the same advantages as regularization, reducing the likelihood of overfitting.
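
To make the epsilon-insensitive tube concrete, here is a minimal sketch, using hypothetical residuals and an assumed ε of 0.5, that compares the epsilon-insensitive loss with the familiar squared loss:

import numpy as np

# hypothetical residuals (actual minus predicted) and an assumed epsilon
residuals = np.array([0.1, -0.3, 0.6, -1.2, 2.0])
epsilon = 0.5

# errors inside the tube cost nothing; errors outside it are penalized
# only by the amount they exceed epsilon
eps_insensitive_loss = np.maximum(0, np.abs(residuals) - epsilon)

# squared loss penalizes every error, and large errors dominate
squared_loss = residuals ** 2

print(eps_insensitive_loss)   # [0.  0.  0.1 0.7 1.5]
print(squared_loss)           # [0.01 0.09 0.36 1.44 4.  ]

The first two residuals fall inside the tube and contribute nothing to the objective, while the squared loss penalizes every point and is dominated by the largest error.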

Nonlinear SVR and the kernel trick

We have not yet fully addressed the issue of linear separability with SVR. For simplicity, we will return to a classification problem involving two features. Let’s look at a plot of two features against a categorical target. The target has two possible values, represented by the dots and squares. x1 and x2 are numeric and have negative values:

Figure 8.5 – Class labels not linearly separable with two features

What can we do in a case like this to identify a margin between the classes? It is often the case that a margin can be identified at a higher dimension. In this example, we can use a polynomial transformation, as illustrated in the following plot:

Figure 8.6 – Using polynomial transformation to establish the margin

There is now a third dimension, which is the sum of the squares of x1 and x2. The dots are all higher than the squares. This is similar to how we used polynomial transformation in the previous chapter to specify a nonlinear regression model.
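
A quick sketch of this idea with hypothetical points: adding a third feature equal to x1² + x2² lifts one group of points well above the other along the new dimension, so a plane can separate them in three dimensions even though no line separates them in two:

import numpy as np

rng = np.random.default_rng(0)

# hypothetical two-feature data: one class clustered near the origin,
# the other arranged on a ring of radius 3 around it
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
ring = np.column_stack([3 * np.cos(theta), 3 * np.sin(theta)])
cluster = rng.uniform(-1, 1, size=(20, 2))

def add_radial_feature(X):
    # append a third dimension equal to x1**2 + x2**2
    return np.column_stack([X, (X ** 2).sum(axis=1)])

ring3d = add_radial_feature(ring)
cluster3d = add_radial_feature(cluster)

# every ring point has a value of 9 on the new dimension, while no cluster
# point can exceed 2, so a plane at, say, 5 separates the two groups
print(ring3d[:, 2].min(), cluster3d[:, 2].max())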

One drawback of this approach is that we can quickly end up with too many features for our model to perform well. This is where the kernel trick comes in very handy. SVR can use a kernel function to expand the feature space implicitly, without actually creating more features. It does this by computing, for pairs of observations, the similarity they would have in the higher-dimensional space, which is all the algorithm needs to fit a nonlinear margin.

While this allows us to fit a polynomial transformation such as the hypothetical one illustrated in the preceding plot, the most frequently used kernel function with SVR is the radial basis function (RBF). RBF is popular because it is faster than the other common kernel functions and because its gamma parameter makes it very flexible. We will explore how to use it in the last section of this chapter.

But for now, let’s start with a relatively straightforward linear model to see SVR in action.

SVR with a linear model

We often have enough domain knowledge to take an approach that is more nuanced than simply minimizing prediction errors in our training data. Using this knowledge may allow us to accept more bias in our model, when small amounts of bias do not matter much substantively, to reduce variance. With SVR, we can adjust hyperparameters such as epsilon (the acceptable error range) and C (which adjusts the tolerance for errors outside of that range) to improve our model’s performance.

If a linear model can perform well on your data, linear SVR might be a good choice. We can build a linear SVR model with scikit-learn’s LinearSVR class. Let’s try creating a linear SVR model with the gasoline tax data that we used in the previous chapter:

  1. We need many of the same libraries that we used in the previous chapter to create the training and testing DataFrames and to preprocess the data. We also need to import the LinearSVR class from scikit-learn and the uniform distribution from scipy:

    import pandas as pd

    import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.svm import LinearSVR

    from scipy.stats import uniform

    from sklearn.impute import SimpleImputer

    from sklearn.pipeline import make_pipeline

    from sklearn.compose import ColumnTransformer

    from sklearn.compose import TransformedTargetRegressor

    from sklearn.impute import KNNImputer

    from sklearn.model_selection import (cross_validate,

      KFold, GridSearchCV, RandomizedSearchCV)

    import sklearn.metrics as skmet

    import matplotlib.pyplot as plt

  2. We also need to import the OutlierTrans class, which we first discussed in Chapter 7, Linear Regression Models, to handle outliers:

    import os

    import sys

    sys.path.append(os.getcwd() + "/helperfunctions")

    from preprocfunc import OutlierTrans

  3. Next, we load the gasoline tax data and create training and testing DataFrames. We create lists for numerical and binary features, as well as a separate list for motorization_rate. As we saw when we looked at the data in the previous chapter, we need to do a little more preprocessing with motorization_rate.

This dataset contains gasoline tax data for each country in 2014, as well as fuel income dependence and measures of the strength of democratic institutions: polity, democracy_polity, and autocracy_polity. The polity feature is a measure of how democratic a country is, democracy_polity is a binarized polity variable that takes a value of 1 for countries with high polity scores, and autocracy_polity takes a value of 1 for countries with low polity scores:

fftaxrate14 = pd.read_csv("data/fossilfueltaxrate14.csv")

fftaxrate14.set_index('countrycode', inplace=True)

num_cols = ['fuel_income_dependence',

  'national_income_per_cap', 'VAT_Rate',  

  'gov_debt_per_gdp', 'polity','goveffect',

  'democracy_index']

dummy_cols = ['democracy_polity','autocracy_polity',

  'democracy', 'nat_oil_comp','nat_oil_comp_state']

spec_cols = ['motorization_rate']

target = fftaxrate14[['gas_tax_imp']]

features = fftaxrate14[num_cols + dummy_cols + spec_cols]

X_train, X_test, y_train, y_test = train_test_split(features,

  target, test_size=0.2, random_state=0)

  4. Let’s look at summary statistics for the training data. We will need to standardize the data since the features have dramatically different ranges and SVR performs much better on standardized data. Also, notice that motorization_rate has a lot of missing values. We may want to do better than simple imputation for that feature. We have decent non-missing counts for the dummy columns:

    X_train.shape

    (123, 13)

    X_train[num_cols + spec_cols].agg(['count','min','median','max']).T

                          count min    median   max

    fuel_income_dependence  121 0.00   0.10     34.23

    national_income_per_cap 121 260.00 6,110.00 104,540.00

    VAT_Rate                121 0.00   16.00    27.00

    gov_debt_per_gdp        112 1.56   38.45    194.76

    polity                  121 -10.00 6.00     10.00

    goveffect               123 -2.04  -0.10    2.18

    democracy_index         121 0.03   0.54     0.93

    motorization_rate       100 0.00   0.20     0.81

    X_train[dummy_cols].apply(pd.value_counts, normalize=True).T

                                        0.00         1.00

    democracy_polity                    0.42         0.58

    autocracy_polity                    0.88         0.12

    democracy                           0.41         0.59

    nat_oil_comp                        0.54         0.46

    nat_oil_comp_state                  0.76         0.24

    X_train[dummy_cols].count()

    democracy_polity           121

    autocracy_polity           121

    democracy                  123

    nat_oil_comp               121

    nat_oil_comp_state         121

  5. We need to build a column transformer to handle different data types. We can use SimpleImputer for the categorical features and numerical features, except for motorization_rate. We will use KNN imputation for the motorization_rate feature later:

    standtrans = make_pipeline(OutlierTrans(2),

    SimpleImputer(strategy="median"), StandardScaler())

    cattrans = make_pipeline(SimpleImputer(strategy="most_frequent"))

    spectrans = make_pipeline(OutlierTrans(2), StandardScaler())

    coltrans = ColumnTransformer(

      transformers=[

        ("stand", standtrans, num_cols),

        ("cat", cattrans, dummy_cols),

        ("spec", spectrans, spec_cols)

      ]

    )

  6. Now, we are ready to fit our linear SVR model. We will choose a value of 0.2 for epsilon. This means that we are fine with any error within 0.2 standard deviations of the actual value (we use TransformedTargetRegressor to standardize the target). We will leave C – the hyperparameter determining our model’s tolerance for values outside of epsilon – at its default value of 1.0.

Before we fit our model, we still need to handle missing values for motorization_rate. We will add the KNN imputer to a pipeline after the column transformations. Since motorization_rate will be the only feature with missing values after the column transformations, the KNN imputer only changes values for that feature.

We need to use the target transformer because the column transformer will only change the features, not the target. We will pass the pipeline we just created to the target transformer’s regressor parameter to do the feature transformations, and indicate that we just want to do standard scaling for the target.

Note that the default loss function for linear SVR is the L1 (epsilon-insensitive) loss, but we could have chosen the squared (L2) version instead, as sketched after the following code:

svr = LinearSVR(epsilon=0.2, max_iter=10000,

  random_state=0)

pipe1 = make_pipeline(coltrans,

  KNNImputer(n_neighbors=5), svr)

ttr=TransformedTargetRegressor(regressor=pipe1,

  transformer=StandardScaler())

ttr.fit(X_train, y_train)
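
If we did want the squared epsilon-insensitive loss, a minimal sketch of the alternative estimator might look like this (we stick with the default everywhere else in this chapter):

svr_l2 = LinearSVR(epsilon=0.2, max_iter=10000,
  loss='squared_epsilon_insensitive', random_state=0)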

  7. We can use ttr.regressor_ to access all the elements of the pipeline, including the linearsvr object. This is how we get to the coef_ attribute. The coefficients that are substantially different from 0 are VAT_Rate and the autocracy and national oil company dummies. Our model estimates a positive relationship between value-added tax rates and gasoline taxes, all else being equal. It estimates a negative relationship between having an autocracy or having a national oil company, and gasoline taxes:

    coefs = ttr.regressor_['linearsvr'].coef_

    np.column_stack((coefs.ravel(), num_cols + dummy_cols + spec_cols))

    array([['-0.03040694175014407', 'fuel_income_dependence'],

           ['0.10549935644031803', 'national_income_per_cap'],

           ['0.49519936241642026', 'VAT_Rate'],

           ['0.0857845735264331', 'gov_debt_per_gdp'],

           ['0.018198547504343885', 'polity'],

           ['0.12656984468734492', 'goveffect'],

           ['-0.09889163752261303', 'democracy_index'],

           ['-0.036584519840546594', 'democracy_polity'],

           ['-0.5446613604546718', 'autocracy_polity'],

           ['0.033234557366924815', 'democracy'],

           ['-0.2048732386478349', 'nat_oil_comp'],

           ['-0.6142887840649164', 'nat_oil_comp_state'],

           ['0.14488410358761755', 'motorization_rate']], dtype='<U32')
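
To make the coefficients a little easier to scan, we could also load them into a pandas Series and sort by absolute size; a quick sketch:

    coefseries = pd.Series(coefs.ravel(), index=num_cols + dummy_cols + spec_cols)

    coefseries.reindex(coefseries.abs().sort_values(ascending=False).index)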

Notice that we have not done any feature selection here. Instead, we are relying on the regularization built into SVR to push the coefficients of less useful features toward 0. If we had many more features, or we were more concerned about computation time, it would be important to think about our feature selection strategy more carefully.

  8. Let’s do some cross-validation on this model. The mean absolute error and r-squared are not great, though that is certainly impacted by the small sample size:

    kf = KFold(n_splits=3, shuffle=True, random_state=0)

    ttr.fit(X_train, y_train)

    scores = cross_validate(ttr, X=X_train, y=y_train,

      cv=kf, scoring=('r2', 'neg_mean_absolute_error'),

       n_jobs=1)

    print("Mean Absolute Error: %.2f, R-squared: %.2f" %

      (scores['test_neg_mean_absolute_error'].mean(),

      scores['test_r2'].mean()))

    Mean Absolute Error: -0.26, R-squared: 0.57
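
If we want to see how much the scores vary across folds, rather than just their means, we can print the individual fold scores stored in the dictionary that cross_validate returns; a quick sketch:

    print(scores['test_r2'])

    print(scores['test_neg_mean_absolute_error'])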

We have not done any hyperparameter tuning yet. We do not know if our values for epsilon and C are the best ones for our model. Therefore, we need to do a grid search to experiment with different hyperparameter values. We will start with an exhaustive grid search, which is often not practical (I recommend not running the next few steps on your machine unless you have a fairly high-performing one). After the exhaustive grid search, we will do a randomized grid search, which is usually substantially easier on system resources.

  9. We will start by creating a LinearSVR object without the epsilon hyperparameter specified, and we will recreate the pipeline. Then, we will create a dictionary, svr_params, with values to check for epsilon and C, keyed as regressor__linearsvr__epsilon and regressor__linearsvr__C, respectively.

Remember from the grid search in the previous chapter that the names of these keys must correspond to our pipeline steps. Our pipeline, which in this case can be accessed as the transformed target regressor's regressor object, has a linearsvr object with attributes for epsilon and C.
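
If we are ever unsure of the exact key names to use, we can list the available parameters directly from the estimator we will be tuning; a quick sketch that filters for the linearsvr step:

[k for k in ttr.get_params().keys() if 'linearsvr' in k]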

We will pass the svr_params dictionary to a GridSearchCV object and indicate that we want the scoring to be based on r-squared (if we wanted to base scoring on the mean absolute error, we could have also done that).

Then, we will run the fit method of the grid search object. I should repeat the warning I mentioned previously that you may not want to run an exhaustive grid search unless you are using a high-performing machine, or you do not mind letting it run while you go get a cup of coffee. Note that it takes about 26 seconds to run each time on my machine:

svr = LinearSVR(max_iter=100000, random_state=0)

pipe1 = make_pipeline(coltrans,

  KNNImputer(n_neighbors=5), svr)

ttr=TransformedTargetRegressor(regressor=pipe1,

  transformer=StandardScaler())

svr_params = {

  'regressor__linearsvr__epsilon': np.arange(0.1, 1.6, 0.1),

  'regressor__linearsvr__C': np.arange(0.1, 1.6, 0.1)

}

gs = GridSearchCV(ttr,param_grid=svr_params, cv=3,

  scoring='r2')

%timeit gs.fit(X_train, y_train)

26.2 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  10. Now, we can use the best_params_ attribute of the grid search to get the hyperparameters associated with the highest score. We can see the score for those parameters with the best_score_ attribute. This tells us that we get the highest r-squared, about 0.6, with a C of 0.1 and an epsilon value of 0.2:

    gs.best_params_

    {'regressor__linearsvr__C': 0.1, 'regressor__linearsvr__epsilon': 0.2}

    gs.best_score_

    0.599751107082899
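
If we want to see how the other hyperparameter combinations fared, not just the winner, we can load the grid search's cv_results_ attribute into a DataFrame; a minimal sketch:

    cvresults = pd.DataFrame(gs.cv_results_)

    cvresults.sort_values('rank_test_score')[['params','mean_test_score']].head()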

It is good to know which values to choose for our hyperparameters. However, the exhaustive grid search was quite expensive computationally. Let’s try a randomized search instead.

  11. We will indicate that the random values for both epsilon and C should come from a uniform distribution with values between 0 and 1.5. Then, we will pass that dictionary to a RandomizedSearchCV object. This runs substantially faster than the exhaustive grid search – a little over 1 second per loop. It also gives us higher values for both C and epsilon than the exhaustive grid search did: roughly 0.23 for C and 0.70 for epsilon. The r-squared value is a little lower, however:

    svr_params = {

    'regressor__linearsvr__epsilon': uniform(loc=0, scale=1.5),

    'regressor__linearsvr__C': uniform(loc=0, scale=1.5)

    }

    rs = RandomizedSearchCV(ttr, svr_params, cv=3, scoring='r2')

    %timeit rs.fit(X_train, y_train)

    1.21 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    rs.best_params_

    {'regressor__linearsvr__C': 0.23062453444814285,

    'regressor__linearsvr__epsilon': 0.6976844872643301}

    rs.best_score_

    0.5785452537781279

  12. Let’s look at the predictions based on the best model from the randomized grid search. The randomized grid search object’s predict method can generate those predictions for us:

    pred = rs.predict(X_test)

    preddf = pd.DataFrame(pred, columns=['prediction'],

      index=X_test.index).join(X_test).join(y_test)

    preddf['resid'] = preddf.gas_tax_imp-preddf.prediction
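
Since we imported sklearn.metrics as skmet at the start, we can also summarize the test-set fit directly; a brief sketch:

    print("Test MAE: %.2f, Test R-squared: %.2f" %
      (skmet.mean_absolute_error(y_test, pred),
      skmet.r2_score(y_test, pred)))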

  13. Now, let’s look at the distribution of our residuals:

    plt.hist(preddf.resid, color="blue", bins=np.arange(-0.5,1.0,0.25))

    plt.axvline(preddf.resid.mean(), color='red', linestyle='dashed', linewidth=1)

    plt.title("Histogram of Residuals for Gas Tax Model")

    plt.xlabel("Residuals")

    plt.ylabel("Frequency")

    plt.xlim()

    plt.show()

This produces the following plot:

Figure 8.7 – Residual distribution for the gasoline tax linear SVR model

Here, there is a little bit of bias (some overpredicting overall) and some positive skew.
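
We can put rough numbers on that impression by summarizing the residuals, as in the following sketch:

    preddf.resid.agg(['mean','median','skew'])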

  14. Let’s also view a scatterplot of the predicted values against the residuals:

    plt.scatter(preddf.prediction, preddf.resid, color="blue")

    plt.axhline(0, color='red', linestyle='dashed', linewidth=1)

    plt.title("Scatterplot of Predictions and Residuals")

    plt.xlabel("Predicted Gas Tax")

    plt.ylabel("Residuals")

    plt.show()

This produces the following plot:

Figure 8.8 – Scatterplot of predictions and residuals for the gasoline tax linear SVR model

These residuals are problematic. We are always overpredicting (predicted values are higher than actual values) at the lower and upper range of the predicted values. This is not what we want and is perhaps warning us of an unaccounted-for nonlinear relationship.

When the relationship between our features and the target is roughly linear, linear SVR can be an efficient choice. It can be used in many of the same situations where we would have used linear regression or linear regression with regularization. Its relative efficiency means we are not as concerned about using it with datasets that contain more than 10,000 observations as we are with nonlinear SVR. However, when a linear model will not do, we should explore nonlinear SVR.

Using kernels for nonlinear SVR

Recall from our discussion at the beginning of this chapter that we can use a kernel function to fit a nonlinear epsilon-insensitive tube. In this section, we will run a nonlinear SVR with the land temperatures data that we worked with in the previous chapter. But first, we will construct a linear SVR with the same data for comparison.

We will model the average temperature for weather stations as a function of latitude and elevation. Follow these steps:

  1. We will begin by loading the familiar libraries. The only new class is SVR from scikit-learn:

    import pandas as pd

    import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn.svm import LinearSVR, SVR

    from scipy.stats import uniform

    from sklearn.impute import SimpleImputer

    from sklearn.pipeline import make_pipeline

    from sklearn.compose import TransformedTargetRegressor

    from sklearn.impute import KNNImputer

    from sklearn.model_selection import RandomizedSearchCV

    import sklearn.metrics as skmet

    import matplotlib.pyplot as plt

    import os

    import sys

    sys.path.append(os.getcwd() + "/helperfunctions")

    from preprocfunc import OutlierTrans

  2. Next, we will load the land temperatures data and create training and testing DataFrames. We will also take a look at some descriptive statistics. There are several missing values for elevation and the ranges of the two features are very different. There are also some exceedingly low average temperatures:

    landtemps = pd.read_csv("data/landtempsb2019avgs.csv")

    landtemps.set_index('locationid', inplace=True)

    feature_cols = ['latabs','elevation']

    landtemps[['avgtemp'] + feature_cols].agg(['count','min','median','max']).T

                   count       min     median     max

    avgtemp        12,095     -61      10         34

    latabs         12,095      0       41         90

    elevation      12,088     -350     271        4,701

    X_train, X_test, y_train, y_test = train_test_split(landtemps[feature_cols],

      landtemps[['avgtemp']], test_size=0.1, random_state=0)

  3. Let’s start with a linear SVR model of average temperatures. We can be fairly conservative in how we handle outliers, only setting values to missing when they are more than three times the interquartile range above the third quartile or below the first quartile. (We created the OutlierTrans class in Chapter 7, Linear Regression Models.) We will use KNN imputation for the missing elevation values and scale the data. Remember that we need to use the target transformer to scale the target variable.

Just as we did in the previous section, we will use a dictionary, svr_params, to indicate that we want to sample values from a uniform distribution for our hyperparameters – that is, epsilon and C. We will pass this dictionary to the RandomizedSearchCV object.

After running fit, we can get the best parameters for epsilon and C, and the mean absolute error for the best model. The mean absolute error is fairly decent at about 2.8 degrees:

svr = LinearSVR(epsilon=1.0, max_iter=100000)

knnimp = KNNImputer(n_neighbors=45)

pipe1 = make_pipeline(OutlierTrans(3), knnimp, StandardScaler(), svr)

ttr=TransformedTargetRegressor(regressor=pipe1,

  transformer=StandardScaler())

svr_params = {

'regressor__linearsvr__epsilon': uniform(loc=0, scale=1.5),

'regressor__linearsvr__C': uniform(loc=0, scale=20)

}

rs = RandomizedSearchCV(ttr, svr_params, cv=10, scoring='neg_mean_absolute_error')

rs.fit(X_train, y_train)

rs.best_params_

{'regressor__linearsvr__C': 15.07662849482442,

'regressor__linearsvr__epsilon': 0.06750238486004034}

rs.best_score_

-2.769283402595076

  4. Let’s look at the predictions:

    pred = rs.predict(X_test)

    preddf = pd.DataFrame(pred, columns=['prediction'],

      index=X_test.index).join(X_test).join(y_test)

    preddf['resid'] = preddf.avgtemp-preddf.prediction

    plt.scatter(preddf.prediction, preddf.resid, color="blue")

    plt.axhline(0, color='red', linestyle='dashed', linewidth=1)

    plt.title("Scatterplot of Predictions and Residuals")

    plt.xlabel("Predicted Gas Tax")

    plt.ylabel("Residuals")

    plt.show()

This produces the following plot:

Figure 8.9 – Scatterplot of predictions and residuals for the land temperatures linear SVR model

There is a good amount of overpredicting at the upper range of the predicted values. We typically underpredict values just below that, for predicted temperatures between roughly 15 and 25 degrees. Perhaps we can improve the fit with a nonlinear model.

  5. We do not have to change much to run a nonlinear SVR. We just need to create an SVR object and choose a kernel function. rbf is typically selected. (You may not want to fit this model on your machine unless you are using good hardware, or do not mind doing something else for a while and coming back for your results.) Take a look at the following code:

    svr = SVR(kernel='rbf')

    pipe1 = make_pipeline(OutlierTrans(3), knnimp, StandardScaler(), svr)

    ttr=TransformedTargetRegressor(regressor=pipe1,

      transformer=StandardScaler())

    svr_params = {

    'regressor__svr__epsilon': uniform(loc=0, scale=5),

    'regressor__svr__C': uniform(loc=0, scale=20),

    'regressor__svr__gamma': uniform(loc=0, scale=100)

    }

    rs = RandomizedSearchCV(ttr, svr_params, cv=10, scoring='neg_mean_absolute_error')

    rs.fit(X_train, y_train)

    rs.best_params_

    {'regressor__svr__C': 5.3715128489311255,

    'regressor__svr__epsilon': 0.03997496426101643,

    'regressor__svr__gamma': 53.867632383007994}

    rs.best_score_

    -2.1319240416548775

There is a noticeable improvement in terms of the mean absolute error. Here, we can see that the gamma and C hyperparameters are doing a fair bit of work for us. If we are okay being about 2 degrees off on average, this model gets us there.

We go into much more detail regarding the gamma and C hyperparameters in Chapter 13, Support Vector Machine Classification. We also explore other kernels besides the rbf kernel.
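
If we wanted to let the search choose the kernel as well, we could add a list of kernel names to the parameter dictionary, since RandomizedSearchCV samples from lists as well as from distributions. A sketch, assuming the same pipeline naming as above:

    # the kernel is sampled from the list; the other hyperparameters
    # are drawn from uniform distributions as before
    svr_params = {
      'regressor__svr__kernel': ['rbf', 'poly', 'sigmoid'],
      'regressor__svr__epsilon': uniform(loc=0, scale=5),
      'regressor__svr__C': uniform(loc=0, scale=20),
      'regressor__svr__gamma': uniform(loc=0, scale=100)
    }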

  6. Let’s look again at the residuals to see if there is something problematic in how our errors are distributed, as was the case with our linear model:

    pred = rs.predict(X_test)

    preddf = pd.DataFrame(pred, columns=['prediction'],

      index=X_test.index).join(X_test).join(y_test)

    preddf['resid'] = preddf.avgtemp-preddf.prediction

    plt.scatter(preddf.prediction, preddf.resid, color="blue")

    plt.axhline(0, color='red', linestyle='dashed', linewidth=1)

    plt.title("Scatterplot of Predictions and Residuals")

    plt.xlabel("Predicted Gas Tax")

    plt.ylabel("Residuals")

    plt.show()

This produces the following plot:

Figure 8.10 – Scatterplot of predictions and residuals for the land temperatures nonlinear SVR model

These residuals look substantially better than those for the linear model.

This illustrates how using a kernel function can increase the complexity of our model without us having to increase the feature space. By using the rbf kernel and adjusting the C and gamma hyperparameters, we address some of the underfitting we saw with the linear model. This is one of the great advantages of nonlinear SVR. The disadvantage, as we also saw, was that it was quite taxing on system resources. A dataset that contains 12,000 observations is at the upper limit of what can be handled easily with nonlinear SVR, particularly with a grid search for the best hyperparameters.

Summary

The examples in this chapter illustrated some of the advantages of SVR. The algorithm allows us to adjust hyperparameters to address underfitting or overfitting. This can be done without increasing the number of features. SVR is also less sensitive to outliers than methods such as linear regression.

When we can build a good model with linear SVR, it is a perfectly reasonable choice. It can be trained much faster than a nonlinear model. However, we can often improve performance with a nonlinear SVR, as we saw in the last section of this chapter.

This discussion leads us to what we will explore in the next chapter, where we will look at two popular non-parametric regression algorithms: k-nearest neighbors and decision tree regression. These two algorithms make almost no assumptions about the distribution of our features and targets. Similar to SVR, they can capture complicated relationships in the data without increasing the feature space.
