Support vector regression (SVR) can be an excellent option when the assumptions of linear regression models do not hold, such as when the relationship between our features and our target is too complicated to be described by a linear combination of weights. Even better, SVR allows us to model that complexity without having to expand the feature space.
Support vector machines identify the hyperplane that maximizes the margin between two classes. The support vectors are the data points closest to the margin that support it, if you will. This turns out to be as useful for regression modeling as it is for classification. SVR, rather than maximizing the margin between classes, finds the hyperplane whose margin contains the greatest number of data points. We will discuss how that works in the first section of this chapter.
Rather than minimizing the sum of the squared residuals, as ordinary least squares regression does, SVR minimizes the coefficients within an acceptable error range. Like ridge and lasso regression, this can reduce model variance and the risk of overfitting. SVR works best when we are working with a small- to medium-sized dataset.
The algorithm is also quite flexible, allowing us to specify the acceptable error range, use kernels to model nonlinear relationships, and adjust hyperparameters to get the best bias-variance tradeoff possible. We will demonstrate that in this chapter.
In this chapter, we will cover the following topics:
In this chapter, we will be working mainly with the scikit-learn and matplotlib libraries, along with pandas, NumPy, and SciPy. You can use pip to install these packages.
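For example, you might run the following from a terminal (adjust it for your environment; for instance, you may prefer conda):

pip install pandas numpy scipy scikit-learn matplotlib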
We will start this section by discussing how support vector machines are used for classification. We will not go into much detail here, leaving a detailed discussion of support vector classification to Chapter 13, Support Vector Machine Classification. But starting with support vector machines for classification will lead nicely to an explanation of SVR.
As I discussed at the beginning of this chapter, support vector machines find the hyperplane that maximizes the margin between classes. When there are only two features present, that hyperplane is just a line. Consider the following example plot:
The two classes in this diagram, represented by red circles and blue squares, are linearly separable using the two features, x1 and x2. The bold line is the decision boundary. It is the line that is as far as possible from the border data points of each class, giving the maximum margin. These border points are known as the support vectors.
Since the data in the preceding plot is linearly separable, we can use what is known as hard margin classification without problems; that is, we can be strict about all the observations for each class being on the correct side of the decision boundary. But what if our data points look like what’s shown in the following plot?
These data points are not linearly separable. In this case, we can choose soft margin classification, which tolerates some observations, such as the outlier red circles, falling on the wrong side of the decision boundary.
We will discuss support vector classification in much greater detail in Chapter 13, Support Vector Machine Classification, but this illustrates some of the key support vector machine concepts. These concepts can be applied well to models involving a continuous target. This is called support vector regression or SVR.
When building an SVR model, we decide on the acceptable amount of prediction error, ε. Errors within ε of our prediction, $\hat{y} = \beta_{0} + \beta_{1}x$ in a one-feature model, are not penalized. This range is sometimes referred to as the epsilon-insensitive tube. SVR minimizes the coefficients consistent with all data points falling within that range. This is illustrated in the following plot:
Stated more precisely, SVR minimizes the square of the coefficients, subject to the constraint that the errors do not exceed a given amount, ε.
It minimizes $\frac{1}{2}\|w\|^{2}$ with the constraint that $|y_{i} - \hat{y}_{i}| \le \varepsilon$, where $w$ is a vector of weights (or coefficients), $y_{i} - \hat{y}_{i}$ is the actual target value minus the predicted value, and $\varepsilon$ is the acceptable amount of error.
Of course, it is not reasonable to expect all the data points to fall within the desired range. But we can still seek to minimize that deviation. Let’s denote the distance of the wayward points from the margin as ξ. This gives us a new objective function.
We minimize $\frac{1}{2}\|w\|^{2} + C\sum_{i}\xi_{i}$ with the constraints that $|y_{i} - \hat{y}_{i}| \le \varepsilon + \xi_{i}$ and $\xi_{i} \ge 0$, where C is a hyperparameter indicating how tolerant the model should be of errors outside the margin. A value of 0 for C means that errors beyond the margin are not penalized at all, reducing the objective to the original one, while larger values of C make the model increasingly intolerant of those errors.
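To make the epsilon-insensitive idea concrete, here is a minimal NumPy sketch (the residuals are made up for illustration) showing how much each error contributes beyond the margin:

import numpy as np
epsilon = 0.5
residuals = np.array([0.1, -0.3, 0.6, -1.2])  # actual minus predicted
# errors within epsilon of the prediction contribute nothing; larger errors
# are penalized only by the amount they exceed epsilon (the slack, xi)
xi = np.maximum(0, np.abs(residuals) - epsilon)
print(xi)  # roughly [0. 0. 0.1 0.7]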
Here, we can see several advantages of SVR. It is sometimes more important that our errors not exceed a certain amount than that we pick the model with the lowest average error. It may matter more if we are often off by a little but rarely by a lot than if we are often spot on but occasionally way off. Since this approach also minimizes our weights, it has the same advantages as regularization, reducing the likelihood of overfitting.
We have not yet fully addressed the issue of linear separability with SVR. For simplicity, we will return to a classification problem involving two features. Let’s look at a plot of two features against a categorical target. The target has two possible values, represented by the dots and squares. x1 and x2 are numeric and have negative values:
What can we do in a case like this to identify a margin between the classes? It is often the case that a margin can be identified at a higher dimension. In this example, we can use a polynomial transformation, as illustrated in the following plot:
There is now a third dimension, which is the sum of the squares of x1 and x2. The dots are all higher than the squares. This is similar to how we used polynomial transformation in the previous chapter to specify a nonlinear regression model.
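As a rough sketch of that idea (the points below are made up), the new dimension is just a derived feature, so the transformation itself is a one-liner:

import numpy as np
# hypothetical points: an inner cluster and an outer ring in the x1-x2 plane
x1 = np.array([0.1, -0.2, 0.1, 1.5, -1.4, 1.3])
x2 = np.array([0.2, 0.1, -0.3, 1.2, -1.6, -1.5])
# the added third dimension separates the two groups by distance from the origin
x3 = x1**2 + x2**2
print(x3.round(2))  # small values for the inner points, large for the outer ones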
One drawback of this approach is that we can quickly end up with too many features for our model to perform well. This is where the kernel trick comes in very handy. SVR can use a kernel function to expand the feature space implicitly, without actually creating more features. It does this by computing the relationships between pairs of data points as if they had been mapped into a higher-dimensional space, and those values can be used to fit a nonlinear margin.
While this allows us to fit a polynomial transformation such as the hypothetical one illustrated in the preceding plot, the most frequently used kernel function with SVR is the radial basis function (RBF). RBF is popular because it is faster than the other common kernel functions and because its gamma parameter makes it very flexible. We will explore how to use it in the last section of this chapter.
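For intuition, the RBF kernel scores the similarity of two points as a function of the squared distance between them, with gamma controlling how quickly that similarity decays. Here is a minimal sketch of the formula (not the library's internal implementation):

import numpy as np
def rbf_kernel(x1, x2, gamma=1.0):
    # similarity decays exponentially with squared distance;
    # a larger gamma makes each point's influence more local
    return np.exp(-gamma * np.sum((x1 - x2)**2))
a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(rbf_kernel(a, b, gamma=0.1), rbf_kernel(a, b, gamma=5.0))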
But for now, let’s start with a relatively straightforward linear model to see SVR in action.
We often have enough domain knowledge to take an approach that is more nuanced than simply minimizing prediction errors on our training data. Using this knowledge may allow us to accept a little more bias in our model, when small amounts of bias do not matter much substantively, in exchange for reduced variance. With SVR, we can adjust hyperparameters such as epsilon (the acceptable error range) and C (which adjusts the tolerance for errors outside of that range) to improve our model's performance.
If a linear model can perform well on your data, linear SVR might be a good choice. We can build a linear SVR model with scikit-learn’s LinearSVR class. Let’s try creating a linear SVR model with the gasoline tax data that we used in the previous chapter:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
from scipy.stats import uniform
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_validate, \
  KFold, GridSearchCV, RandomizedSearchCV
import sklearn.metrics as skmet
import matplotlib.pyplot as plt
import os
import sys
sys.path.append(os.getcwd() + "/helperfunctions")
from preprocfunc import OutlierTrans
This dataset contains gasoline tax data for each country in 2014, as well as fuel income dependence and measures of the strength of democratic institutions: polity, democracy_polity, and autocracy_polity. The polity feature is a measure of how democratic a country is. democracy_polity is a binarized polity variable, taking on a value of 1 for countries with high polity scores, and autocracy_polity has a value of 1 for countries with low polity scores:
fftaxrate14 = pd.read_csv("data/fossilfueltaxrate14.csv")
fftaxrate14.set_index('countrycode', inplace=True)
num_cols = ['fuel_income_dependence',
'national_income_per_cap', 'VAT_Rate',
'gov_debt_per_gdp', 'polity','goveffect',
'democracy_index']
dummy_cols = ['democracy_polity','autocracy_polity',
  'democracy', 'nat_oil_comp','nat_oil_comp_state']
spec_cols = ['motorization_rate']
target = fftaxrate14[['gas_tax_imp']]
features = fftaxrate14[num_cols + dummy_cols + spec_cols]
X_train, X_test, y_train, y_test =  \
  train_test_split(features,
  target, test_size=0.2, random_state=0)
X_train.shape
(123, 13)
X_train[num_cols + spec_cols].\
  agg(['count','min','median','max']).T
count min median max
fuel_income_dependence 121 0.00 0.10 34.23
national_income_per_cap 121 260.00 6,110.00 104,540.00
VAT_Rate 121 0.00 16.00 27.00
gov_debt_per_gdp 112 1.56 38.45 194.76
polity 121 -10.00 6.00 10.00
goveffect 123 -2.04 -0.10 2.18
democracy_index 121 0.03 0.54 0.93
motorization_rate 100 0.00 0.20 0.81
X_train[dummy_cols].apply(pd.value_counts, normalize=True).T
0.00 1.00
democracy_polity 0.42 0.58
autocracy_polity 0.88 0.12
democracy 0.41 0.59
nat_oil_comp 0.54 0.46
nat_oil_comp_state 0.76 0.24
X_train[dummy_cols].count()
democracy_polity 121
autocracy_polity 121
democracy 123
nat_oil_comp 121
nat_oil_comp_state 121
standtrans = make_pipeline(OutlierTrans(2),
SimpleImputer(strategy="median"), StandardScaler())
cattrans = make_pipeline(SimpleImputer(strategy="most_frequent"))
spectrans = make_pipeline(OutlierTrans(2), StandardScaler())
coltrans = ColumnTransformer(
transformers=[
("stand", standtrans, num_cols),
("cat", cattrans, dummy_cols),
("spec", spectrans, spec_cols)
]
)
Before we fit our model, we still need to handle missing values for motorization_rate. We will add the KNN imputer to a pipeline after the column transformations. Since motorization_rate will be the only feature with missing values after the column transformations, the KNN imputer only changes values for that feature.
We need to use the target transformer because the column transformer will only change the features, not the target. We will pass the pipeline we just created to the target transformer’s regressor parameter to do the feature transformations, and indicate that we just want to do standard scaling for the target.
Note that the default loss function for LinearSVR is the epsilon-insensitive (L1) loss, but we could have chosen the squared epsilon-insensitive (L2) loss instead:
svr = LinearSVR(epsilon=0.2, max_iter=10000,
random_state=0)
pipe1 = make_pipeline(coltrans,
KNNImputer(n_neighbors=5), svr)
ttr=TransformedTargetRegressor(regressor=pipe1,
transformer=StandardScaler())
ttr.fit(X_train, y_train)
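As an aside, if we had wanted the L2 loss mentioned earlier, we could have swapped in something like the following when instantiating the estimator (a sketch; every other step stays the same):

# squared_epsilon_insensitive is scikit-learn's L2 loss for LinearSVR
svr_l2 = LinearSVR(epsilon=0.2, loss='squared_epsilon_insensitive',
  max_iter=10000, random_state=0)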
coefs = ttr.regressor_['linearsvr'].coef_
np.column_stack((coefs.ravel(), num_cols + dummy_cols + spec_cols))
array([['-0.03040694175014407', 'fuel_income_dependence'],
['0.10549935644031803', 'national_income_per_cap'],
['0.49519936241642026', 'VAT_Rate'],
['0.0857845735264331', 'gov_debt_per_gdp'],
['0.018198547504343885', 'polity'],
['0.12656984468734492', 'goveffect'],
['-0.09889163752261303', 'democracy_index'],
['-0.036584519840546594', 'democracy_polity'],
['-0.5446613604546718', 'autocracy_polity'],
['0.033234557366924815', 'democracy'],
['-0.2048732386478349', 'nat_oil_comp'],
['-0.6142887840649164', 'nat_oil_comp_state'],
['0.14488410358761755', 'motorization_rate']], dtype='<U32')
Notice that we have not done any feature selection here. Instead, we are relying on SVR's minimization of the weights, which works like regularization, to shrink the less important coefficients toward 0. If we had many more features, or if we were more concerned about computation time, it would be important to think about our feature selection strategy more carefully.
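One quick way to see that shrinkage is to rank the coefficients we just extracted by absolute magnitude (a sketch reusing the coefs array and column lists from above):

# features with the smallest absolute coefficients contribute least to predictions
coefdf = pd.DataFrame({'feature': num_cols + dummy_cols + spec_cols,
  'coef': coefs.ravel()})
print(coefdf.reindex(coefdf.coef.abs().sort_values().index))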
kf = KFold(n_splits=3, shuffle=True, random_state=0)
ttr.fit(X_train, y_train)
scores = cross_validate(ttr, X=X_train, y=y_train,
cv=kf, scoring=('r2', 'neg_mean_absolute_error'),
n_jobs=1)
print("Mean Absolute Error: %.2f, R-squared: %.2f" %
(scores['test_neg_mean_absolute_error'].mean(),
scores['test_r2'].mean()))
Mean Absolute Error: -0.26, R-squared: 0.57
We have not done any hyperparameter tuning yet. We do not know whether our values for epsilon and C are the best ones for our model, so we need to do a grid search to experiment with different hyperparameter values. We will start with an exhaustive grid search, which often is not practical (I recommend not running the next few steps on your machine unless it is a fairly high-performing one). After the exhaustive grid search, we will do a randomized grid search, which is usually substantially easier on system resources.
Remember from the grid search in the previous chapter that the names of the keys in our parameter dictionary must correspond to our pipeline steps. Our pipeline, which in this case is accessed through the transformed target regressor's regressor object, has a linearsvr step with epsilon and C hyperparameters.
We will pass the svr_params dictionary to a GridSearchCV object and indicate that we want scoring to be based on r-squared (we could have based scoring on the mean absolute error instead).
Then, we will run the fit method of the grid search object. I should repeat the warning I mentioned previously that you may not want to run an exhaustive grid search unless you are using a high-performing machine, or you do not mind letting it run while you go get a cup of coffee. Note that it takes about 26 seconds to run each time on my machine:
svr = LinearSVR(max_iter=100000, random_state=0)
pipe1 = make_pipeline(coltrans,
KNNImputer(n_neighbors=5), svr)
ttr=TransformedTargetRegressor(regressor=pipe1,
transformer=StandardScaler())
svr_params = {
'regressor__linearsvr__epsilon': np.arange(0.1, 1.6, 0.1),
'regressor__linearsvr__C': np.arange(0.1, 1.6, 0.1)
}
gs = GridSearchCV(ttr,param_grid=svr_params, cv=3,
scoring='r2')
%timeit gs.fit(X_train, y_train)
26.2 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
gs.best_params_
{'regressor__linearsvr__C': 0.1, 'regressor__linearsvr__epsilon': 0.2}
gs.best_score_
0.599751107082899
It is good to know which values to choose for our hyperparameters. However, the exhaustive grid search was quite expensive computationally. Let’s try a randomized search instead.
svr_params = {
'regressor__linearsvr__epsilon': uniform(loc=0, scale=1.5),
'regressor__linearsvr__C': uniform(loc=0, scale=1.5)
}
rs = RandomizedSearchCV(ttr, svr_params, cv=3, scoring='r2')
%timeit rs.fit(X_train, y_train)
1.21 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
rs.best_params_
{'regressor__linearsvr__C': 0.23062453444814285,
'regressor__linearsvr__epsilon': 0.6976844872643301}
rs.best_score_
0.5785452537781279
pred = rs.predict(X_test)
preddf = pd.DataFrame(pred, columns=['prediction'],
index=X_test.index).join(X_test).join(y_test)
preddf['resid'] = preddf.gas_tax_imp-preddf.prediction
plt.hist(preddf.resid, color="blue", bins=np.arange(-0.5,1.0,0.25))
plt.axvline(preddf.resid.mean(), color='red', linestyle='dashed', linewidth=1)
plt.title("Histogram of Residuals for Gas Tax Model")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.xlim()
plt.show()
This produces the following plot:
Here, there is a little bit of bias (some overpredicting overall) and some positive skew.
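To put rough numbers on that impression, we could check the mean and skewness of the residuals directly (a quick sketch using the preddf DataFrame from above):

from scipy.stats import skew
# a negative mean residual indicates overprediction on average,
# and a positive skew statistic indicates the longer right tail
print(preddf.resid.mean(), skew(preddf.resid))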
plt.scatter(preddf.prediction, preddf.resid, color="blue")
plt.axhline(0, color='red', linestyle='dashed', linewidth=1)
plt.title("Scatterplot of Predictions and Residuals")
plt.xlabel("Predicted Gas Tax")
plt.ylabel("Residuals")
plt.show()
This produces the following plot:
These residuals are problematic. We are always overpredicting (predicted values are higher than actual values) at the lower and upper range of the predicted values. This is not what we want and is perhaps warning us of an unaccounted-for nonlinear relationship.
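One way to check this pattern numerically rather than just visually is to look at the mean residual within bands of the predicted values (a quick sketch):

# negative mean residuals (overprediction) in the lowest and highest bands
# would support what the scatterplot shows
print(preddf.groupby(pd.qcut(preddf.prediction, 4)).resid.mean())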
When a linear function of our features fits the target well, linear SVR can be an efficient choice. It can be used in many of the same situations where we would have used linear regression or linear regression with regularization. Its relative efficiency means we are not as concerned about using it with datasets that contain more than 10,000 observations as we are with nonlinear SVR. However, when a linear model will not do, we should explore nonlinear models.
Recall from our discussion at the beginning of this chapter that we can use a kernel function to fit a nonlinear epsilon-insensitive tube. In this section, we will run a nonlinear SVR with the land temperatures data that we worked with in the previous chapter. But first, we will construct a linear SVR with the same data for comparison.
We will model the average temperature for weather stations as a function of latitude and elevation. Follow these steps:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR, SVR
from scipy.stats import uniform
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import RandomizedSearchCV
import sklearn.metrics as skmet
import matplotlib.pyplot as plt
import os
import sys
sys.path.append(os.getcwd() + "/helperfunctions")
from preprocfunc import OutlierTrans
landtemps = pd.read_csv("data/landtempsb2019avgs.csv")
landtemps.set_index('locationid', inplace=True)
feature_cols = ['latabs','elevation']
landtemps[['avgtemp'] + feature_cols].\
  agg(['count','min','median','max']).T
count min median max
avgtemp 12,095 -61 10 34
latabs 12,095 0 41 90
elevation 12,088 -350 271 4,701
X_train, X_test, y_train, y_test =  \
  train_test_split(landtemps[feature_cols],
  landtemps[['avgtemp']], test_size=0.1, random_state=0)
Just as we did in the previous section, we will use a dictionary, svr_params, to indicate that we want to sample values from a uniform distribution for our hyperparameters – that is, epsilon and C. We will pass this dictionary to the RandomizedSearchCV object.
After running fit, we can get the best parameters for epsilon and C, and the mean absolute error for the best model. The mean absolute error is fairly decent at about 2.8 degrees:
svr = LinearSVR(epsilon=1.0, max_iter=100000)
knnimp = KNNImputer(n_neighbors=45)
pipe1 = make_pipeline(OutlierTrans(3), knnimp, StandardScaler(), svr)
ttr=TransformedTargetRegressor(regressor=pipe1,
transformer=StandardScaler())
svr_params = {
'regressor__linearsvr__epsilon': uniform(loc=0, scale=1.5),
'regressor__linearsvr__C': uniform(loc=0, scale=20)
}
rs = RandomizedSearchCV(ttr, svr_params, cv=10, scoring='neg_mean_absolute_error')
rs.fit(X_train, y_train)
rs.best_params_
{'regressor__linearsvr__C': 15.07662849482442,
'regressor__linearsvr__epsilon': 0.06750238486004034}
rs.best_score_
-2.769283402595076
pred = rs.predict(X_test)
preddf = pd.DataFrame(pred, columns=['prediction'],
index=X_test.index).join(X_test).join(y_test)
preddf['resid'] = preddf.avgtemp-preddf.prediction
plt.scatter(preddf.prediction, preddf.resid, color="blue")
plt.axhline(0, color='red', linestyle='dashed', linewidth=1)
plt.title("Scatterplot of Predictions and Residuals")
plt.xlabel("Predicted Gas Tax")
plt.ylabel("Residuals")
plt.show()
This produces the following plot:
There is a good amount of overpredicting at the upper range of the predicted values. Just below that, for predicted temperatures between roughly 15 and 25 degrees, we typically underpredict. Perhaps we can improve the fit with a nonlinear model.
svr = SVR(kernel='rbf')
pipe1 = make_pipeline(OutlierTrans(3), knnimp, StandardScaler(), svr)
ttr=TransformedTargetRegressor(regressor=pipe1,
transformer=StandardScaler())
svr_params = {
'regressor__svr__epsilon': uniform(loc=0, scale=5),
'regressor__svr__C': uniform(loc=0, scale=20),
'regressor__svr__gamma': uniform(loc=0, scale=100)
}
rs = RandomizedSearchCV(ttr, svr_params, cv=10, scoring='neg_mean_absolute_error')
rs.fit(X_train, y_train)
rs.best_params_
{'regressor__svr__C': 5.3715128489311255,
'regressor__svr__epsilon': 0.03997496426101643,
'regressor__svr__gamma': 53.867632383007994}
rs.best_score_
-2.1319240416548775
There is a noticeable improvement in terms of the mean absolute error. Here, we can see that the gamma and C hyperparameters are doing a fair bit of work for us. If we are okay being about 2 degrees off on average, this model gets us there.
We go into much more detail regarding the gamma and C hyperparameters in Chapter 13, Support Vector Machine Classification. We also explore other kernels besides the rbf kernel.
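For reference, switching kernels only requires changing the kernel argument when instantiating SVR; for example, a polynomial kernel could be specified like this (a sketch, not tuned for this data):

# a polynomial-kernel variant; its degree, C, and epsilon would still
# need to be tuned just as we tuned the rbf model above
svr_poly = SVR(kernel='poly', degree=3)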
pred = rs.predict(X_test)
preddf = pd.DataFrame(pred, columns=['prediction'],
index=X_test.index).join(X_test).join(y_test)
preddf['resid'] = preddf.avgtemp-preddf.prediction
plt.scatter(preddf.prediction, preddf.resid, color="blue")
plt.axhline(0, color='red', linestyle='dashed', linewidth=1)
plt.title("Scatterplot of Predictions and Residuals")
plt.xlabel("Predicted Gas Tax")
plt.ylabel("Residuals")
plt.show()
This produces the following plot:
These residuals look substantially better than those for the linear model.
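To attach numbers to that comparison, we could also compute test-set metrics with the skmet alias we imported earlier (a quick check):

# mean absolute error and r-squared on the held-out test data
print(skmet.mean_absolute_error(y_test, pred),
  skmet.r2_score(y_test, pred))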
This illustrates how using a kernel function can increase the complexity of our model without us having to increase the feature space. By using the rbf kernel and adjusting the C and gamma hyperparameters, we address some of the underfitting we saw with the linear model. This is one of the great advantages of nonlinear SVR. The disadvantage, as we also saw, was that it was quite taxing on system resources. A dataset that contains 12,000 observations is at the upper limit of what can be handled easily with nonlinear SVR, particularly with a grid search for the best hyperparameters.
The examples in this chapter illustrated some of the advantages of SVR. The algorithm allows us to adjust hyperparameters to address underfitting or overfitting. This can be done without increasing the number of features. SVR is also less sensitive to outliers than methods such as linear regression.
When we can build a good model with linear SVR, it is a perfectly reasonable choice. It can be trained much faster than a nonlinear model. However, we can often improve performance with a nonlinear SVR, as we saw in the last section of this chapter.
This discussion leads us to what we will explore in the next chapter, where we will look at two popular non-parametric regression algorithms: k-nearest neighbors and decision tree regression. These two algorithms make almost no assumptions about the distribution of our features and targets. Similar to SVR, they can capture complicated relationships in the data without increasing the feature space.