3
By the end of this chapter, you will be able to:
- Understand the mathematical fundamentals of regression
- Perform linear regression with one and with multiple variables
- Perform polynomial regression and Support Vector Regression
- Load, prepare, and scale data, and evaluate regression models
This chapter covers the fundamentals of linear and polynomial regression.
Regression is a broad topic that connects mathematical statistics, data science, machine learning, and artificial intelligence. As the basics of regression are rooted in mathematics, we will start by exploring the mathematical fundamentals.
Most of this topic will deal with different forms of linear regression, including linear regression with one variable, linear regression with multiple variables, polynomial regression with one variable, and polynomial regression with multiple variables. Python provides a lot of support for performing regression operations.
We will also explore alternative regression models, comparing and contrasting Support Vector Regression with the various forms of linear regression. Throughout this chapter, we will use stock price data loaded from an online service provider. The models in this chapter are not intended to provide trading or investment advice; they are simply an exciting and interesting way to explain the fundamentals of regression.
A general regression problem can be defined as follows. Suppose we have a set of data points. We need to figure out a best fit curve that approximately fits the given data points. This curve describes the relationship between our input variable x, the coordinates of the data points, and the output variable y, the values we want to predict.
In real life, we often have multiple input variables determining one output variable. Regression helps us understand how the output variable changes when we keep all but one input variable fixed, and we change the remaining input variable.
In this chapter, we will work with regression on the two-dimensional plane. This means that our data points are two-dimensional, and we are looking for a curve to approximate how to calculate one variable from another.
We will learn about the following types of regression:
- Linear regression with one variable
- Linear regression with multiple variables
- Polynomial regression with one variable
- Polynomial regression with multiple variables
- Support Vector Regression
In this topic, we will deal with the first type of linear regression: we will use one variable, and the polynomial of the regression will describe a straight line.
On the two-dimensional plane, we will use the Descartes coordinate system, more commonly known as the Cartesian coordinate system. We have an X and a Y axis, and the intersection of these two axes is the origin. We denote points by their X and Y coordinates.
For instance, the point (2, 1) corresponds to the orange point on the following coordinate system:
A straight line can be described with the equation y = a*x + b, where a is the slope of the line, determining how steeply the line climbs, and b is a constant determining where the line intersects the Y axis.
In the following diagram, you can see three equations:
You can see that all three equations intersect the y-axis at 1, and their slope is determined by the factor by which we multiply x.
If you know x, you can substitute it into the equation to figure out y. If you know y, you can rearrange the equation to figure out x.
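As a minimal sketch, the following snippet computes both directions for an example line; the slope 2 and intercept 1 are chosen purely for illustration:
# Example line y = 2*x + 1 (the values of a and b are illustrative only)
a, b = 2, 1

def y_from_x(x):
    return a * x + b

def x_from_y(y):
    return (y - b) / a

print(y_from_x(3))  # 7
print(x_from_y(7))  # 3.0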
We can describe more complex curves with equations too:
If you would like to experiment more with the Cartesian coordinate system, you can use the following plotter: https://s3-us-west-2.amazonaws.com/oerfiles/College+Algebra/calculator.html.
In machine learning, we differentiate between features and labels. Features are considered our input variables, and labels are our output variables.
When talking about regression, the possible values of the labels form a continuous range of numbers.
Think about features as values on the X-axis, and labels as the value on the Y-axis.
The task of regression is to predict label values based on feature values. We often create a label by shifting the values of a feature forward. For instance, if we would like to predict stock prices 1 month ahead, we can create the label by shifting the stock price feature 1 month into the future. We must then drop the last month of data, because we have no label values for it.
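As a minimal sketch of this shifting step, assuming made-up monthly prices and the pandas library, we could write the following:
import pandas as pd

# Toy monthly prices; the values are made up for illustration
prices = pd.DataFrame({'price': [10, 11, 12, 13, 14, 15]})

# The label is the price one month in the future
prices['label'] = prices['price'].shift(-1)

# The last month has no label (NaN), so we drop it before training
prices.dropna(inplace=True)
print(prices)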
At times, we have multiple features with values within completely different ranges. Imagine comparing micrometers on a map to kilometers in the real world. They won't be easy to handle, because of a difference of nine orders of magnitude.
A less dramatic difference is the difference between imperial and metric data. Pounds and kilograms, and centimeters and inches, don't compare that well.
Therefore, we often scale our features to normalized values that are easier to handle, as we can compare values in this range more easily. We scale training and testing data together. Values are typically scaled into the range [-1, 1].
We will demonstrate two types of scaling: min-max scaling and mean normalization.
Min-max scaling is calculated as follows:
x_scaled[n] = (x[n] - min(x)) / (max(x)-min(x))
Mean normalization is calculated as follows:
avg = sum(x) / len(x)
x_scaled[n] = (x[n] - avg) / (max(x)-min(x))
Here's an example of min-max scaling and mean normalization applied to the first thirteen Fibonacci numbers:
fibonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
# Min-Max scaling:
[(float(i)-min(fibonacci))/(max(fibonacci)-min(fibonacci)) for i in fibonacci]
[0.0,
0.006944444444444444,
0.006944444444444444,
0.013888888888888888,
0.020833333333333332,
0.034722222222222224,
0.05555555555555555,
0.09027777777777778,
0.14583333333333334,
0.2361111111111111,
0.3819444444444444,
0.6180555555555556,
1.0]
# Mean normalization:
avg = sum(fibonacci) / len(fibonacci)
# 28.923076923076923
[(float(i)-avg)/(max(fibonacci)-min(fibonacci)) for i in fibonacci]
[-0.20085470085470086,
-0.19391025641025642,
-0.19391025641025642,
-0.18696581196581197,
-0.18002136752136752,
-0.16613247863247863,
-0.1452991452991453,
-0.11057692307692307,
-0.05502136752136752,
0.035256410256410256,
0.18108974358974358,
0.4172008547008547,
0.7991452991452992]
Scaling adds to the preprocessing time, but it is often a sensible step to include.
In the scikit-learn library, we have access to a function that scales NumPy arrays:
import numpy as np
from sklearn import preprocessing
preprocessing.scale(fibonacci)
array([-0.6925069 , -0.66856384, -0.66856384, -0.64462079, -0.62067773,
-0.57279161, -0.50096244, -0.38124715, -0.18970269, 0.12155706,
0.62436127, 1.43842524, 2.75529341])
The scale method standardizes the data: it subtracts the mean and divides by the standard deviation, which is why these values differ from our mean normalization results. Notice that the result is a NumPy array.
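We can quickly check this behavior by looking at the mean and standard deviation of the scaled array; this is just a sanity-check sketch using the same Fibonacci list:
import numpy as np
from sklearn import preprocessing

fibonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
scaled = preprocessing.scale(fibonacci)

# Standardized values have (approximately) zero mean and unit variance
print(np.mean(scaled))  # close to 0
print(np.std(scaled))   # close to 1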
Cross-validation measures the predictive performance of a statistical model. The better the cross-validation result, the more you can trust that your model can be used to predict the future.
During cross-validation, we test our model's ability to predict the future on real test data. Test data is not used in the prediction process.
Training data is used to construct the model that predicts our results.
Once we load data from a data source, we typically separate data into a larger chunk of training data, and a smaller chunk of test data. This separation shuffles the entries of training and test data randomly. Then, it gives you an array of training features, their corresponding training labels, testing features, and their corresponding testing labels.
We can do the training-testing split using the model_selection library of scikit-learn.
Suppose in our dummy example that we have scaled Fibonacci data and its indices as labels:
features = preprocessing.scale(fibonacci)
label = np.array(range(13))
Let's use 10% of the data as test data.
from sklearn import model_selection
(x_train, x_test, y_train, y_test) = model_selection.train_test_split(
    features, label, test_size=0.1
)
x_train
array([-0.66856384, 0.12155706, -0.18970269, -0.64462079, 1.43842524,
2.75529341, -0.6925069 , -0.38124715, -0.57279161, -0.62067773,
-0.66856384])
x_test
array([-0.50096244, 0.62436127])
y_train
array([1, 9, 8, 3, 11, 12, 0, 7, 5, 4, 2])
y_test
array([6, 10])
With training and testing, if we get the ratios wrong, we run the risk of overfitting or underfitting the model.
Overfitting occurs when we train the model too well, and it fits the training dataset too well. The model will be very accurate on the training data, but it will not be usable in real life, because its accuracy decreases when used on any other data. The model adjusts to the random noise in the training data and assumes patterns on this noise that yield false predictions. Underfitting occurs when the model does not fit the training data well enough to recognize important characteristics of the data. As a result, it cannot make the necessary predictions on new data. One example for this is when we attempt to do linear regression on data that is not linear. For instance, Fibonacci numbers are not linear, therefore, a model on a Fibonacci-like sequence cannot be linear either.
If you remember the Cartesian coordinate system, you know that the horizontal axis is the X axis, and that the vertical axis is the Y axis. Our features are on the X axis, while our labels are on the Y axis. Therefore, we use features and X as synonyms, while labels are often denoted by Y. Therefore, x_test denotes feature test data, x_train denotes feature training data, y_test denotes label test data, and y_train denotes label training data.
We are illustrating the process of regression on a dummy example, where we only have one feature and very limited data.
As we only have one feature, we have to reshape x_train with x_train.reshape(-1, 1) so that it becomes a NumPy array in which each row contains one feature value.
Therefore, before executing the code on fitting the best line, execute the following code:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
# array([a, b, c]).reshape(-1, 1) becomes:
# array([[a], [b], [c]])
Suppose we have train and test data for our features and labels.
We can fit a model on this data for performing prediction. We will now use linear regression for this purpose:
from sklearn import linear_model
linear_regression = linear_model.LinearRegression()
model = linear_regression.fit(x_train, y_train)
model.predict(x_test)
array([4.16199119, 7.54977143])
We can also calculate the score associated with the model:
model.score(x_test, y_test)
-0.17273705326696565
This score is the coefficient of determination (R²) of the model. It represents how well we can predict labels from features.
This number indicates a very bad model. The best possible score is 1.0. A score of 0.0 is achieved if we always predict the mean of the labels, ignoring the features, and the score can even be negative for models that perform worse than that, as happened here. We will omit the mathematical background of this score in this book.
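Although we skip the formal definition, the following minimal sketch shows how this score relates to the prediction errors, assuming the model, x_test, and y_test variables from above:
import numpy as np

predicted = model.predict(x_test)

# Residual sum of squares of our model's predictions
ss_residual = np.sum((y_test - predicted) ** 2)
# Total sum of squares of a constant model that always predicts the mean
ss_total = np.sum((y_test - np.mean(y_test)) ** 2)

r_squared = 1 - ss_residual / ss_total
print(r_squared)  # matches model.score(x_test, y_test)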
Our model does not perform well for two reasons: the scaled Fibonacci data is far from linear, and we trained the model on a very small number of data points.
We will see a lot of more accurate models in the future, and we may even reach model scores of 0.9.
One reason why NumPy arrays are handier than Python lists is that they can be treated as vectors. There are a few operations defined on vectors that can simplify our calculations. We can perform operations on vectors of the same length. The sum and the elementwise product of two vectors is a vector, where each coordinate is the sum or product of the corresponding coordinates.
For example:
import numpy as np
v1 = np.array([1,2,3])
v2 = np.array([2,0,2])
v1 + v2 # array([3, 2, 5])
v1 * v2 # array([2, 0, 6])
The product of a vector and a scalar is a vector, where each coordinate is multiplied by the scalar:
v1 * 2 # array([2, 4, 6])
The second power of a vector equals the elementwise product of the vector with itself. The double asterisk denotes the power operator:
v1 ** 2 # array([1, 4, 9], dtype=int32)
Suppose we have a set of points in the plane. Our job is to find the best fit line.
Let's see two examples.
Our first example contains 13 values that seem linear in nature. We are plotting the following data:
[2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62]
If you wanted to draw a line that is the closest to these dots, your educated guess would be quite close to reality:
Our second example is the first 13 values of the Fibonacci sequence, after scaling. Although we can define a line that fits these points the closest, we can see from the distribution of the points that our model will not be too useful:
We have already learned what the equation of a straight line is: y = a * x + b
In this equation, a is the slope, and b is the y-intercept. To find the line of best fit, we have to find the co-efficients a and b.
Our job is to minimize the sum of distances from the line of best fit.
In this book, we will skip the derivation of the coefficients a and b, because you will find little practical use for it. We will, however, make use of the mean, that is, the arithmetic mean of the values in a list. We can use the mean function provided by NumPy for this.
Let's find the line of best fit for these two examples:
import numpy as np
from numpy import mean
x = np.array(range(1, 14))
y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])
a = (mean(x)*mean(y) - mean(x*y)) / (mean(x) ** 2 - mean( x ** 2 ))
4.857142857142859
b = mean(y) - a*mean(x)
-2.7692307692307843
Once we plot the line y = a*x + b with the preceding coefficients, we get the following graph:
You can find a linear regression calculator at http://www.endmemo.com/statistics/lr.php. You can also check the calculator to get an idea of what lines of best fit look like on a given dataset.
Regarding the scaled Fibonacci values, the line of best fit looks as follows:
The best fit line of the second dataset is clearly off, and it deviates even more outside the fitted interval.
We don't have to use this method to perform linear regression. Many libraries, including scikit-learn, will help us automate this process. Once we perform linear regression with multiple variables, we are better off using a library to perform the regression for us.
NumPy Polyfit can also be used to create a line of best fit for linear regression with one variable.
Recall the calculation for the line of best fit:
import numpy as np
from numpy import mean
x = np.array(range(1, 14))
y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])
a = (mean(x)*mean(y) - mean(x*y)) / (mean(x) ** 2 - mean( x ** 2 ))
4.857142857142859
b = mean(y) - a*mean(x)
-2.7692307692307843
The equation for finding the coefficients a and b is quite long. Fortunately, numpy.polyfit performs these calculations to find the coefficients of the line of best fit. The polyfit function accepts three arguments: the array of x values, the array of y values, and the degree of polynomial to look for. As we are looking for a straight line, the highest power of x is 1 in the polynomial:
import numpy as np
x = np.array(range(1, 14))
y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])
[a, b] = np.polyfit(x, y, 1)
[4.857142857142858, -2.769230769230769]
Plotting the Results in Python
Suppose you have a set of data points and a regression line. Our task is to plot the points and the line together so that we can see the results with our own eyes.
We will use the matplotlib.pyplot library for this. This library has two important functions: scatter, which plots individual data points, and plot, which, given arrays of x values and y values and an optional format string, draws points and line segments.
A segment is defined by two points. As x ranges between 1 and 14, it makes sense to display a segment between 0 and 15. We must substitute the value of x in the equation a*x+b to get the corresponding y values:
import numpy as np
from numpy import mean
import matplotlib.pyplot as plot
x = np.array(range(1, 14))
y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])
a = (mean(x)*mean(y) - mean(x*y)) / (mean(x) ** 2 - mean(x ** 2))
4.857142857142859
b = mean(y) - a*mean(x)
-2.7692307692307843
# Plotting the points
plot.scatter(x, y)
# Plotting the line
plot.plot([0, 15], [b, 15*a+b])
The output is as follows:
You might have to call plot.show() to display the preceding graph. In the IPython console, the coordinate system shows up automatically.
The segment and the scattered data points are displayed as expected.
Plot has an advanced signature. You can use one call of plot to draw scattered dots, lines, and curves on the same diagram. The arguments are interpreted in groups of three: an array of x values, an array of y values, and a format string.
Let's create a function for deriving an array of approximated y values from an array of x values:
def fitY(arr):
    return [4.857142857142859 * x - 2.7692307692307843 for x in arr]
We will use the fitY function to plot the values:
plot.plot(
x, y, 'go',
x, fitY(x), 'r--o'
)
The output is as follows:
The Python plotter library offers a simple solution for most of your graphing problems. You can draw as many lines, dots, and curves as you want on this graph.
Every third argument is responsible for formatting. The letter g stands for green, while the letter r stands for red. You could have used b for blue, y for yellow, and so on. In the absence of a color, each triple will be displayed using a different color. The o character indicates that we want to display a dot at each data point. Therefore, 'go' has nothing to do with movement; it requests the plotter to plot green dots. The '--' characters are responsible for displaying a dashed line. If you use just one minus sign, a solid line appears instead of the dashed line.
If we simplify this formatting, we can specify that we only want dots of an arbitrary color, and straight lines of another arbitrary color. By doing this, we can simply write the following plot call:
plot.plot(
x, y, 'o',
x, fitY(x), '-'
)
The output is as follows:
When displaying curves, the plotter connects the dots with segments. Also, keep in mind that even a complex sequence of curves is an approximation that connects the dots. For instance, if you execute the code from https://gist.github.com/traeblain/1487795, you will recognize the segments of the batman function as connected lines:
There is a wide variety of ways to plot curves. We have seen that the polyfit method of the NumPy library returns an array of coefficients to describe a linear equation:
import numpy as np
x = np.array(range(1, 14))
y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])
np.polyfit(x, y, 1)
# array([ 4.85714286, -2.76923077])
This array describes the equation 4.85714286 * x - 2.76923077.
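As a small sketch, we can feed the coefficients returned by polyfit directly into poly1d to obtain a callable function for the fitted line and evaluate it at any point:
import numpy as np

x = np.array(range(1, 14))
y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])

# poly1d turns the coefficient array into a function we can call
line = np.poly1d(np.polyfit(x, y, 1))
print(line(20))  # approximately 4.857 * 20 - 2.769 = 94.37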
Suppose we now want to plot a curve, y = -x**2 + 3*x - 2. This quadratic equation is described by the coefficient array [-1, 3, -2]. We could write our own function to calculate the y values belonging to x values. However, the NumPy library already has a feature to do this work for us: np.poly1d:
import numpy as np
x = np.array(range(-10, 10))
f = np.poly1d([-1,3,-2])
The f function that's created by the poly1d call not only works with single values, but also with lists or NumPy arrays:
f(5)
# -12
f(x)
# array([-132, -110, -90, -72, -56, -42, -30, -20, -12, -6, -2, 0, 0, -2, -6, -12, -20, -30, -42, -56])
We can now use these values to plot a non-linear curve:
import matplotlib.pyplot as plot
plot.plot(x, f(x))
The output is as follows:
Suppose we are interested in the y value belonging to the x coordinate 20. Based on the linear regression model, all we need to do is substitute the value of 20 in the place of x:
# Plotting the points
plot.scatter(x, y)
# Plotting the prediction belonging to x = 20
plot.scatter(20, a * 20 + b, color='red')
# Plotting the line
plot.plot([0, 25], [b, 25*a+b])
The output is as follows:
Here, we denoted the predicted value with red. This red point is on the best fit line.
You are working at the government office of Metropolis, trying to forecast the need for elementary school capacity. Your task is to figure out a 2025 and 2030 prediction for the number of children starting elementary school. The past data is as follows:
Plot these tendencies on a two-dimensional chart. To do this, you must:
The solution to this activity is available at page 269.
In the previous topic, we dealt with linear regression with one variable. Now we will learn an extended version of linear regression, where we will use multiple input variables to predict the output.
We will rely on examples where we will load and predict stock prices. Therefore, we will experiment with the main libraries used for loading stock prices.
If you recall the formula for the line of best fit in linear regression, it was defined as y = a*x + b, where a is the slope of the line, b is the y-intercept of the line, x is the feature value, and y is the calculated label value.
In multiple regression, we have multiple features and one label. Assuming that we have three features, x1, x2, and x3, our model changes as follows:
y = a1 * x1 + a2 * x2 + a3 * x3 + b
In NumPy array format, we can write this equation as follows:
y = np.dot(np.array([a1, a2, a3]), np.array([x1, x2, x3])) + b
For convenience, it makes sense to define the whole equation in vector multiplication form, where the intercept b is multiplied by a constant feature of 1:
y = np.dot(np.array([b, a1, a2, a3]), np.array([1, x1, x2, x3]))
Multiple linear regression is a simple scalar product of two vectors, where the coefficients b, a1, a2, and a3 determine the best fit equation in four-dimensional space.
To understand the formula of multiple linear regression, you will need the scalar product of two vectors. As the other name for a scalar product is dot product, the NumPy function performing this operation is called dot:
import numpy as np
v1 = [1, 2, 3]
v2 = [4, 5, 6]
np.dot(v1, v2)  # 1 * 4 + 2 * 5 + 3 * 6 = 32
We simply sum the product of each respective coordinate.
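As a minimal sketch, the same dot product lets us predict labels for several samples at once by stacking the feature vectors into a matrix; the coefficient values below are made up for illustration:
import numpy as np

# Made-up coefficients for the three features, plus the intercept b
coefficients = np.array([0.5, 2.0, -1.0])
b = 3.0

# Each row is one sample with features x1, x2, x3
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

# One dot product per row gives the predicted label of each sample
y = np.dot(X, coefficients) + b
print(y)  # [4.5 9. ]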
We can determine these coefficients by minimizing the error between data points and the nearest points described by the equation. For simplicity, we will omit the mathematical solution of the best fit equation, and use scikit-learn instead.
In n-dimensional spaces, where n is greater than 3, the number of dimensions determines the different variables that are in our model. In the preceding example, we have three features and one label. This yields four dimensions. If you want to imagine a four-dimensional space, you can imagine three-dimensional space and time for simplification. A five-dimensional space can be imagined as a four-dimensional space, where each point in time has a temperature. Dimensions are just features (and labels); they do not necessarily correlate with our concept of three-dimensional space.
We will follow these simple steps to solve linear regression problems:
1. Load the data from a data source.
2. Prepare the data for prediction: select the features and labels, scale the data, and split it into training and testing sets.
3. Fit a linear regression model on the training data.
4. Evaluate the model on the test data and use it to predict future values.
There are multiple libraries that can provide us with access to data sources. As we will be working with stock data, let's cover two examples that are geared toward retrieving financial data, Quandl and Yahoo Finance:
The process of loading stock data with Yahoo Finance is straightforward. All you need to do is install the fix_yahoo_finance package using the following command in the CLI:
pip install fix_yahoo_finance
We will download a dataset that has open, high, low, close, adjusted close, and volume values of the S&P 500 index, starting from 2015:
import fix_yahoo_finance as yahoo
spx_data_frame = yahoo.download("^GSPC", "2015-01-01")
That's all you need to do. The data frame containing the S&P 500 index is ready.
You can plot the index prices with the plot method:
spx_data_frame.Close.plot()
The output is as follows:
It is also possible to save data to a CSV file using the following command:
spx_data_frame.to_csv("spx.csv")
Suppose a CSV file containing stock data is given. We will now use pandas to load data from this file:
import pandas as pd
spx_second_frame = pd.read_csv("spx.csv", index_col="Date", header=0, parse_dates=True)
To properly parse the data, we must set the index column to the Date column, specify that the first row contains the headers, and make sure that dates are parsed as dates.
Quandl.com is a reliable source of financial and economic datasets.
pip install quandl
import quandl
data_frame = quandl.get("YALE/SPCOMP")
data_frame.head()
S&P Composite Dividend Earnings CPI Long Interest Rate
Year
1871-01-31 4.44 0.26 0.4 12.464061 5.320000
1871-02-28 4.50 0.26 0.4 12.844641 5.323333
1871-03-31 4.61 0.26 0.4 13.034972 5.326667
1871-04-30 4.74 0.26 0.4 12.559226 5.330000
1871-05-31 4.86 0.26 0.4 12.273812 5.333333
Real Price Real Dividend Real Earnings
Year
1871-01-31 89.900119 5.264421 8.099110
1871-02-28 88.415295 5.108439 7.859137
1871-03-31 89.254001 5.033848 7.744382
1871-04-30 95.247222 5.224531 8.037740
1871-05-31 99.929493 5.346022 8.224650
Cyclically Adjusted PE Ratio
Year
1871-01-31 NaN
1871-02-28 NaN
1871-03-31 NaN
1871-04-30 NaN
1871-05-31 NaN
...
2016-02-29 24.002607
2016-03-31 25.372299
Before we perform regression, we must choose the features we are interested in, and we also have to figure out the data range on which we do the regression.
Preparing the data for prediction is the second step in the regression process. This step has several sub-steps: selecting the features we keep, replacing missing (NaN) values, creating the label by shifting the feature we want to predict, dropping the rows for which no label exists, and scaling the features. We will go through these sub-steps in the following order.
A few features highly correlate to each other. For instance, the Real Dividend column proportionally grows with Real Price. The ratio between them is not always similar, but they do correlate.
As regression is not about detecting correlation between features, we would rather get rid of a few attributes we know are redundant and perform regression on the features that are non-correlated.
If you have gone through the Loading stock prices with Quandl section, you already have a data frame containing historical data on the S&P 500 Index. The columns we are most interested in are Long Interest Rate, Real Price, Real Dividend, and Cyclically Adjusted PE Ratio:
data_frame[['Long Interest Rate', 'Real Price', 'Real Dividend', 'Cyclically Adjusted PE Ratio']]
As you cannot work with NaN data, you have to deal with the missing values. In general, you have two choices: drop the rows containing NaN values, or replace them with a number that is well outside the normal range of the data. Here, we replace the NaN values with -100:
data_frame.fillna(-100, inplace=True)
We can check the length of the data frame by using the len function, as shown in the following code:
len(data_frame)
1771
The length of our data frame is 1771.
If we want to predict the Real Price for the upcoming 20 years, we will have to predict 240 values. This is approximately 15% of the length of the data frame, which makes perfect sense.
We will therefore create a Real Price Label by shifting the Real Price values up by 240 units:
data_frame['Real Price Label'] = data_frame['Real Price'].shift( -240 )
This way, each Real Price Label value will be the Real Price value in 20 years.
The side effect of shifting these values is that NaN values appear in the last 240 values:
data_frame.tail()
The output is as follows:
S&P Composite Dividend Earnings CPI Long Interest Rate
Year
2018-03-31 2702.77 50.00 NaN 249.5540 2.840
2018-04-30 2653.63 50.33 NaN 250.5460 2.870
2018-05-31 2701.49 50.66 NaN 251.5880 2.976
2018-06-30 2754.35 50.99 NaN 252.1090 2.910
2018-07-31 2736.61 NaN NaN 252.3695 2.830
Real Price Real Dividend Real Earnings
Year
2018-03-31 2733.262995 50.564106 NaN
2018-04-30 2672.943397 50.696307 NaN
2018-05-31 2709.881555 50.817364 NaN
2018-06-30 2757.196024 51.042687 NaN
2018-07-31 2736.610000 NaN NaN
Cyclically Adjusted PE Ratio Real Price Label
Year
2018-03-31 31.988336 NaN
2018-04-30 31.238428 NaN
2018-05-31 31.612305 NaN
2018-06-30 32.091415 NaN
2018-07-31 31.765318 NaN
We can get rid of them by executing dropna on the data frame:
data_frame.dropna(inplace=True)
This way, we have data up to July 1998, and we have the future values up to 2018 in the Real Price Label column:
data_frame.tail()
The output is as follows:
S&P Composite Dividend Earnings CPI Long Interest Rate
Year
1998-03-31 1076.83 15.6400 39.5400 162.2 5.65
1998-04-30 1112.20 15.7500 39.3500 162.5 5.64
1998-05-31 1108.42 15.8500 39.1600 162.8 5.65
1998-06-30 1108.39 15.9500 38.9700 163.0 5.50
1998-07-31 1156.58 16.0167 38.6767 163.2 5.46
Real Price Real Dividend Real Earnings
Year
1998-03-31 1675.456527 24.334519 61.520900
1998-04-30 1727.294510 24.460428 61.112245
1998-05-31 1718.251850 24.570372 60.705096
1998-06-30 1716.097117 24.695052 60.336438
1998-07-31 1788.514193 24.767932 59.808943
Cyclically Adjusted PE Ratio Real Price Label
Year
1998-03-31 36.296928 2733.262995
1998-04-30 37.276934 2672.943397
1998-05-31 36.956599 2709.881555
1998-06-30 36.802293 2757.196024
1998-07-31 38.259645 2736.610000
Let's prepare our features and labels for regression.
For the features, we will use the drop method of the data frame. The drop method returns a new data frame that doesn't contain the column that was dropped:
import numpy as np
features = np.array(data_frame.drop('Real Price Label', 1))
label = np.array(data_frame['Real Price Label'])
The 1 in the second argument specifies that we are dropping columns. As the original data frame was not modified, the label can be directly extracted from it.
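Note that passing 1 as a positional argument is an older pandas convention; in more recent pandas versions, the equivalent and more explicit call is the following sketch:
# Equivalent, more explicit column drop in newer pandas versions
features = np.array(data_frame.drop(columns=['Real Price Label']))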
It is now time to scale the features with the preprocessing module of Scikit Learn:
from sklearn import preprocessing
scaled_features = preprocessing.scale(features)
features
array([[6.19000000e+00, 2.65000000e-01, 4.85800000e-01, ...,
7.10000389e+00, 1.30157807e+01, 1.84739523e+01],
[6.17000000e+00, 2.70000000e-01, 4.81700000e-01, ...,
7.16161179e+00, 1.27768459e+01, 1.81472582e+01],
[6.24000000e+00, 2.75000000e-01, 4.77500000e-01, ...,
7.29423423e+00, 1.26654431e+01, 1.82701191e+01],
...,
[1.10842000e+03, 1.58500000e+01, 3.91600000e+01, ...,
2.45703721e+01, 6.07050959e+01, 3.69565985e+01],
[1.10839000e+03, 1.59500000e+01, 3.89700000e+01, ...,
2.46950523e+01, 6.03364381e+01, 3.68022935e+01],
[1.15658000e+03, 1.60167000e+01, 3.86767000e+01, ...,
2.47679324e+01, 5.98089427e+01, 3.82596451e+01]])
scaled_features
array([[-0.47564285, -0.62408514, -0.57496262, ..., -1.23976862,
-0.84099698, 0.6398416 ],
[-0.47577027, -0.62273749, -0.5754623 , ..., -1.22764677,
-0.85903686, 0.57633607],
[-0.47532429, -0.62138984, -0.57597417, ..., -1.20155224,
-0.86744792, 0.60021881],
...,
[ 6.54690076, 3.57654404, 4.13838295, ..., 2.19766676,
2.75960615, 4.23265262],
[ 6.54670962, 3.60349707, 4.11522706, ..., 2.22219859,
2.73177202, 4.20265751],
[ 6.85373845, 3.62147473, 4.07948167, ..., 2.23653834,
2.69194545, 4.48594968]])
As you can see, the scaled features are easier to read and interpret. While scaling data, we must scale all data together, including training and testing data.
Now that scaling is done, our next task is to separate the training and testing data from each other. We will be using 90% of the data as training data, and the rest (10%) will be used as test data:
from sklearn import model_selection
(features_train, features_test, label_train, label_test) = model_selection.train_test_split(
    scaled_features, label, test_size=0.1
)
The train_test_split function shuffles the lines of our data, keeping the correspondence, and puts approximately 10% of all data in the test variables, keeping 90% for the training variables. This will help us evaluate how good our model is.
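If you want the shuffled split to be reproducible between runs while experimenting, train_test_split also accepts a random_state parameter; here is a minimal sketch assuming the scaled_features and label arrays from above:
from sklearn import model_selection

# Fixing random_state makes the random shuffling reproducible
(features_train, features_test, label_train, label_test) = model_selection.train_test_split(
    scaled_features, label, test_size=0.1, random_state=42
)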
We can now create the linear regression model based on the training data:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(features_train, label_train)
Once the model is ready, we can use it to predict the labels belonging to the test feature values:
label_predicted = model.predict(features_test)
If you are interested in the relationship between the predicted feature values and the accurate test feature values, you can plot them using a Python two-dimensional graph plotter utility:
from matplotlib import pyplot as plot
plot.scatter(label_test, label_predicted)
This gives you an image of a graph where the test data is compared to the results of the prediction. The closer these values are to the y = x line, the better.
You can see from the following graph that the predictions do center around the y=x line with a degree of error. This error is obvious, as otherwise, we would be able to make a lot of money with such a simple prediction, and everyone would pursue predicting stock prices instead of working in their own field of expertise:
We can conclude that there is a degree of error in the model. The question is, how can we quantify this error? The answer is simple: we can score the model using the built-in utility, which calculates the R² score of the model:
model.score(features_test, label_test)
0.9051697119010782
We can conclude that the model is quite accurate. This is not a surprise, because financial advisors are fond of telling us that the market grows at around 6-7% a year, and the model essentially predicts that the markets will continue growing at a steady rate. Concluding that markets tend to go up in the long run is not rocket science.
We have already used prediction on test data. Now, it's time to use the actual data to see into the future.
import quandl
import numpy as np
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import linear_model
data_frame = quandl.get("YALE/SPCOMP")
data_frame[['Long Interest Rate', 'Real Price', 'Real Dividend', 'Cyclically Adjusted PE Ratio']]
data_frame.fillna(-100, inplace=True)
data_frame['Real Price Label'] = data_frame['Real Price'].shift(-240)
data_frame.dropna(inplace=True)
features = np.array(data_frame.drop('Real Price Label', 1))
label = np.array(data_frame['Real Price Label'])
scaled_features = preprocessing.scale(features)
(features_train, features_test, label_train, label_test) = model_selection.train_test_split(
    scaled_features, label, test_size=0.1
)
model = linear_model.LinearRegression()
model.fit(features_train, label_train)
label_predicted = model.predict(features_test)
The trick to predicting the future is that we have to save the feature values belonging to the rows we dropped when building the model. We built our stock price model on historical data, excluding the most recent 20 years of feature values, because we have no labels for them yet. We have to keep these rows, and we also have to include them when scaling the data:
import quandl
import numpy as np
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import linear_model
data_frame = quandl.get("YALE/SPCOMP")
data_frame[['Long Interest Rate', 'Real Price', 'Real Dividend', 'Cyclically Adjusted PE Ratio']]
data_frame.fillna(-100, inplace=True)
# We shift the price data to be predicted 20 years forward
data_frame['Real Price Label'] = data_frame['Real Price'].shift(-240)
# Then exclude the label column from the features
features = np.array(data_frame.drop('Real Price Label', 1))
# We scale before dropping the last 240 rows from the
# features
scaled_features = preprocessing.scale(features)
# Save the last 240 rows before dropping them
scaled_features_latest_240 = scaled_features[-240:]
# Exclude the last 240 rows from the data used for model
# building
scaled_features = scaled_features[:-240]
# Now we can drop the last 240 rows from the data frame
data_frame.dropna(inplace=True)
# Then build the labels from the remaining data
label = np.array(data_frame['Real Price Label'])
# The rest of the model building stays
(features_train, features_test, label_train, label_test) = model_selection.train_test_split(
    scaled_features, label, test_size=0.1
)
model = linear_model.LinearRegression()
model.fit(features_train, label_train)
Now that we have access to the scaled feature values from the last 20 years, we can predict the index prices of the next 20 years using the following code:
label_predicted = model.predict(scaled_features_latest_240)
This sounds great in theory, but in practice, using this model for making money by betting on the forecast is by no means better than gambling in a casino. This is just an example model to illustrate prediction; it is definitely not sufficient to be used for short-term or long-term speculation on market prices.
If you look at the values, you can see why this prediction may easily backfire. First, there are a few negative values, which are impossible for an index. Then, due to a few major market crashes in the data, linear regression forecasts a doomsday scenario at a point in the future where the index drops from more than 3,000 to literally zero within a year. Linear regression is not a perfect tool for looking ahead 20 years based on limited data. Also, note that stock prices are far from being time-invariant: patterns in the past do not imply similar patterns in the future.
Let's plot the prediction for the full 20-year horizon:
from matplotlib import pyplot as plot
plot.plot(list(range(1,241)), label_predicted[:240])
The output is as follows:
The graph is hard to read near the end due to extreme values. Let's draw our conclusions by omitting the last five years and just plotting the first 180 months out of the predictions:
plot.plot(list(range(1,181)), label_predicted[:180])
The output is as follows:
This is a scary future for the American economy. According to this model, the S&P 500 has a bull run for about 2.5-3 years, then crashes and does not recover for a long time. Also, notice that our model does not know that index values cannot be negative.
When performing polynomial regression, the relationship between x and y, or using their other names, features and labels, is not a linear equation, but a polynomial equation. This means that instead of the y = a*x+b equation, we can have multiple coefficients and multiple powers of x in the equation.
To make matters even more complicated, we can perform polynomial regression using multiple variables, where each feature may have coefficients multiplying different powers of the feature.
Our task is to find a curve that best fits our dataset. Once polynomial regression is extended to multiple variables, we will learn the Support Vector Machines model to perform polynomial regression.
As a recap, we have performed two types of regression so far:
We will now learn how to do polynomial linear regression with one variable. The equation for polynomial linear regression is as follows:
y = b + a1*x + a2*(x ** 2) + a3*(x ** 3) + … + an * (x ** n)
Here, we have a vector of coefficients (b, a1, a2, …, an) multiplying a vector of powers of x in the polynomial, (1, x**1, x**2, …, x**n).
At times, polynomial regression works better than linear regression. If the relationship between labels and features can be described using a linear equation, then using a linear equation makes perfect sense. If we have non-linear growth, polynomial regression tends to approximate the relationship between features and labels better.
The simplest implementation of linear regression with one variable was the polyfit method of the NumPy library. In the next exercise, we will perform polynomial linear regression with degrees of 1, 2, and 3.
Even though our polynomial regression has an equation containing coefficients of x ** n, this equation is still referred to as polynomial linear regression in the literature. Regression is made linear not because we restrict the usage of higher powers of x in the equation, but because the coefficients a1, a2, …, and so on are linear in the equation. This means that we use the toolset of linear algebra, and work with matrices and vectors to find the missing coefficients that minimize the error of the approximation.
Perform a 1st, 2nd, and 3rd degree polynomial regression on the following two datasets:
Declare the two datasets
import numpy as np
from matplotlib import pyplot as plot
# First dataset:
x1 = np.array(range(1, 14))
y1 = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])
# Second dataset:
x2 = np.array(range(1, 14))
y2 = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144])
Then plot your results on the graph:
Let's start with plotting the first example:
import matplotlib.pyplot as plot
deg1 = np.polyfit(x1, y1, 1)
# array([ 4.85714286, -2.76923077])
f1 = np.poly1d(deg1)
deg2 = np.polyfit(x1, y1, 2)
# array([-0.03196803, 5.3046953 , -3.88811189])
f2 = np.poly1d(deg2)
deg3 = np.polyfit(x1, y1, 3)
# array([-0.01136364, 0.20666833, 3.91833167, -1.97902098])
f3 = np.poly1d(deg3)
plot.plot(
x1, y1, 'o',
x1, f1(x1),
x1, f2(x1),
x1, f3(x1)
)
plot.show()
The output is as follows:
As the coefficients are enumerated from left to right in order of decreasing degree, we can see that the higher degree coefficients stay close to negligible. In other words, the three curves are almost on top of each other, and we can only detect a divergence near the right edge. This is because we are working on a dataset that can be very well approximated with a linear model.
In fact, the first dataset was created out of a linear function. Any nonzero coefficients for x**2 and x**3 are the results of overfitting the model based on the available data. The linear model is better for predicting values outside the range of the training data than any higher degree polynomial.
Let's contrast this behavior with the second example. We know that the Fibonacci sequence is non-linear, so using a linear equation to approximate it is a clear case of underfitting. Here, we expect a higher degree polynomial to perform better:
deg1 = np.polyfit(x2, y2, 1)
# array([ 9.12087912, -34.92307692])
f1 = np.poly1d(deg1)
deg2 = np.polyfit(x2, y2, 2)
# array([ 1.75024975, -15.38261738, 26.33566434])
f2 = np.poly1d(deg2)
deg3 = np.polyfit(x2, y2, 3)
# array([0.2465035, -3.42632368, 14.69080919,
# -15.07692308])
f3 = np.poly1d(deg3)
plot.plot(
x2, y2, 'o', # blue dots
x2, f1(x2), # orange
x2, f2(x2), # green
x2, f3(x2) # red
)
The output is as follows:
The difference is clear. The quadratic curve fits the points a lot better than the linear one. The cubic curve is even better.
If you research Binet's formula, you will find out that the Fibonacci function is an exponential function, as the xth Fibonacci number is calculated as the xth power of a constant. Therefore, the higher degree polynomial we use, the more accurate our approximation will be.
When we have one variable of degree n, we have n+1 coefficients in the equation:
y = b + a1*x + a2*(x ** 2) + a3*(x ** 3) + … + an * (x ** n)
Once we deal with multiple features x1, x2, …, xm, and their powers of up to the nth degree, we get an m * (n+1) matrix of coefficients. The math will become quite lengthy once we start exploring the details and prove how a polynomial model works. We will also lose the nice visualizations of two-dimensional curves.
Therefore, we will apply what we learned in the previous sections about polynomial regression with one variable and omit the math. When training and testing a linear regression model, we can calculate the score on the test data to see how good an approximation the model is.
In scikit-learn, the degree of the polynomials used in the approximation is a simple parameter in the model.
As polynomial regression is a form of linear regression, we can perform polynomial regression without changing the regression model. All we need to do is transform the input and keep the linear regression model.
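To see what this transformation does, here is a minimal sketch on a tiny, made-up single-feature array:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single feature with three samples (made-up values)
x = np.array([[1], [2], [3]])

# A degree-3 expansion produces the columns 1, x, x**2, x**3 for each sample
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(x))
# [[ 1.  1.  1.  1.]
#  [ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]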
The transformation of the input is performed by the fit_transform method of the PolynomialFeatures package:
import numpy as np
import quandl
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import linear_model
from matplotlib import pyplot as plot
from sklearn.preprocessing import PolynomialFeatures
data_frame = quandl.get("YALE/SPCOMP")
data_frame.fillna(-100, inplace=True)
# We shift the price data to be predicted 20 years forward
data_frame['Real Price Label'] = data_frame['Real Price'].shift(-240)
# Then exclude the label column from the features
features = np.array(data_frame.drop('Real Price Label', 1))
# We scale before dropping the last 240 rows from the features
scaled_features = preprocessing.scale(features)
# Save the last 240 rows before dropping them
scaled_features_latest_240 = scaled_features[-240:]
# Exclude the last 240 rows from the data used for model building
scaled_features = scaled_features[:-240]
# Now we can drop the last 240 rows from the data frame
data_frame.dropna(inplace=True)
# Then build the labels from the remaining data
label = np.array(data_frame['Real Price Label'])
# Create a degree-3 polynomial feature transformer and transform the scaled features
poly_regressor = PolynomialFeatures(degree=3)
poly_scaled_features = poly_regressor.fit_transform(scaled_features)
# Split to training and testing data
(
poly_features_train, poly_features_test,
poly_label_train, poly_label_test
) = model_selection.train_test_split(
poly_scaled_features,
label, test_size=0.1
)
# Apply linear regression
model = linear_model.LinearRegression()
model.fit(poly_features_train, poly_label_train)
# Model score
print('Score: ', model.score(poly_features_test, poly_label_test))
# Prediction
poly_label_predicted = model.predict(poly_features_test)
plot.scatter(poly_label_test, poly_label_predicted)
The model scores too well. Chances are, the polynomial model is overfitting the dataset.
There is another model in scikit-learn that can perform polynomial regression, called the SVM model, which stands for Support Vector Machine.
Support Vector Machines are binary classifiers defined on a vector space. They divide the state space with a surface. An SVM classifier takes classified data and tries to predict where unclassified data belongs. Once the classification of a data point is determined, it gets labeled.
Support Vector Machines can also be used for regression. Instead of labeling data, we can predict future values in a series. The Support Vector Regression model uses the space between our data as a margin of error. Based on the margin of error, it makes predictions regarding future values.
If the margin of error is too small, we risk overfitting the existing dataset. If the margin of error is too big, we risk underfitting the existing dataset.
A kernel describes the surface dividing the state space in the case of a classifier. A kernel is also used to measure the margin of error in the case of a regressor. This kernel can use a linear model, a polynomial model, or many other possible models. The default kernel is RBF, which stands for Radial Basis Function.
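As a minimal sketch, the kernel is selected through the kernel parameter of the regressor:
from sklearn import svm

# The default kernel is the radial basis function
rbf_regressor = svm.SVR()

# A linear kernel and a degree-3 polynomial kernel can be requested explicitly
linear_regressor = svm.SVR(kernel='linear')
poly_kernel_regressor = svm.SVR(kernel='poly', degree=3)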
Support Vector Regression is an advanced topic, which is outside the scope of this book. Therefore, we will only stick to a walkthrough of how easy it is to try out another regression model on our test data.
Suppose we have our features and labels in two separate NumPy arrays. Let's recall how we performed linear regression on them:
from sklearn import model_selection
from sklearn import linear_model
(features_train, features_test, label_train,
label_test) = model_selection.train_test_split(scaled_features, label, test_size=0.1)
model = linear_model.LinearRegression()
model.fit(features_train, label_train)
We can perform regression with Support Vector Machines by changing the linear model to a support vector model:
from sklearn import model_selection
from sklearn import svm
from matplotlib import pyplot as plot
# The rest of the model building stays the same
(features_train, features_test, label_train,
label_test) = model_selection.train_test_split(scaled_features, label, test_size=0.1)
model = svm.SVR()
model.fit(features_train, label_train)
label_predicted = model.predict(features_test)
print('Score: ', model.score(features_test, label_test))
plot.plot(label_test, label_predicted, 'o')
The plot output is as follows:
The score output is as follows:
-0.19365084431020874
The model score is quite low, and the points don't align along the y=x line. Prediction accuracy with the default parameters is quite poor.
The output of the model describes the parameters of the Support Vector Machine:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
We could fiddle with these parameters to try to increase the accuracy of the predictions.
Let's switch the kernel of the Support Vector Machine to poly. The default degree of the polynomial is 3:
model = svm.SVR(kernel='poly')
model.fit(features_train, label_train)
label_predicted = model.predict(features_test)
plot.plot(label_test, label_predicted, 'o')
The output is as follows:
model.score(features_test, label_test)
The output is as follows:
0.06388628722032952
With Support Vector Machines, we often end up with points concentrated in small areas. We could change the margin of error to separate the points a bit more.
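A minimal sketch of such an experiment, assuming the training and test arrays from above; the C and epsilon values are arbitrary starting points rather than tuned recommendations:
from sklearn import svm

# epsilon controls the width of the error-insensitive margin,
# and C controls how strongly errors outside the margin are penalized
model = svm.SVR(kernel='poly', degree=3, C=10.0, epsilon=0.5)
model.fit(features_train, label_train)
print(model.score(features_test, label_test))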
In this section, we will discuss how to perform linear, polynomial, and support vector regression with scikit-learn, and how to find the best fitting model for a given task. We will assume that you are a software engineer at a financial institution and that your employer wants to know whether linear regression or support vector regression is a better fit for predicting stock prices. You will have to load all of the S&P 500 data from a data source, separate the training and test data, and then build three regressors: a linear regression, a cubic polynomial linear regression, and a support vector regression with a polynomial kernel of degree 3. Plot the test labels and the prediction results and compare them with the y=x line. Finally, compare how well the three models score.
The closer the dots are to the y=x line, the less error the model works with.
Perform a multiple linear regression with quadratic polynomials. The only change is in the linear regression model.
The model does not look efficient at all. For some reason, this model clearly prefers lower values for the S&P 500 that are completely unrealistic, assuming that the stock market does not lose 80% of its value within a day.
The solution to this activity is available at page 271.
In this chapter, we have learned the fundamentals of Linear Regression.
After going through some basic mathematics, we dived into the mathematics of linear regression using one variable and multiple variables.
We dealt with challenges that occur in regression, such as loading data from external sources like a .csv file, Yahoo Finance, or Quandl. After loading the data, we learned how to identify the features and labels, how to scale data, and how to format data to perform regression.
We learned how to train and test a linear regression engine, and how to predict the future. Our results were visualized by an easy-to-use Python graph plotting library called pyplot.
A more complex form of linear regression is a linear polynomial regression of arbitrary degree. We learned how to define these regression problems on multiple variables. We compared their performance to each other on stock price prediction problems. As an alternative to polynomial regression, we also introduced Support Vector Machines as a regression model and experimented with two kernels.
You will soon learn about another field inside machine learning. The setup and code structure of this machine learning method will be very similar to regression, while the problem domain is somewhat different. In the next chapter, you will learn the ins and outs of classification.