Regression

Learning Objectives

By the end of this chapter, you will be able to:

  • Describe the mathematical logic involved in regression
  • Illustrate the use of the NumPy library for regression
  • Identify linear regression with one variable and with multiple variables
  • Use polynomial regression

This chapter covers the fundamentals of linear and polynomial regression.

Introduction

Regression is a broad topic that connects mathematical statistics, data science, machine learning, and artificial intelligence. As the basics of regression are rooted in mathematics, we will start by exploring the mathematical fundamentals.

Most of this topic will deal with different forms of linear regression, including linear regression with one variable, linear regression with multiple variables, polynomial regression with one variable, and polynomial regression with multiple variables. Python provides a lot of support for performing regression operations.

We will also explore an alternative regression model, comparing and contrasting support vector regression with the various forms of linear regression. Throughout this chapter, we will use stock price data loaded from an online service provider. The models in this chapter are not intended to provide trading or investment advice.

Note

Although the models in this chapter should not be used to provide trading or investment advice, working with this data makes for an exciting and interesting journey through the fundamentals of regression.

Linear Regression with One Variable

A general regression problem can be defined as follows. Suppose we have a set of data points. We need to figure out a best fit curve that approximately fits the given data points. This curve describes the relationship between our input variable x and our output variable y.

In real life, we often have multiple input variables determining one output variable. Regression helps us understand how the output variable changes when we keep all but one input variable fixed, and we change the remaining input variable.

What Is Regression?

In this chapter, we will work with regression on the two-dimensional plane. This means that our data points are two-dimensional, and we are looking for a curve to approximate how to calculate one variable from another.

We will learn about the following types of regression:

  • Linear regression with one variable using a polynomial of degree 1: This is the most basic form of regression, where a straight line approximates the trajectory of future datasets.
  • Linear regression with multiple variables using a polynomial of degree 1: We will be using equations of degree 1, but we will now allow multiple input variables, also known as features.
  • Polynomial regression with one variable: This is a generic form of linear regression of one variable. As the polynomial used to approximate the relationship between the input and the output is of an arbitrary degree, we can create curves that fit the data points better than a straight line. The regression is still linear – not because the polynomial is linear, but because the regression problem can be modeled using linear algebra.
  • Polynomial regression with multiple variables: This is the most generic regression problem using higher degree polynomials and multiple features to predict the future.
  • Support vector regression: This form of regression uses support vector machines to predict data points. This type of regression is included to compare its usage to the other four regression types.

In this topic, we will deal with the first type of linear regression: we will use one variable, and the polynomial of the regression will describe a straight line.

On the two-dimensional plane, we will use the coordinate system named after René Descartes, more commonly known as the Cartesian coordinate system. We have an X and a Y axis, and the intersection of these two axes is the origin. We denote points by their X and Y coordinates.

For instance, the point (2, 1) corresponds to the orange point on the following coordinate system:

Figure 3.1: Representation of point (2,1) on the coordinate system

A straight line can be described with the equation y = a*x + b, where a is the slope, determining how steeply the line climbs, and b is a constant determining where the line intersects the Y axis.

In the following diagram, you can see three equations:

  • The blue line is described with the y = 2*x + 1 equation.
  • The orange line is described with the y = x + 1 equation.
  • The purple line is described with the y = 0.5*x + 1 equation.

You can see that all three equations intersect the y-axis at 1, and their slope is determined by the factor by which we multiply x.

If you know x, you can figure out y. If you know y, you can figure out x:

Figure 3.2: Representation of the equations y = 2*x + 1, y = x + 1, and y = 0.5*x + 1 on the coordinate system

We can describe more complex curves with equations too:

Figure 3.3: Image showing a complex curve

Note

If you would like to experiment more with the Cartesian coordinate system, you can use the following plotter: https://s3-us-west-2.amazonaws.com/oerfiles/College+Algebra/calculator.html.

Features and Labels

In machine learning, we differentiate between features and labels. Features are considered our input variables, and labels are our output variables.

When talking about regression, the possible values of the labels form a continuous range of numbers.

Think about features as values on the X-axis, and labels as the value on the Y-axis.

The task of regression is to predict label values based on feature values. We often create a label by shifting values of a feature forward. For instance, if we would like to predict stock prices in 1 month, and we create the label by shifting the stock price feature 1 month to the future, then:

  • For each stock price value that is at least 1 month old, the corresponding label (the price 1 month later) is available, so it can be used as training data
  • For the last month of feature values, no future price is available yet, so the label values are all NaN (not a number)

We must drop these last rows, because we cannot use them for training.
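As a minimal sketch of this idea (using a hypothetical pandas data frame df with a Price column of monthly rows and a forecast horizon of one row):

import pandas as pd

# Hypothetical monthly prices
df = pd.DataFrame({'Price': [10.0, 11.0, 12.0, 13.0, 14.0]})
# The label is the price one month (one row) in the future
df['Price Label'] = df['Price'].shift(-1)
# The last row has no future price, so its label is NaN; drop it
df.dropna(inplace=True)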

Feature Scaling

At times, we have multiple features that may have values within completely different ranges. Imagine comparing micrometers on a map to kilometers in the real world. These values won't be easy to handle together because of the difference of nine orders of magnitude.

A less dramatic difference is the difference between imperial and metric data. Pounds and kilograms, and centimeters and inches, don't compare that well.

Therefore, we often scale our features to normalized values that are easier to handle and compare. We scale training and testing data together. The scaled values typically fall within the range [-1, 1].

We will demonstrate two types of scaling:

  • Min-Max normalization
  • Mean normalization

Min-max scaling is calculated as follows:

x_scaled[n] = (x[n] - min(x)) / (max(x)-min(x))

Mean normalization is calculated as follows:

avg = sum(x) / len(x)

x_scaled[n] = (x[n] - avg) / (max(x)-min(x))

Here's an example of min-max scaling and mean normalization applied to the first 13 Fibonacci numbers:

fibonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

# Min-Max scaling:

[(float(i)-min(fibonacci))/(max(fibonacci)-min(fibonacci)) for i in fibonacci]

[0.0, 0.006944444444444444, 0.006944444444444444, 0.013888888888888888,
 0.020833333333333332, 0.034722222222222224, 0.05555555555555555,
 0.09027777777777778, 0.14583333333333334, 0.2361111111111111,
 0.3819444444444444, 0.6180555555555556, 1.0]

# Mean normalization:

avg = sum(fibonacci) / len(fibonacci)

# 28.923076923076923

[(float(i)-avg)/(max(fibonacci)-min(fibonacci)) for i in fibonacci]

[-0.20085470085470086, -0.19391025641025642, -0.19391025641025642,
 -0.18696581196581197, -0.18002136752136752, -0.16613247863247863,
 -0.1452991452991453, -0.11057692307692307, -0.05502136752136752,
 0.035256410256410256, 0.18108974358974358, 0.4172008547008547,
 0.7991452991452992]

Scaling adds to the processing time, but it is often a sensible preprocessing step.

In the scikit-learn library, we have access to a function that scales NumPy arrays:

import numpy as np

from sklearn import preprocessing

preprocessing.scale(fibonacci)

array([-0.6925069 , -0.66856384, -0.66856384, -0.64462079, -0.62067773,

       -0.57279161, -0.50096244, -0.38124715, -0.18970269, 0.12155706,

        0.62436127, 1.43842524, 2.75529341])

The scale function performs standardization: it subtracts the mean and divides by the standard deviation, so the result has zero mean and unit variance. Note that this is not the same as the mean normalization formula shown earlier. Notice that the result is a NumPy array.
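To see what scale actually does, here is a minimal sketch that reproduces the same numbers by hand (standardization with the population standard deviation):

import numpy as np

fib = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144], dtype=float)
standardized = (fib - fib.mean()) / fib.std()  # same values as preprocessing.scale(fib)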

Cross-Validation with Training and Test Data

Cross-validation measures the predictive performance of a statistical model. The better the cross-validation result, the more you can trust that your model can be used to predict the future.

During cross-validation, we test our model's ability to predict the future on real test data. Test data is not used in the prediction process.

Training data is used to construct the model that predicts our results.

Once we load data from a data source, we typically separate data into a larger chunk of training data, and a smaller chunk of test data. This separation shuffles the entries of training and test data randomly. Then, it gives you an array of training features, their corresponding training labels, testing features, and their corresponding testing labels.

We can do the training-testing split using the model_selection library of scikit-learn.

Suppose in our dummy example that we have scaled Fibonacci data and its indices as labels:

features = preprocessing.scale(fibonacci)

label = np.array(range(13))

Let's use 10% of the data as test data.

from sklearn import model_selection

(x_train, x_test, y_train, y_test) = model_selection.train_test_split(

    features, label, test_size=0.1

)

x_train

array([-0.66856384, 0.12155706, -0.18970269, -0.64462079, 1.43842524,

        2.75529341, -0.6925069 , -0.38124715, -0.57279161, -0.62067773,

       -0.66856384])

x_test

array([-0.50096244, 0.62436127])

y_train

array([1, 9, 8, 3, 11, 12, 0, 7, 5, 4, 2])

y_test

array([6, 10])

If we get the training-testing ratio wrong, we run the risk of overfitting or underfitting the model.

Overfitting occurs when we train the model too well and it fits the training dataset too closely. The model will be very accurate on the training data, but it will not be usable in real life, because its accuracy decreases when it is used on any other data. The model adjusts to the random noise in the training data and assumes patterns in this noise that yield false predictions.

Underfitting occurs when the model does not fit the training data well enough to recognize the important characteristics of the data. As a result, it cannot make the necessary predictions on new data. One example of this is when we attempt to do linear regression on data that is not linear. For instance, Fibonacci numbers are not linear; therefore, a model on a Fibonacci-like sequence cannot be linear either.

Note

If you remember the Cartesian coordinate system, you know that the horizontal axis is the X axis, and that the vertical axis is the Y axis. Our features are on the X axis, while our labels are on the Y axis. Therefore, we use features and X as synonyms, while labels are often denoted by Y. Therefore, x_test denotes feature test data, x_train denotes feature training data, y_test denotes label test data, and y_train denotes label training data.

Fitting a Model on Data with scikit-learn

We are illustrating the process of regression on a dummy example, where we only have one feature and very limited data.

As we only have one feature, we have to reshape x_train with x_train.reshape(-1, 1) into a two-dimensional NumPy array in which each row is one sample containing a single feature.

Therefore, before executing the code on fitting the best line, execute the following code:

x_train = x_train.reshape(-1, 1)

x_test = x_test.reshape(-1, 1)

# array([a, b, c]).reshape(-1, 1) becomes:

# array([[a], [b], [c]])

Suppose we have train and test data for our features and labels.

We can fit a model on this data for performing prediction. We will now use linear regression for this purpose:

from sklearn import linear_model

linear_regression = linear_model.LinearRegression()

model = linear_regression.fit(x_train, y_train)

model.predict(x_test)

array([4.16199119, 7.54977143])

We can also calculate the score associated with the model:

model.score(x_test, y_test)

-0.17273705326696565

This score is the coefficient of determination (R²) and represents the accuracy of the model: it measures how well we can predict the labels from the features.

This number indicates a very bad model. The best possible score is 1.0. A score of 0.0 corresponds to a model that constantly predicts the mean of the labels while ignoring the features; negative scores are even worse than that. We will omit the full mathematical background of this score in this book.
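If you are curious, the score can nevertheless be reproduced in a few lines (a sketch; model, x_test, and y_test are the objects defined above):

import numpy as np

label_predicted = model.predict(x_test)
ss_residual = np.sum((y_test - label_predicted) ** 2)   # squared prediction errors
ss_total = np.sum((y_test - np.mean(y_test)) ** 2)      # squared errors of a constant mean prediction
r2 = 1 - ss_residual / ss_total                         # equals model.score(x_test, y_test)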

Our model does not perform well for two reasons:

  • 11 training points and 2 test points are simply not enough to perform proper predictive analysis.
  • Even if we ignore the number of points, the Fibonacci x -> y function does not describe a linear relationship between x and y. Approximating a non-linear function with a line is only useful if we are very close to the training interval.

We will see many more accurate models later, and we may even reach model scores of 0.9.

Linear Regression Using NumPy Arrays

One reason why NumPy arrays are handier than Python lists is that they can be treated as vectors. There are a few operations defined on vectors that can simplify our calculations. We can perform operations on vectors of the same length. The sum and the (elementwise) product of two vectors is a vector, where each coordinate is the sum or product of the corresponding coordinates.

For example:

import numpy as np

v1 = np.array([1,2,3])

v2 = np.array([2,0,2])

v1 + v2 # array([3, 2, 5])

v1 * v2 # array([2, 0, 6])

The product of a vector and a scalar is a vector, where each coordinate is multiplied by the scalar:

v1 * 2 # array([2, 4, 6])

The second power of a vector equals the elementwise product of the vector with itself. The double asterisk denotes the power operator:

v1 ** 2 # array([1, 4, 9], dtype=int32)

Suppose we have a set of points in the plane. Our job is to find the best fit line.

Let's see two examples.

Our first example contains 13 values that seem linear in nature. We are plotting the following data:

[2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62]

If you wanted to draw a line that is the closest to these dots, your educated guess would be quite close to reality:

Figure 3.4: Plotted graph of values [2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62]

Our second example is the first 13 values of the Fibonacci sequence, after scaling. Although we can define a line that fits these points the closest, we can see from the distribution of the points that our model will not be too useful:

Figure 3.5: Plotted graph of Fibonacci values

We have already learned what the equation of a straight line is: y = a * x + b

In this equation, a is the slope, and b is the y-intercept. To find the line of best fit, we have to find the coefficients a and b.

Our job is to minimize the sum of the squared distances between the data points and the line of best fit.

In this book, we will skip the derivation of the formulas for the coefficients a and b, because you will find little practical use for it. Instead, we will use the closed-form expressions below, which only require the mean (the arithmetic average) of the values in a list. We can use the mean function provided by NumPy for this.

Let's find the line of best fit for these two examples:

import numpy as np

from numpy import mean

x = np.array(range(1, 14))

y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])

a = (mean(x)*mean(y) - mean(x*y)) / (mean(x) ** 2 - mean( x ** 2 ))

4.857142857142859

b = mean(y) - a*mean(x)

-2.7692307692307843

Once we plot the line y = a*x + b with the preceding coefficients, we get the following graph:

Figure 3.6: Plotted graph of array values [2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62] and the line y=a*x+b

Note

You can find a linear regression calculator at http://www.endmemo.com/statistics/lr.php. You can also check the calculator to get an idea of what lines of best fit look like on a given dataset.

Regarding the scaled Fibonacci values, the line of best fit looks as follows:

Figure 3.7: Plotted graph showing Fibonacci values and the line y=a*x+b

The best fit line of the second dataset clearly deviates from the data points, and it becomes even less useful outside the training interval.

Note

We don't have to use this method to perform linear regression. Many libraries, including scikit-learn, will help us automate this process. Once we perform linear regression with multiple variables, we are better off using a library to perform the regression for us.

Fitting a Model Using NumPy Polyfit

NumPy Polyfit can also be used to create a line of best fit for linear regression with one variable.

Recall the calculation for the line of best fit:

import numpy as np

from numpy import mean

x = np.array(range(1, 14))

y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])

a = (mean(x)*mean(y) - mean(x*y)) / (mean(x) ** 2 - mean( x ** 2 ))

4.857142857142859

b = mean(y) - a*mean(x)

-2.7692307692307843

The equation for finding the coefficients a and b is quite long. Fortunately, numpy.polyfit performs these calculations to find the coefficients of the line of best fit. The polyfit function accepts three arguments: the array of x values, the array of y values, and the degree of polynomial to look for. As we are looking for a straight line, the highest power of x is 1 in the polynomial:

import numpy as np

x = np.array(range(1, 14))

y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])

[a, b] = np.polyfit(x, y, 1)

[4.857142857142858, -2.769230769230769]

Plotting the Results in Python

Suppose you have a set of data points and a regression line. Our task is to plot the points and the line together so that we can see the results with our own eyes.

We will use the matplotlib.pyplot library for this. This library has two important functions:

  • Scatter: This displays scattered points on the plane, defined by a list of x coordinates and a list of y coordinates.
  • Plot: Along with two arguments, this function plots a segment defined by two points, or a sequence of segments defined by multiple points. Plot is like scatter, except that instead of displaying the points, they are connected by lines.

A plot call with three arguments plots a segment or a set of points, formatted according to the third argument.

A segment is defined by two points. As x ranges between 1 and 13, it makes sense to display a segment between 0 and 15. We must substitute the values of x in the equation a*x+b to get the corresponding y values:

import numpy as np

from numpy import mean

import matplotlib.pyplot as plot

x = np.array(range(1, 14))

y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])

a = (mean(x)*mean(y) - mean(x*y)) / (mean(x) ** 2 - mean(x ** 2))

4.857142857142859

b = mean(y) - a*mean(x)

-2.7692307692307843

# Plotting the points

plot.scatter(x, y)

# Plotting the line

plot.plot([0, 15], [b, 15*a+b])

The output is as follows:

Figure 3.8: Graph displaying how data points fit a regression line

You might have to call plot.show() to display the preceding graph. In the IPython or Jupyter console, the plot may show up automatically.

The segment and the scattered data points are displayed as expected.

The plot function has a more advanced signature. You can use a single plot call to draw scattered dots, lines, and curves on the same diagram. The arguments are interpreted in groups of three:

  • X values
  • Y values
  • Formatting options in the form of a string

Let's create a function for deriving an array of approximated y values from an array of approximated x values:

def fitY( arr ):

    return [4.857142857142859 * x - 2.7692307692307843 for x in arr]

We will use this function to plot the fitted values:

plot.plot(

    x, y, 'go',

    x, fitY(x), 'r--o'

)

The output is as follows:

Figure 3.9: Graph for the plot function using the fit function

The Python plotter library offers a simple solution for most of your graphing problems. You can draw as many lines, dots, and curves as you want on this graph.

Every third argument is responsible for formatting. The letter g stands for green, while the letter r stands for red. You could have used b for blue, y for yellow, and so on. In the absence of a color, each triple will be displayed using a different color. The o character indicates that we want to display a dot at each data point. Therefore, 'go' has nothing to do with movement – it requests the plotter to plot green dots. The '--' characters are responsible for displaying a dashed line. If you use just one minus sign, a solid line appears instead of a dashed one.

If we simplify this formatting, we can specify that we only want dots of an arbitrary color, and straight lines of another arbitrary color. By doing this, we can simply write the following plot call:

plot.plot(

    x, y, 'o',

    x, fitY(x), '-'

)

The output is as follows:

Figure 3.10: Graph for plot function with dashed line

When displaying curves, the plotter connects the dots with segments. Also, keep in mind that even a complex sequence of curves is an approximation that connects the dots. For instance, if you execute the code from https://gist.github.com/traeblain/1487795, you will recognize the segments of the batman function as connected lines:

Figure 3.11: Graph for the batman function

There is a wide variety of ways to plot curves. We have seen that the polyfit method of the NumPy library returns an array of coefficients to describe a linear equation:

import numpy as np

x = np.array(range(1, 14))

y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])

np.polyfit(x, y, 1)

# array([ 4.85714286, -2.76923077])

This array describes the equation 4.85714286 * x - 2.76923077.

Suppose we now want to plot a curve, y = -x**2 + 3*x - 2. This quadratic equation is described by the coefficient array [-1, 3, -2]. We could write our own function to calculate the y values belonging to x values. However, the NumPy library already has a feature to do this work for us: np.poly1d:

import numpy as np

x = np.array(range(-10, 10))

f = np.poly1d([-1,3,-2])

The f function that's created by the poly1d call not only works with single values, but also with lists or NumPy arrays:

f(5)

# -12

f(x)

# array([-132, -110, -90, -72, -56, -42, -30, -20, -12, -6, -2, 0, 0, -2, -6, -12, -20, -30, -42, -56])

We can now use these values to plot a non-linear curve:

import matplotlib.pyplot as plot

plot.plot(x, f(x))

The output is as follows:

Figure 3.12: Graph for pyplot function

Predicting Values with Linear Regression

Suppose we are interested in the y value belonging to the x coordinate 20. Based on the linear regression model, all we need to do is substitute the value of 20 in the place of x:

# Plotting the points

plot.scatter(x, y)

# Plotting the prediction belonging to x = 20

plot.scatter(20, a * 20 + b, color='red')

# Plotting the line

plot.plot([0, 25], [b, 25*a+b])

The output is as follows:

Figure 3.13: Graph showing the predicted value using Linear Regression

Here, we denoted the predicted value with red. This red point is on the best fit line.

Activity 5: Predicting Population

You are working at the government office of Metropolis, trying to forecast the need for elementary school capacity. Your task is to figure out a 2025 and 2030 prediction for the number of children starting elementary school. The past data is as follows:

Figure 3.14: A table representing number of kids starting elementary school from 2001 to 2018

Plot these tendencies on a two-dimensional chart. To do this, you must:

  1. Use linear regression. Features are the years ranging from 2001 to 2018. For simplicity, we can indicate 2001 as year 1, and 2018 as year 18.
  2. Use np.polyfit to determine the coefficients of the regression line.
  3. Plot the results using matplotlib.pyplot to determine future tendencies.

    Note

    The solution to this activity is available at page 269.

Linear Regression with Multiple Variables

In the previous topic, we dealt with linear regression with one variable. Now we will learn an extended version of linear regression, where we will use multiple input variables to predict the output.

We will rely on examples where we will load and predict stock prices. Therefore, we will experiment with the main libraries used for loading stock prices.

Multiple Linear Regression

If you recall the formula for the line of best fit in linear regression, it was defined as y = a*x + b, where a is the slope of the line, b is the y-intercept of the line, x is the feature value, and y is the calculated label value.

In multiple regression, we have multiple features and one label. Assuming that we have three features, x1, x2, and x3, our model changes as follows:

y = a1 * x1 + a2 * x2 + a3 * x3 + b

In NumPy array format, we can write this equation as follows:

y = np.dot(np.array([a1, a2, a3]), np.array([x1, x2, x3])) + b

For convenience, it makes sense to define the whole equation in a vector multiplication form. The coefficient of b is going to be 1:

y = np.dot(np.array([b, a1, a2, a3]), np.array([1, x1, x2, x3]))

Multiple linear regression is a simple scalar product of two vectors, where the coefficients b, a1, a2, and a3 determine the best fit equation in four-dimensional space.

To understand the formula of multiple linear regression, you will need the scalar product of two vectors. As the other name for a scalar product is dot product, the NumPy function performing this operation is called dot:

import numpy as np

v1 = [1, 2, 3]

v2 = [4, 5, 6]

np.dot(v1, v2)  # 1 * 4 + 2 * 5 + 3 * 6 = 32

We simply sum the product of each respective coordinate.
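Putting this together, one prediction of the multiple regression model is a single dot product (a sketch with made-up coefficient and feature values):

import numpy as np

b, a1, a2, a3 = 0.5, 1.0, 2.0, 3.0                 # hypothetical coefficients
x1, x2, x3 = 4.0, 5.0, 6.0                         # hypothetical feature values
coefficients = np.array([b, a1, a2, a3])
augmented_features = np.array([1.0, x1, x2, x3])   # the leading 1 pairs with b
y = np.dot(coefficients, augmented_features)       # 0.5 + 4.0 + 10.0 + 18.0 = 32.5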

We can determine these coefficients by minimizing the error between data points and the nearest points described by the equation. For simplicity, we will omit the mathematical solution of the best fit equation, and use scikit-learn instead.

Note

In n-dimensional spaces, where n is greater than 3, the number of dimensions determines the different variables that are in our model. In the preceding example, we have three features and one label. This yields four dimensions. If you want to imagine a four-dimensional space, you can imagine three-dimensional space and time for simplification. A five-dimensional space can be imagined as a four-dimensional space, where each point in time has a temperature. Dimensions are just features (and labels); they do not necessarily correlate with our concept of three-dimensional space.

The Process of Linear Regression

We will follow the following simple steps to solve Linear Regression problems:

  1. Load data from data sources.
  2. Prepare data for prediction (normalize, format, filter).
  3. Compute the parameters of the regression line. Regardless of whether we use linear regression with one variable or with multiple variables, we will follow these steps.

Importing Data from Data Sources

There are multiple libraries that can provide us with access to data sources. As we will be working with stock data, let's look at a few options, focusing on two that are geared toward retrieving financial data: Quandl and Yahoo Finance.

  • scikit-learn comes with a few datasets that can be used for practicing your skills.
  • Quandl.com provides you with free and paid financial datasets.
  • pandas.io helps you load CSV, Excel, JSON, or SQL data.
  • Yahoo Finance provides you with financial datasets.

Loading Stock Prices with Yahoo Finance

The process of loading stock data with Yahoo Finance is straightforward. All you need to do is install the fix_yahoo_finance package using the following command in the CLI:

pip install fix_yahoo_finance

We will download a dataset that has open, high, low, close, adjusted close, and volume values of the S&P 500 index, starting from 2015:

import fix_yahoo_finance as yahoo

spx_data_frame = yahoo.download("^GSPC", "2015-01-01")

That's all you need to do. The data frame containing the S&P 500 index is ready.

You can plot the index prices with the plot method:

spx_data_frame.Close.plot()

The output is as follows:

Figure 3.15: Graph showing stock prices for Yahoo Finance

It is also possible to save data to a CSV file using the following command:

spx_data_frame.to_csv("spx.csv")

Loading Files with pandas

Suppose a CSV file containing stock data is given. We will now use pandas to load data from this file:

import pandas as pd

spx_second_frame = pd.read_csv("spx.csv", index_col="Date", header=0, parse_dates=True)

To properly parse the data, we must set the index column to Date, specify that the header is in the first row, and make sure that dates are parsed as dates.

Loading Stock Prices with Quandl

Quandl.com is a reliable source of financial and economic datasets.

Exercise 8: Using Quandl to Load Stock Prices

  1. Open the Anaconda prompt and install Quandl using the following command:

    pip install quandl

  2. Go to https://www.quandl.com/.
  3. Click on the Financial data.
  4. Among the filters, click the checkbox next to the Free label. If you have a Quandl subscription, you can use that to download stock data.
  5. Select a stock or index you would like to use. For our example, we will use the S&P Composite index data that was collected by the Yale Department of Economics. The link for this is https://www.quandl.com/data/YALE/SPCOMP-S-P-Composite.
  6. Find the Quandl ticker belonging to the instrument you would like to load. Our Quandl code for the S&P Composite data is "YALE/SPCOMP".
  7. Load the data from the Jupyter QtConsole:

    import quandl

    data_frame = quandl.get("YALE/SPCOMP")

  8. All the columns of the imported values are features:

    data_frame.head()

  9. The output is as follows:

                S&P Composite Dividend Earnings        CPI Long Interest Rate

    Year                                                                        

    1871-01-31         4.44     0.26     0.4 12.464061            5.320000

    1871-02-28         4.50     0.26     0.4 12.844641            5.323333

    1871-03-31         4.61     0.26     0.4 13.034972            5.326667

    1871-04-30         4.74     0.26     0.4 12.559226            5.330000

    1871-05-31         4.86     0.26     0.4 12.273812            5.333333

                Real Price Real Dividend Real Earnings

    Year                                                

    1871-01-31 89.900119     5.264421     8.099110

    1871-02-28 88.415295     5.108439     7.859137

    1871-03-31 89.254001     5.033848     7.744382

    1871-04-30 95.247222     5.224531     8.037740

    1871-05-31 99.929493     5.346022     8.224650

                Cyclically Adjusted PE Ratio

    Year                                    

    1871-01-31                         NaN

    1871-02-28                         NaN

    1871-03-31                         NaN

    1871-04-30                         NaN

    1871-05-31                         NaN

                                     ...

    2016-02-29                     24.002607

    2016-03-31                     25.372299

Preparing Data for Prediction

Before we perform regression, we must choose the features we are interested in, and we also have to figure out the data range on which we do the regression.

Preparing the data for prediction is the second step in the regression process. This step also has several sub-steps. We will go through these sub-steps in the following order:

  1. Suppose a data frame is given with preloaded data.
  2. Select the columns from the dataset you are interested in.
  3. Replace NaN values with a numeric value to avoid getting rid of data.
  4. Determine the forecast interval T, determining the amount of time or number of data rows you wish to look into in the future.
  5. Create a label column out of the value you wish to forecast. For row i of the data frame, the value of the label should belong to the time instant, i+T.
  6. For the last T rows, the label value is NaN. Drop these rows from the data frame.
  7. Create NumPy arrays from the features and the label.
  8. Scale the features array.
  9. Separate the training and testing data.

A few features highly correlate with each other. For instance, the Real Dividend column grows roughly proportionally with the Real Price column. The ratio between them is not constant, but they do correlate.

As regression is not about detecting correlation between features, we would rather get rid of a few attributes we know are redundant and perform regression on the features that are non-correlated.

If you have gone through the Loading Stock Prices with Quandl section, you already have a data frame containing historical data on the S&P Composite index. We will take a look at the Long Interest Rate, Real Price, Real Dividend, and Cyclically Adjusted PE Ratio columns:

data_frame[['Long Interest Rate', 'Real Price', 'Real Dividend', 'Cyclically Adjusted PE Ratio']]

As you cannot work with NaN data, you can replace it by filling in numbers in place of NaNs. In general, you have two choices:

  • Get rid of the data
  • Replace the data with a default value that makes sense

data_frame.fillna(-100, inplace=True)
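If you prefer the first option instead, dropping the incomplete rows is equally short (shown only as a sketch, since it removes data points):

data_frame.dropna(inplace=True)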

We can check the length of the data frame by using the len function, as shown in the following code:

len(data_frame)

1771

The length of our data frame is 1771.

If we want to predict the Real Price 20 years into the future, we will have to predict 240 monthly values. This is approximately 15% of the length of the data frame, which is a reasonable forecast horizon.

We will therefore create a Real Price Label by shifting the Real Price values up by 240 units:

data_frame['Real Price Label'] = data_frame['Real Price'].shift( -240 )

This way, each Real Price Label value will be the Real Price value in 20 years.

The side effect of shifting these values is that NaN values appear in the last 240 values:

data_frame.tail()

The output is as follows:

            S&P Composite Dividend Earnings     CPI Long Interest Rate

Year                                                                        

2018-03-31        2702.77     50.00     NaN 249.5540             2.840

2018-04-30        2653.63     50.33     NaN 250.5460             2.870

2018-05-31        2701.49     50.66     NaN 251.5880             2.976

2018-06-30        2754.35     50.99     NaN 252.1090             2.910

2018-07-31        2736.61     NaN     NaN 252.3695             2.830

             Real Price Real Dividend Real Earnings

Year                                                    

2018-03-31 2733.262995     50.564106            NaN

2018-04-30 2672.943397     50.696307            NaN

2018-05-31 2709.881555     50.817364            NaN

2018-06-30 2757.196024     51.042687            NaN

2018-07-31 2736.610000            NaN            NaN

            Cyclically Adjusted PE Ratio Real Price Label

Year                                                        

2018-03-31                     31.988336             NaN

2018-04-30                     31.238428             NaN

2018-05-31                     31.612305             NaN

2018-06-30                     32.091415             NaN

2018-07-31                     31.765318             NaN

We can get rid of them by executing dropna on the data frame:

data_frame.dropna(inplace=True)

This way, we have data up to July 1998, and we have the future values up to 2018 in the Real Price Label column:

data_frame.tail()

The output is as follows:

            S&P Composite Dividend Earnings    CPI Long Interest Rate

Year                                                                    

1998-03-31        1076.83 15.6400 39.5400 162.2                5.65

1998-04-30        1112.20 15.7500 39.3500 162.5                5.64

1998-05-31        1108.42 15.8500 39.1600 162.8                5.65

1998-06-30        1108.39 15.9500 38.9700 163.0                5.50

1998-07-31        1156.58 16.0167 38.6767 163.2                5.46

             Real Price Real Dividend Real Earnings

Year                                                    

1998-03-31 1675.456527     24.334519     61.520900

1998-04-30 1727.294510     24.460428     61.112245

1998-05-31 1718.251850     24.570372     60.705096

1998-06-30 1716.097117     24.695052     60.336438

1998-07-31 1788.514193     24.767932     59.808943

            Cyclically Adjusted PE Ratio Real Price Label

Year                                                        

1998-03-31                     36.296928     2733.262995

1998-04-30                     37.276934     2672.943397

1998-05-31                     36.956599     2709.881555

1998-06-30                     36.802293     2757.196024

1998-07-31                     38.259645     2736.610000

Let's prepare our features and labels for regression.

For the features, we will use the drop method of the data frame. The drop method returns a new data frame that doesn't contain the column that was dropped:

import numpy as np

features = np.array(data_frame.drop('Real Price Label', 1))

label = np.array(data_frame['Real Price Label'])

The 1 in the second argument specifies that we are dropping a column (that is, axis=1). As the original data frame was not modified by drop, the label can still be extracted directly from it.

It is now time to scale the features with the preprocessing module of scikit-learn:

from sklearn import preprocessing

scaled_features = preprocessing.scale(features)

features

array([[6.19000000e+00, 2.65000000e-01, 4.85800000e-01, ...,

        7.10000389e+00, 1.30157807e+01, 1.84739523e+01],

       [6.17000000e+00, 2.70000000e-01, 4.81700000e-01, ...,

        7.16161179e+00, 1.27768459e+01, 1.81472582e+01],

       [6.24000000e+00, 2.75000000e-01, 4.77500000e-01, ...,

        7.29423423e+00, 1.26654431e+01, 1.82701191e+01],

       ...,

       [1.10842000e+03, 1.58500000e+01, 3.91600000e+01, ...,

        2.45703721e+01, 6.07050959e+01, 3.69565985e+01],

       [1.10839000e+03, 1.59500000e+01, 3.89700000e+01, ...,

        2.46950523e+01, 6.03364381e+01, 3.68022935e+01],

       [1.15658000e+03, 1.60167000e+01, 3.86767000e+01, ...,

        2.47679324e+01, 5.98089427e+01, 3.82596451e+01]])

scaled_features

array([[-0.47564285, -0.62408514, -0.57496262, ..., -1.23976862,

        -0.84099698, 0.6398416 ],

       [-0.47577027, -0.62273749, -0.5754623 , ..., -1.22764677,

        -0.85903686, 0.57633607],

       [-0.47532429, -0.62138984, -0.57597417, ..., -1.20155224,

        -0.86744792, 0.60021881],

       ...,

       [ 6.54690076, 3.57654404, 4.13838295, ..., 2.19766676,

         2.75960615, 4.23265262],

       [ 6.54670962, 3.60349707, 4.11522706, ..., 2.22219859,

         2.73177202, 4.20265751],

       [ 6.85373845, 3.62147473, 4.07948167, ..., 2.23653834,

         2.69194545, 4.48594968]])

As you can see, the scaled features are easier to read and interpret. While scaling data, we must scale all data together, including training and testing data.

Performing and Validating Linear Regression

Now that scaling is done, our next task is to separate the training and testing data from each other. We will be using 90% of the data as training data, and the rest (10%) will be used as test data:

from sklearn import model_selection

(features_train, features_test, label_train, label_test) = model_selection.train_test_split(

    scaled_features, label, test_size=0.1

)

The train_test_split function shuffles the rows of our data while keeping the correspondence between features and labels, puts approximately 10% of all the data into the test variables, and keeps 90% for the training variables. This will help us evaluate how good our model is.
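As a side note, if you want the split to be reproducible between runs, train_test_split accepts an optional random_state argument (the value 42 below is arbitrary):

(features_train, features_test, label_train, label_test) = model_selection.train_test_split(
    scaled_features, label, test_size=0.1, random_state=42
)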

We can now create the linear regression model based on the training data:

from sklearn import linear_model

model = linear_model.LinearRegression()

model.fit(features_train, label_train)

Once the model is ready, we can use it to predict the labels belonging to the test feature values:

label_predicted = model.predict(features_test)

If you are interested in the relationship between the predicted label values and the actual test label values, you can plot them against each other using a Python two-dimensional plotting utility:

from matplotlib import pyplot as plot

plot.scatter(label_test, label_predicted)

This gives you an image of a graph where the test data is compared to the results of the prediction. The closer these values are to the y = x line, the better.

You can see from the following graph that the predictions do center around the y=x line with a degree of error. This error is obvious, as otherwise, we would be able to make a lot of money with such a simple prediction, and everyone would pursue predicting stock prices instead of working in their own field of expertise:

Figure 3.16: Graph for the plot scatter function

We can conclude that there is a degree of error in the model. The question is, how can we quantify this error? The answer is simple: we can score the model using the built-in utility that calculates the coefficient of determination (R²) of the model:

model.score(features_test, label_test)

0.9051697119010782

We can conclude that the model is very accurate. This is not a surprise, because almost every financial advisor likes to tell us that the market grows at around 6-7% a year. This is roughly linear growth, and the model essentially predicts that the markets will continue growing at a linear rate. Concluding that markets tend to go up in the long run is not rocket science.

Predicting the Future

We have already used prediction on test data. Now, it's time to use the actual data to see into the future.

import quandl

import numpy as np

from sklearn import preprocessing

from sklearn import model_selection

from sklearn import linear_model

data_frame = quandl.get("YALE/SPCOMP")

data_frame[['Long Interest Rate', 'Real Price', 'Real Dividend', 'Cyclically Adjusted PE Ratio']]

data_frame.fillna(-100, inplace=True)

data_frame['Real Price Label'] = data_frame['Real Price'].shift(-240)

data_frame.dropna(inplace=True)

features = np.array(data_frame.drop('Real Price Label', 1))

label = np.array(data_frame['Real Price Label'])

scaled_features = preprocessing.scale(features)

(features_train, features_test, label_train, label_test) = model_selection.train_test_split(

    scaled_features, label, test_size=0.1

)

model = linear_model.LinearRegression()

model.fit(features_train, label_train)

label_predicted = model.predict(features_test)

The trick to predicting the future is that we have to save the values belonging to the values we dropped when building the model. We built our stock price model based on historical data from 20 years ago. Now, we have to keep this data, and we also have to include this data in scaling:

import quandl

import numpy as np

from sklearn import preprocessing

from sklearn import model_selection

from sklearn import linear_model

data_frame = quandl.get("YALE/SPCOMP")

data_frame[['Long Interest Rate', 'Real Price', 'Real Dividend', 'Cyclically Adjusted PE Ratio']]

data_frame.fillna(-100, inplace=True)

# We shift the price data to be predicted 20 years forward

data_frame['Real Price Label'] = data_frame['Real Price'].shift(-240)

# Then exclude the label column from the features

features = np.array(data_frame.drop('Real Price Label', 1))

# We scale before dropping the last 240 rows from the

# features

scaled_features = preprocessing.scale(features)

# Save the last 240 rows before dropping them

scaled_features_latest_240 = scaled_features[-240:]

# Exclude the last 240 rows from the data used for model

# building

scaled_features = scaled_features[:-240]

# Now we can drop the last 240 rows from the data frame

data_frame.dropna(inplace=True)

# Then build the labels from the remaining data

label = np.array(data_frame['Real Price Label'])

# The rest of the model building stays

(features_train, features_test, label_train, label_test) = model_selection.train_test_split(

    scaled_features, label, test_size=0.1

)

model = linear_model.LinearRegression()

model.fit(features_train, label_train)

Now that we have access to the scaled feature values of the last 20 years, we can predict the index prices of the next 20 years using the following code:

label_predicted = model.predict(scaled_features_latest_240)

This sounds great in theory, but in practice, using this model for making money by betting on the forecast is by no means better than gambling in a casino. This is just an example model to illustrate prediction; it is definitely not sufficient to be used for short-term or long-term speculation on market prices.

If you look at the values, you can see why this prediction may easily backfire. First, there are a few negative values, which are impossible for an index. Then, due to a few major market crashes, linear regression forecasts a doomsday scenario at some point in the future, where the index drops from more than 3,000 to literally zero within a year. Linear regression is not a perfect tool for looking 20 years ahead based on limited data. Also, note that stock prices are meant to be close to time-invariant systems, which means that the past does not imply any patterns in the future.

Let's output the prediction belonging to the first ten years:

from matplotlib import pyplot as plot

plot.plot(list(range(1,241)), label_predicted[:240])

The output is as follows:

Figure 3.17: Graph for the plot function with a range of 1 to 241 and the label predicted as 240

The graph is hard to read near the end due to extreme values. Let's draw our conclusions by omitting the last five years and just plotting the first 180 months out of the predictions:

plot.plot(list(range(1,181)), label_predicted[:180])

The output is as follows:

Figure 3.18: Graph for the plot function with a range of 1 to 181 and the label predicted as 180

This is a scary future for the American economy. According to this model, the S&P 500 has a bull run for about 2.5-3 years and doesn't recover for a long time. Also, notice that our model does not know that index values cannot be negative.

Polynomial and Support Vector Regression

When performing polynomial regression, the relationship between x and y, or using their other names, features and labels, is not a linear equation, but a polynomial equation. This means that instead of the y = a*x+b equation, we can have multiple coefficients and multiple powers of x in the equation.

To make matters even more complicated, we can perform polynomial regression using multiple variables, where each feature may have coefficients multiplying different powers of the feature.

Our task is to find a curve that best fits our dataset. Once polynomial regression is extended to multiple variables, we will learn the Support Vector Machines model to perform polynomial regression.

Polynomial Regression with One Variable

As a recap, we have performed two types of regression so far:

  • Simple linear regression: y = a*x + b
  • Multiple linear regression: y = b + a1 * x1 + a2 * x2 + … + an * xn

We will now learn how to do polynomial linear regression with one variable. The equation for polynomial linear regression is as follows:

y = b + a1*x + a2*(x ** 2) + a3*(x ** 3) + … + an * (x ** n)

Here, we have a vector of coefficients (b, a1, a2, …, an) multiplying a vector of powers of x, (1, x**1, x**2, …, x**n).
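In other words, evaluating the polynomial is once again a dot product (a sketch with hypothetical coefficients of a degree 3 polynomial):

import numpy as np

b, a1, a2, a3 = 1.0, 2.0, 0.5, -0.1                  # hypothetical coefficients
x = 3.0
y = np.dot([b, a1, a2, a3], [1, x, x ** 2, x ** 3])  # 1 + 6 + 4.5 - 2.7 = 8.8
# The same value via poly1d, whose coefficients are listed from the highest power down:
y_check = np.poly1d([a3, a2, a1, b])(x)              # also 8.8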

At times, polynomial regression works better than linear regression. If the relationship between labels and features can be described using a linear equation, then using linear equation makes perfect sense. If we have a non-linear growth, polynomial regression tends to approximate the relationship between features and labels better.

The simplest implementation of linear regression with one variable was the polyfit method of the NumPy library. In the next exercise, we will perform polynomial regression with degrees of 1, 2, and 3.

Note

Even though our polynomial regression has an equation containing coefficients of x ** n, this equation is still referred to as polynomial linear regression in the literature. Regression is made linear not because we restrict the usage of higher powers of x in the equation, but because the coefficients a1, a2, …, and so on are linear in the equation. This means that we use the toolset of linear algebra, and work with matrices and vectors to find the missing coefficients that minimize the error of the approximation.
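To make the linear algebra remark concrete, here is a sketch that recovers the degree 1 coefficients from earlier by solving a linear least squares problem on a Vandermonde matrix, which is essentially what np.polyfit does internally:

import numpy as np

x = np.array(range(1, 14), dtype=float)
y = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62], dtype=float)
design_matrix = np.vander(x, 2)                       # columns: x**1, x**0
coefficients, *_ = np.linalg.lstsq(design_matrix, y, rcond=None)
# coefficients is approximately [4.857, -2.769], matching np.polyfit(x, y, 1)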

Exercise 9: 1st, 2nd, and 3rd Degree Polynomial Regression

Perform a 1st, 2nd, and 3rd degree polynomial regression on the following two datasets:

Declare the two datasets

import numpy as np

from matplotlib import pyplot as plot

# First dataset:

x1 = np.array(range(1, 14))

y1 = np.array([2, 8, 8, 18, 25, 21, 32, 44, 32, 48, 61, 45, 62])

# Second dataset:

x2 = np.array(range(1, 14))

y2 = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144])

Then plot your results on the graph:

Let's start with plotting the first example:

import matplotlib.pyplot as plot

deg1 = np.polyfit(x1, y1, 1)

# array([ 4.85714286, -2.76923077])

f1 = np.poly1d(deg1)

deg2 = np.polyfit(x1, y1, 2)

# array([-0.03196803, 5.3046953 , -3.88811189])

f2 = np.poly1d(deg2)

deg3 = np.polyfit(x1, y1, 3)

# array([-0.01136364, 0.20666833, 3.91833167, -1.97902098])

f3 = np.poly1d(deg3)

plot.plot(

    x1, y1, 'o',

    x1, f1(x1),

    x1, f2(x1),

    x1, f3(x1)

 )

plot.show()

The output is as follows:

Figure 3.19: Graph showing the first dataset with linear curves

As the coefficients are enumerated from left to right in order of decreasing degree, we can see that the higher degree coefficients are close to negligible. In other words, the three curves are almost on top of each other, and we can only detect a divergence near the right edge. This is because we are working on a dataset that can be approximated very well with a linear model.

In fact, the first dataset was created out of a linear function. Any nonzero coefficients for x**2 and x**3 are the results of overfitting the model based on the available data. The linear model is better for predicting values outside the range of the training data than any higher degree polynomial.

Let's contrast this behavior with the second example. We know that the Fibonacci sequence is non-linear, so using a linear equation to approximate it is a clear case of underfitting. Here, we expect a higher degree polynomial to perform better:

deg1 = np.polyfit(x2, y2, 1)

# array([ 9.12087912, -34.92307692])

f1 = np.poly1d(deg1)

deg2 = np.polyfit(x2, y2, 2)

# array([ 1.75024975, -15.38261738, 26.33566434])

f2 = np.poly1d(deg2)

deg3 = np.polyfit(x2, y2, 3)

# array([0.2465035, -3.42632368, 14.69080919,

# -15.07692308])

f3 = np.poly1d(deg3)

plot.plot(

    x2, y2, 'o', # blue dots

    x2, f1(x2), # orange

    x2, f2(x2), # green

    x2, f3(x2) # red

)

The output is as follows:

Figure 3.20: Graph showing second dataset points and three polynomial curves

The difference is clear. The quadratic curve fits the points a lot better than the linear one. The cubic curve is even better.

If you research Binet's formula, you will find that the Fibonacci sequence grows approximately exponentially, as the xth Fibonacci number is roughly proportional to the xth power of a constant (the golden ratio). Therefore, the higher the degree of the polynomial we use, the more accurately our approximation will fit.

Polynomial Regression with Multiple Variables

When we have one variable of degree n, we have n+1 coefficients in the equation:

y = b + a1*x + a2*(x ** 2) + a3*(x ** 3) + … + an * (x ** n)

Once we deal with multiple features x1, x2, …, xm, and their powers of up to the nth degree, we get an m * (n+1) matrix of coefficients. The math will become quite lengthy once we start exploring the details and prove how a polynomial model works. We will also lose the nice visualizations of two-dimensional curves.

Therefore, we will apply the concepts learned in the previous section on polynomial regression with one variable and omit the math. When training and testing a linear regression model, we can calculate its score to see how good an approximation the model is.

In scikit-learn, the degree of the polynomials used in the approximation is a simple parameter in the model.

As polynomial regression is a form of linear regression, we can perform polynomial regression without changing the regression model. All we need to do is transform the input and keep the linear regression model.
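To get a feel for this transformation, here is a tiny sketch on a single made-up sample with two features; note that scikit-learn also adds interaction terms such as x1*x2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2.0, 3.0]])                               # one sample with features x1=2, x2=3
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
# expanded is [[1, 2, 3, 4, 6, 9]]: 1, x1, x2, x1**2, x1*x2, x2**2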

In the full example that follows, the same transformation is applied to the scaled stock features with the fit_transform method of the PolynomialFeatures class:

import numpy as np

import quandl

from sklearn import preprocessing

from sklearn import model_selection

from sklearn import linear_model

from matplotlib import pyplot as plot

from sklearn.preprocessing import PolynomialFeatures

data_frame = quandl.get("YALE/SPCOMP")

data_frame.fillna(-100, inplace=True)

# We shift the price data to be predicted 20 years forward

data_frame['Real Price Label'] = data_frame['Real Price'].shift(-240)

# Then exclude the label column from the features

features = np.array(data_frame.drop('Real Price Label', 1))

# We scale before dropping the last 240 rows from the features

scaled_features = preprocessing.scale(features)

# Save the last 240 rows before dropping them

scaled_features_latest_240 = scaled_features[-240:]

# Exclude the last 240 rows from the data used for model building

scaled_features = scaled_features[:-240]

# Now we can drop the last 240 rows from the data frame

data_frame.dropna(inplace=True)

# Then build the labels from the remaining data

label = np.array(data_frame['Real Price Label'])

# Create a degree 3 polynomial feature transformer and expand the scaled features

poly_regressor = PolynomialFeatures(degree=3)

poly_scaled_features = poly_regressor.fit_transform(scaled_features)

# Split to training and testing data

(

    poly_features_train, poly_features_test,

    poly_label_train, poly_label_test

) = model_selection.train_test_split(

    poly_scaled_features,

    label, test_size=0.1

)

# Apply linear regression

model = linear_model.LinearRegression()

model.fit(poly_features_train, poly_label_train)

# Model score

print('Score: ', model.score(poly_features_test, poly_label_test))

# Prediction

poly_label_predicted = model.predict(poly_features_test)

plot.scatter(poly_label_test, poly_label_predicted)

The model scores too well. Chances are, the polynomial model is overfitting the dataset.

There is another model in scikit-learn that can perform polynomial regression: the SVM, which stands for Support Vector Machine.

Support Vector Regression

Support Vector Machines are binary classifiers defined on a vector space. They divide the space with a surface. An SVM classifier takes classified data and tries to predict where unclassified data belongs. Once the classification of a data point is determined, it gets labeled.

Support Vector Machines can also be used for regression. Instead of labeling data, we can predict future values in a series. The Support Vector Regression model uses the space between our data as a margin of error. Based on the margin of error, it makes predictions regarding future values.

If the margin of error is too small, we risk overfitting the existing dataset. If the margin of error is too big, we risk underfitting the existing dataset.

A kernel describes the surface dividing the state space in the case of a classifier. A kernel is also used to measure the margin of error in the case of a regressor. This kernel can use a linear model, a polynomial model, or many other possible models. The default kernel is RBF, which stands for Radial Basis Function.

Support Vector Regression is an advanced topic, which is outside the scope of this book. Therefore, we will only stick to a walkthrough of how easy it is to try out another regression model on our test data.

Suppose we have our features and labels in two separate NumPy arrays. Let's recall how we performed linear regression on them:

from sklearn import model_selection

from sklearn import linear_model

(features_train, features_test, label_train,

 label_test) = model_selection.train_test_split(scaled_features, label, test_size=0.1)

model = linear_model.LinearRegression()

model.fit(features_train, label_train)

We can perform regression with Support Vector Machines by changing the linear model to a support vector model:

from sklearn import model_selection

from sklearn import svm

from matplotlib import pyplot as plot

# The rest of the model building stays the same

(features_train, features_test, label_train,

 label_test) = model_selection.train_test_split(scaled_features, label, test_size=0.1)

model = svm.SVR()

model.fit(features_train, label_train)

label_predicted = model.predict(features_test)

print('Score: ', model.score(features_test, label_test))

plot.plot(label_test, label_predicted, 'o')

The output is as follows:

Figure 3.21: Graph showing Support Vector regression with a linear model

The output is as follows:

 -0.19365084431020874

The model score is quite low, and the points don't align with the y=x line. Prediction with the default parameters is quite poor.

The output of the model describes the parameters of the Support Vector Machine:

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

We could tweak these parameters to increase the accuracy of the prediction.
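For instance, a sketch of such an experiment could look as follows (the parameter values are arbitrary guesses rather than tuned recommendations):

model = svm.SVR(kernel='rbf', C=100, epsilon=0.01)  # higher penalty and stricter margin than the defaults
model.fit(features_train, label_train)
print('Score: ', model.score(features_test, label_test))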

Support Vector Machines with a Degree 3 Polynomial Kernel

Let's switch the kernel of the Support Vector Machine to poly. The default degree of the polynomial is 3:

model = svm.SVR(kernel='poly')

model.fit(features_train, label_train)

label_predicted = model.predict(features_test)

plot.plot(label_test, label_predicted, 'o')

The output is as follows:

Figure 3.22: Graph showing Support Vector regression with a polynomial kernel of degree 3

model.score(features_test, label_test)

The output is as follows:

0.06388628722032952

With Support Vector Machines, we often end up with points concentrated in small areas. We could change the margin of error to separate the points a bit more.

Activity 6: Stock Price Prediction with Quadratic and Cubic Linear Polynomial Regression with Multiple Variables

In this section, we will discuss how to perform linear, polynomial, and support vector regression with scikit-learn. We will also learn how to find the best fit model for a given task. We will be assuming that you are a software engineer at a financial institution and your employer wants to know whether linear regression or support vector regression is a better fit for predicting stock prices. You will have to load all of the data of the S&P 500 from a data source. Then, you will need to build a regressor using linear regression, cubic polynomial linear regression, and a support vector regression with a polynomial kernel of degree 3 before separating the training and test data. Plot the test labels and the prediction results and compare them with the y=x line. Finally, compare how well the three models score.

  1. Load the S&P 500 index data using Quandl, and then prepare the data for prediction.
  2. Use a polynomial of degree 1 for the evaluation of the model and for the prediction.

    The closer the dots are to the y=x line, the less error the model works with.

    Perform a linear multiple regression with quadratic polynomials. The only change is in the Linear Regression model.

  3. Perform Support Vector regression with a polynomial kernel of degree 3.

The model does not look efficient at all. For some reason, this model clearly prefers lower values for the S&P 500 that are completely unrealistic, assuming that the stock market does not lose 80% of its value within a day.

Note

The solution to this activity is available at page 271.

Summary

In this chapter, we have learned the fundamentals of Linear Regression.

After going through some basic mathematics, we dived into the mathematics of linear regression using one variable and multiple variables.

We also dealt with challenges that occur in regression, such as loading data from external sources like a .csv file, Yahoo Finance, or Quandl. After loading the data, we learned how to identify the features and labels, how to scale data, and how to format data to perform regression.

We learned how to train and test a linear regression engine, and how to predict the future. Our results were visualized by an easy-to-use Python graph plotting library called pyplot.

A more complex form of linear regression is a linear polynomial regression of arbitrary degree. We learned how to define these regression problems on multiple variables. We compared their performance to each other on stock price prediction problems. As an alternative to polynomial regression, we also introduced Support Vector Machines as a regression model and experimented with two kernels.

You will soon learn about another field inside machine learning. The setup and code structure of this machine learning method will be very similar to regression, while the problem domain is somewhat different. In the next chapter, you will learn the ins and outs of classification.
