Linear regression

The main goal of linear regression is to predict a numeric target value. One way to do this is to write an equation for the target value in terms of the inputs. For example, assume that we are trying to forecast the acceptance rate of a well-rounded student who participates in sports and music, but belongs to a low-income family.

One possible equation is acceptance = 0.0015*income + 0.49*participation_score; this is a regression equation. Simple linear regression, which predicts a quantitative response from a single feature, takes the following form:

y = β0 + β1x

Here, y is the response (the acceptance rate) and x is the single feature.

Together, β0 and β1 are called the model coefficients. To build our model, we must learn the values of these coefficients. Once we have learned them, we can use the model to predict the acceptance rate.
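Once coefficient values are in hand, prediction is just evaluating the line. A minimal sketch follows; the coefficient values here are illustrative, not fitted from real data:

```python
# Illustrative (not fitted) coefficients for a simple linear regression
# of acceptance on a music participation score.
beta0, beta1 = 0.2, 0.49

def predict(x):
    """Predict the acceptance rate from a music participation score."""
    return beta0 + beta1 * x

print(predict(1.0))  # 0.2 + 0.49 * 1.0 = 0.69
```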

These coefficients are estimated using the least squares criterion: we mathematically find the line that minimizes the sum of squared residuals. The following is a portion of the data that is used in this example:
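The least squares estimates for simple linear regression have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal NumPy sketch on synthetic data (the book's CSV is not reproduced here):

```python
import numpy as np

# Synthetic data lying exactly on the line y = 2 + 3x,
# so the least squares fit should recover those coefficients.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 8.0, 11.0, 14.0, 17.0])

# Closed-form least squares estimates:
# beta1 = cov(x, y) / var(x); beta0 = mean(y) - beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Residuals are the vertical distances from each point to the fitted line.
residuals = y - (beta0 + beta1 * x)
print(beta0, beta1, np.sum(residuals ** 2))  # 2.0 3.0 0.0
```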

[Figure: a sample of the dataset used in this example]

The following Python code shows how one can use scatter plots to examine the correlation between variables:

from matplotlib import pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('/Users/myhomedir/sports.csv', index_col=0)

# scatter plots of acceptance against each candidate feature
fig, axs = plt.subplots(1, 3, sharey=True)
df.plot(kind='scatter', x='sports', y='acceptance', ax=axs[0], figsize=(16, 8))
df.plot(kind='scatter', x='music', y='acceptance', ax=axs[1])
df.plot(kind='scatter', x='academic', y='acceptance', ax=axs[2])

# create a fitted model in one line
lmodel = smf.ols(formula='acceptance ~ music', data=df).fit()

# predict at the extremes of the observed range to draw the fitted line
X_new = pd.DataFrame({'music': [df.music.min(), df.music.max()]})
predictions = lmodel.predict(X_new)

df.plot(kind='scatter', x='music', y='acceptance', figsize=(12, 12), s=50)

plt.title("Linear Regression - Fitting Music vs Acceptance Rate", fontsize=20)
plt.xlabel("Music", fontsize=16)
plt.ylabel("Acceptance", fontsize=16)

# then, plot the least squares line
plt.plot(X_new, predictions, c='red', linewidth=2)
[Figure: scatter plot with the least squares line and residuals]

As shown in the preceding image, the blue dots are the observed (x, y) values, the diagonal line is the least squares fit of those values, and the orange lines are the residuals, that is, the vertical distances between the observed values and the least squares line.
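With statsmodels, the fitted coefficients and residuals can be inspected directly from the model object. A self-contained sketch using a synthetic stand-in for the book's sports.csv (the values below are illustrative only):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the book's sports.csv (illustrative values only).
df = pd.DataFrame({'music': [1, 2, 3, 4, 5],
                   'acceptance': [0.30, 0.45, 0.52, 0.70, 0.78]})

model = smf.ols(formula='acceptance ~ music', data=df).fit()

print(model.params)               # fitted intercept and slope
print(model.resid)                # observed minus fitted values
print((model.resid ** 2).sum())   # the quantity least squares minimizes
```

With an intercept in the model, the residuals always sum to (numerically) zero.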

[Figure: scatter plots produced by the preceding code]

In this example, built with statsmodels, pandas, and matplotlib (as shown in the preceding image), we assume that the university scores each student's contribution to academics, sports, and music.

To test a classifier, we can start with known data, withhold the answers, and ask the classifier for its best guess on each example. We can then count the number of times the classifier was wrong and divide it by the total number of tests conducted to get the error rate.
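The error rate computation described above can be sketched in a few lines; the function name and sample labels here are hypothetical:

```python
def error_rate(predictions, truths):
    """Fraction of test cases where the classifier's guess was wrong."""
    wrong = sum(p != t for p, t in zip(predictions, truths))
    return wrong / len(truths)

# One wrong guess out of four tests gives an error rate of 0.25.
print(error_rate(['a', 'b', 'a', 'a'], ['a', 'b', 'b', 'a']))  # 0.25
```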

The following is a plot of linear regression derived from the previous Python code.

[Figure: linear regression fit of music vs. acceptance rate]

There are numerous other Python libraries that one can use for linear regression; scikit-learn, seaborn, statsmodels, and mlpy are some of the notable and popular ones among them. There are numerous examples on the Web that explain linear regression with these packages. For details on the scikit-learn package, refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.
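As a brief sketch of the scikit-learn alternative, the same simple regression can be fitted with LinearRegression; the data below is synthetic, chosen to lie exactly on the line y = 0.1 + 0.2x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic music scores vs. acceptance (illustrative values only),
# lying exactly on y = 0.1 + 0.2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.3, 0.5, 0.7, 0.9])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])  # ~0.1, ~0.2
print(model.predict([[5.0]]))            # extrapolates to ~1.1
```

Note that scikit-learn expects the features as a 2D array (one column per feature), unlike the statsmodels formula interface used earlier.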

There is another interesting machine learning model called decision tree learning, sometimes referred to as a classification tree. A closely related model is the regression tree. Next, we will look at the differences between them and when one makes sense over the other.
