Let's move on to building our model. We will start by identifying our numerical and categorical variables, and then study the correlations between them using the correlation matrix and correlation plots.
- First, we'll take a look at the variables and the variable types:
# See the variables and their data types
df_housingdata.dtypes
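If we just want a quick summary of how many columns belong to each type, we can also count them (a small illustrative addition):
# Count how many columns there are of each data type
df_housingdata.dtypes.value_counts()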
- We'll then look at the correlation matrix. The corr() method computes the pairwise correlation of columns:
# We pass 'pearson' as the method for calculating our correlation
# (in pandas 2.0+, also pass numeric_only=True so non-numeric columns are skipped)
df_housingdata.corr(method='pearson')
- Besides this, we'd also like to study the correlation between the predictor variables and the response variable:
# we store the correlation matrix output in a variable
pearson = df_housingdata.corr(method='pearson')
# assume the target attribute is the last column, then drop its correlation with itself
corr_with_target = pearson.iloc[-1][:-1]
# attributes sorted from the most positively correlated
# (apply .abs() first to rank by correlation strength regardless of sign)
corr_with_target.sort_values(ascending=False)
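If we want to glance at, say, the ten attributes most correlated with the target, we can chain head() onto the sorted output (an illustrative variation on the preceding code):
# Example: view the ten attributes most positively correlated with the target
corr_with_target.sort_values(ascending=False).head(10)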
- We can look at the correlation plot using the heatmap() function from the seaborn package:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(11, 11))
# Generate a mask for the upper triangle
# np.zeros_like() returns an array of zeros with the same shape and type as
# the given array; passing the correlation matrix gives us a square boolean
# NumPy array called "mask"
mask = np.zeros_like(pearson, dtype=bool)
# triu_indices_from() returns the indices for the upper triangle of the array
# k offsets the diagonal: k=0 includes the main diagonal in the mask,
# while k=1 would start one diagonal above it
mask[np.triu_indices_from(mask, k=0)] = True
# Alternatively, we could build a custom diverging palette:
# the first two parameters are the anchor hues for the negative and positive
# extents of the map, s is the anchor saturation, and as_cmap=True returns
# a matplotlib colormap object rather than a list of colors
# cmap = sns.diverging_palette(10, 129, s=50, as_cmap=True)
# cmap="YlGnBu" gives the colors from the Yellow-Green-Blue palette
# Adjust the size of the legend bar with cbar_kws={"shrink": 0.5}
sns.heatmap(pearson, mask=mask, cmap="YlGnBu", vmax=.3, center=0,
            square=True, linewidths=.1, cbar_kws={"shrink": 0.5})
The following screenshot shows the correlation plot. Note that we have removed the upper triangle of the heatmap using the np.zeros_like() and np.triu_indices_from() functions:
Let's explore our data by visualizing other variables.
- We can look at the distribution of our target variable, SalePrice, using a histogram with a kernel density estimator as follows:
# Setting the plot size
plt.figure(figsize=(8, 8))
# Note: distplot() is deprecated in newer seaborn releases;
# sns.histplot(df_housingdata['SalePrice'], bins=50, kde=True) is the modern equivalent
sns.distplot(df_housingdata['SalePrice'], bins=50, kde=True)
The following screenshot gives us the distribution plot for the SalePrice variable:
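Sale prices are typically right-skewed, and we can quantify what the plot suggests with a quick skewness check (an illustrative addition, not part of the original recipe):
# Skewness near 0 indicates a symmetric distribution;
# a large positive value indicates a long right tail
print(df_housingdata['SalePrice'].skew())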
- We can also use JointGrid() from our seaborn package to plot a combination of plots:
from scipy import stats
# Note: the positional-argument form and the annotate() method below require
# an older seaborn release (<= 0.10); see the modern equivalent after this step
g = sns.JointGrid(df_housingdata['YearBuilt'], df_housingdata['SalePrice'])
g = g.plot(sns.regplot, sns.distplot)
g = g.annotate(stats.pearsonr)
With the preceding code, we are able to plot the scatter plot with a fitted regression line for YearBuilt and SalePrice, while also plotting the histogram for each of these variables on the marginal axes:
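In seaborn 0.11 and later, where annotate() is no longer available, a roughly equivalent plot can be produced with jointplot(); the following is a minimal sketch assuming a recent seaborn release:
# Scatter plot with regression line and marginal distributions
g = sns.jointplot(x='YearBuilt', y='SalePrice', data=df_housingdata, kind='reg')
# Compute the Pearson correlation separately and print it
r, p = stats.pearsonr(df_housingdata['YearBuilt'], df_housingdata['SalePrice'])
print('Pearson r = %.3f (p = %.3g)' % (r, p))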
- Let's now scale our numeric variables using min-max normalization. To do this, we first need to select only the numeric variables from our dataset:
# create a list holding the names of the numeric data types, viz. int16, int32, and so on
num_cols = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# Filter out variables with numeric data types
df_numcols_only = df_housingdata.select_dtypes(include=num_cols)
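As a quick sanity check (illustrative, not part of the original recipe), we can confirm that the filter kept only numeric columns:
# Confirm that only numeric columns remain after filtering
print(df_numcols_only.shape)
print(df_numcols_only.dtypes.unique())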
- We will now apply the min-max scaling to our numeric variables:
# Importing MinMaxScaler and initializing it
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
# Scaling down the numeric variables
# We exclude SalePrice (the last column) by selecting columns 0 to 35 with iloc
df_housingdata_numcols = pd.DataFrame(min_max.fit_transform(df_numcols_only.iloc[:, 0:36]),
                                      columns=df_numcols_only.iloc[:, 0:36].columns.tolist())
In the following table, we can see that our numeric variables have been scaled down:
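We can also verify the scaling programmatically (an illustrative check): after min-max normalization, every column should lie between 0 and 1:
# The smallest minimum should be 0 and the largest maximum should be 1
print(df_housingdata_numcols.min().min(), df_housingdata_numcols.max().max())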
- Now, we will perform one-hot encoding on our categorical variables:
# We exclude all numeric columns
df_housingdata_catcol = df_housingdata.select_dtypes(exclude=num_cols)
# Steps for one-hot encoding:
# We iterate through each categorical column name
# Create encoded variables for each categorical column
# Concatenate the encoded variables to the DataFrame
# Remove the original categorical variable
for col in df_housingdata_catcol.columns.values:
    one_hot_encoded_variables = pd.get_dummies(df_housingdata_catcol[col], prefix=col)
    df_housingdata_catcol = pd.concat([df_housingdata_catcol, one_hot_encoded_variables], axis=1)
    df_housingdata_catcol.drop([col], axis=1, inplace=True)
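As an aside, pandas can produce the same encoded columns without an explicit loop: calling get_dummies() on the whole DataFrame encodes every non-numeric column in one step, using each column name as the prefix. This one-liner sketch (with a hypothetical variable name) is equivalent to the loop above:
# Equivalent one-step encoding of all categorical columns
df_housingdata_catcol_alt = pd.get_dummies(df_housingdata.select_dtypes(exclude=num_cols))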
- We have now created a DataFrame with only numeric variables that have been scaled. We have also created a DataFrame with only categorical variables that have been encoded. Let's combine the two DataFrames into a single DataFrame:
df_housedata = pd.concat([df_housingdata_numcols, df_housingdata_catcol], axis=1)
- We can then concatenate the SalePrice variable to our df_housedata DataFrame:
# Concatenate SalePrice (column 36 of df_numcols_only) to the final DataFrame
df_housedata_final = pd.concat([df_housedata, df_numcols_only.iloc[:,36]], axis=1)
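A quick shape check (illustrative) confirms the concatenation; given the feature selection used in the next step, we would expect 303 columns here, that is, 302 features plus SalePrice:
# The final DataFrame holds the scaled numerics, the encoded categoricals, and SalePrice
print(df_housedata_final.shape)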
- We can create our training and testing datasets using the train_test_split function from sklearn.model_selection:
# Create feature and response variable set
# We create train & test sample from our dataset
from sklearn.model_selection import train_test_split
# create feature & response variables
X = df_housedata_final.iloc[:,0:302]
Y = df_housedata_final['SalePrice']
# Create train & test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
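To verify the 70/30 split, we can print the shapes of the resulting sets (a quick illustrative check):
# X_train and X_test should hold roughly 70% and 30% of the rows, respectively
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)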
- We can now use SGDRegressor() to build a linear model. We fit this linear model by minimizing the regularized empirical loss with SGD:
import numpy as np
from sklearn.linear_model import SGDRegressor
# Note: SGD is stochastic, so scores vary slightly between runs
# unless random_state is set, e.g. SGDRegressor(random_state=1)
lin_model = SGDRegressor()
# We fit our model with train data
lin_model.fit(X_train, Y_train)
# We use predict() to predict our values
lin_model_predictions = lin_model.predict(X_test)
# We check the coefficient of determination with score()
print(lin_model.score(X_test, Y_test))
# We can also check the coefficient of determination with r2_score() from sklearn.metrics
from sklearn.metrics import r2_score
print(r2_score(Y_test, lin_model_predictions))
By running the preceding code, we find that the coefficient of determination is roughly 0.81.
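Recall that the coefficient of determination is defined as R^2 = 1 - SS_res / SS_tot. If we want to verify what score() reports, we can compute it manually (an illustrative cross-check):
# Manual R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((Y_test - lin_model_predictions) ** 2)
ss_tot = np.sum((Y_test - Y_test.mean()) ** 2)
print(1 - ss_res / ss_tot)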
- We check the root mean square error (RMSE) on the test data:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, lin_model_predictions)
rmse = np.sqrt(mse)
print(rmse)
Running the preceding code shows that the RMSE is approximately 36459.44.
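As an aside, scikit-learn 1.4 and later expose a direct RMSE helper, so the square-root step is not needed; a minimal sketch assuming that version:
# root_mean_squared_error() is available in scikit-learn 1.4+
from sklearn.metrics import root_mean_squared_error
print(root_mean_squared_error(Y_test, lin_model_predictions))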
- We now plot the actual and predicted values using matplotlib.pyplot:
plt.figure(figsize=(8, 8))
plt.scatter(Y_test, lin_model_predictions)
plt.xlabel('Actual sale price ($)')
plt.ylabel('Predicted sale price ($)')
plt.tight_layout()
The resulting plot with our actual values and the predicted values will look as follows:
Because most of the points fall along an approximately 45-degree diagonal, the predicted values are quite close to the actual values, apart from a few outliers.
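To make that 45-degree comparison explicit, we can overlay an identity line (y = x) on the same scatter plot; this is a small illustrative extension of the preceding code:
# Re-draw the scatter plot with a y = x reference line
plt.figure(figsize=(8, 8))
plt.scatter(Y_test, lin_model_predictions)
lims = [min(Y_test.min(), lin_model_predictions.min()),
        max(Y_test.max(), lin_model_predictions.max())]
plt.plot(lims, lims, linestyle='--', color='red')
plt.xlabel('Actual sale price ($)')
plt.ylabel('Predicted sale price ($)')
plt.tight_layout()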