Let's move on to building our model. We will start by identifying our numerical and categorical variables, and then study the correlations between them using the correlation matrix and correlation plots.
- First, we'll take a look at the variables and the variable types:
# See the variables and their data types
df_housingdata.dtypes
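If we just want a quick summary of how many columns belong to each type, we can also count them (a small illustrative addition):
# Count how many columns there are of each data type
df_housingdata.dtypes.value_counts()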
- We'll then look at the correlation matrix. The corr() method computes the pairwise correlation of columns:
# We pass 'pearson' as the method for calculating our correlation
# (in pandas 2.0+, also pass numeric_only=True so non-numeric columns are skipped)
df_housingdata.corr(method='pearson')
- Besides this, we'd also like to study the correlation between the predictor variables and the response variable:
# we store the correlation matrix output in a variable
pearson = df_housingdata.corr(method='pearson')
# assume the target attribute is the last column, then drop its correlation with itself
corr_with_target = pearson.iloc[-1][:-1]
# attributes sorted from the most positively correlated
# (apply .abs() first to rank by correlation strength regardless of sign)
corr_with_target.sort_values(ascending=False)
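If we want to glance at, say, the ten attributes most correlated with the target, we can chain head() onto the sorted output (an illustrative variation on the preceding code):
# Example: view the ten attributes most positively correlated with the target
corr_with_target.sort_values(ascending=False).head(10)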
- We can look at the correlation plot using the heatmap() function from the seaborn package:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(11, 11))
# Generate a mask for the upper triangle
# np.zeros_like() returns an array of zeros with the same shape and type as
# the given array; passing the correlation matrix gives us a square boolean
# NumPy array called "mask"
mask = np.zeros_like(pearson, dtype=bool)
# triu_indices_from() returns the indices for the upper triangle of the array
# k offsets the diagonal: k=0 includes the main diagonal in the mask,
# while k=1 would start one diagonal above it
mask[np.triu_indices_from(mask, k=0)] = True
# Alternatively, we could build a custom diverging palette:
# the first two parameters are the anchor hues for the negative and positive
# extents of the map, s is the anchor saturation, and as_cmap=True returns
# a matplotlib colormap object rather than a list of colors
# cmap = sns.diverging_palette(10, 129, s=50, as_cmap=True)
# cmap="YlGnBu" gives the colors from the Yellow-Green-Blue palette
# Adjust the size of the legend bar with cbar_kws={"shrink": 0.5}
sns.heatmap(pearson, mask=mask, cmap="YlGnBu", vmax=.3, center=0,
            square=True, linewidths=.1, cbar_kws={"shrink": 0.5})
The following screenshot shows the correlation plot. Note that we have removed the upper triangle of the heatmap using the np.zeros_like() and np.triu_indices_from() functions:
Let's explore our data by visualizing other variables.
- We can look at the distribution of our target variable, SalePrice, using a histogram with a kernel density estimator as follows:
# Setting the plot size
plt.figure(figsize=(8, 8))
# Note: distplot() is deprecated in newer seaborn releases;
# sns.histplot(df_housingdata['SalePrice'], bins=50, kde=True) is the modern equivalent
sns.distplot(df_housingdata['SalePrice'], bins=50, kde=True)
The following screenshot gives us the distribution plot for the SalePrice variable:
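Sale prices are typically right-skewed, and we can quantify what the plot suggests with a quick skewness check (an illustrative addition, not part of the original recipe):
# Skewness near 0 indicates a symmetric distribution;
# a large positive value indicates a long right tail
print(df_housingdata['SalePrice'].skew())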
- We can also use JointGrid() from our seaborn package to plot a combination of plots:
from scipy import stats
# Note: the positional-argument form and the annotate() method below require
# an older seaborn release (<= 0.10); see the modern equivalent after this step
g = sns.JointGrid(df_housingdata['YearBuilt'], df_housingdata['SalePrice'])
g = g.plot(sns.regplot, sns.distplot)
g = g.annotate(stats.pearsonr)
With the preceding code, we are able to plot the scatter plot with a fitted regression line for YearBuilt and SalePrice, while also plotting the histogram for each of these variables on the marginal axes:
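In seaborn 0.11 and later, where annotate() is no longer available, a roughly equivalent plot can be produced with jointplot(); the following is a minimal sketch assuming a recent seaborn release:
# Scatter plot with regression line and marginal distributions
g = sns.jointplot(x='YearBuilt', y='SalePrice', data=df_housingdata, kind='reg')
# Compute the Pearson correlation separately and print it
r, p = stats.pearsonr(df_housingdata['YearBuilt'], df_housingdata['SalePrice'])
print('Pearson r = %.3f (p = %.3g)' % (r, p))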
- Let's now scale our numeric variables using min-max normalization. To do this, we first need to select only the numeric variables from our dataset:
# create a list holding the names of the numeric data types, viz. int16, int32, and so on
num_cols = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# Filter out variables with numeric data types
df_numcols_only = df_housingdata.select_dtypes(include=num_cols)
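As a quick sanity check (illustrative, not part of the original recipe), we can confirm that the filter kept only numeric columns:
# Confirm that only numeric columns remain after filtering
print(df_numcols_only.shape)
print(df_numcols_only.dtypes.unique())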
- We will now apply the min-max scaling to our numeric variables:
# Importing MinMaxScaler and initializing it
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
# Scaling down the numeric variables
# We exclude SalePrice (the last column) by selecting columns 0 to 35 with iloc
df_housingdata_numcols = pd.DataFrame(min_max.fit_transform(df_numcols_only.iloc[:, 0:36]),
                                      columns=df_numcols_only.iloc[:, 0:36].columns.tolist())
In the following table, we can see that our numeric variables have been scaled down:
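We can also verify the scaling programmatically (an illustrative check): after min-max normalization, every column should lie between 0 and 1:
# The smallest minimum should be 0 and the largest maximum should be 1
print(df_housingdata_numcols.min().min(), df_housingdata_numcols.max().max())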
- Now, we will perform one-hot encoding on our categorical variables:
# We exclude all numeric columns
df_housingdata_catcol = df_housingdata.select_dtypes(exclude=num_cols)
# Steps for one-hot encoding:
# We iterate through each categorical column name
# Create encoded variables for each categorical column
# Concatenate the encoded variables to the DataFrame
# Remove the original categorical variable
for col in df_housingdata_catcol.columns.values:
    one_hot_encoded_variables = pd.get_dummies(df_housingdata_catcol[col], prefix=col)
    df_housingdata_catcol = pd.concat([df_housingdata_catcol, one_hot_encoded_variables], axis=1)
    df_housingdata_catcol.drop([col], axis=1, inplace=True)
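As an aside, pandas can produce the same encoded columns without an explicit loop: calling get_dummies() on the whole DataFrame encodes every non-numeric column in one step, using each column name as the prefix. This one-liner sketch (with a hypothetical variable name) is equivalent to the loop above:
# Equivalent one-step encoding of all categorical columns
df_housingdata_catcol_alt = pd.get_dummies(df_housingdata.select_dtypes(exclude=num_cols))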
- We have now created a DataFrame with only numeric variables that have been scaled. We have also created a DataFrame with only categorical variables that have been encoded. Let's combine the two DataFrames into a single DataFrame:
df_housedata = pd.concat([df_housingdata_numcols, df_housingdata_catcol], axis=1)
- We can then concatenate the SalePrice variable to our df_housedata DataFrame:
# Concatenate SalePrice (column 36 of df_numcols_only) to the final DataFrame
df_housedata_final = pd.concat([df_housedata, df_numcols_only.iloc[:,36]], axis=1)
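A quick shape check (illustrative) confirms the concatenation; given the feature selection used in the next step, we would expect 303 columns here, that is, 302 features plus SalePrice:
# The final DataFrame holds the scaled numerics, the encoded categoricals, and SalePrice
print(df_housedata_final.shape)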
- We can create our training and testing datasets using the train_test_split function from sklearn.model_selection:
# Create feature and response variable set
# We create train & test sample from our dataset
from sklearn.model_selection import train_test_split
# create feature & response variables
X = df_housedata_final.iloc[:,0:302]
Y = df_housedata_final['SalePrice']
# Create train & test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
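To verify the 70/30 split, we can print the shapes of the resulting sets (a quick illustrative check):
# X_train and X_test should hold roughly 70% and 30% of the rows, respectively
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)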
- We can now use SGDRegressor() to build a linear model. We fit this linear model by minimizing the regularized empirical loss with SGD:
import numpy as np
from sklearn.linear_model import SGDRegressor
# Note: SGD is stochastic, so scores vary slightly between runs
# unless random_state is set, e.g. SGDRegressor(random_state=1)
lin_model = SGDRegressor()
# We fit our model with train data
lin_model.fit(X_train, Y_train)
# We use predict() to predict our values
lin_model_predictions = lin_model.predict(X_test)
# We check the coefficient of determination with score()
print(lin_model.score(X_test, Y_test))
# We can also check the coefficient of determination with r2_score() from sklearn.metrics
from sklearn.metrics import r2_score
print(r2_score(Y_test, lin_model_predictions))
By running the preceding code, we find that the coefficient of determination is roughly 0.81.
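Recall that the coefficient of determination is defined as R^2 = 1 - SS_res / SS_tot. If we want to verify what score() reports, we can compute it manually (an illustrative cross-check):
# Manual R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((Y_test - lin_model_predictions) ** 2)
ss_tot = np.sum((Y_test - Y_test.mean()) ** 2)
print(1 - ss_res / ss_tot)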
- We check the root mean square error (RMSE) on the test data:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, lin_model_predictions)
rmse = np.sqrt(mse)
print(rmse)
Running the preceding code shows that the RMSE is approximately 36459.44.
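As an aside, scikit-learn 1.4 and later expose a direct RMSE helper, so the square-root step is not needed; a minimal sketch assuming that version:
# root_mean_squared_error() is available in scikit-learn 1.4+
from sklearn.metrics import root_mean_squared_error
print(root_mean_squared_error(Y_test, lin_model_predictions))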
- We now plot the actual and predicted values using matplotlib.pyplot:
plt.figure(figsize=(8, 8))
plt.scatter(Y_test, lin_model_predictions)
plt.xlabel('Actual sale price ($)')
plt.ylabel('Predicted sale price ($)')
plt.tight_layout()
The resulting plot with our actual values and the predicted values will look as follows:
Because most of the points fall along an approximately 45-degree diagonal, the predicted values are quite close to the actual values, apart from a few outliers.
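To make that 45-degree comparison explicit, we can overlay an identity line (y = x) on the same scatter plot; this is a small illustrative extension of the preceding code:
# Re-draw the scatter plot with a y = x reference line
plt.figure(figsize=(8, 8))
plt.scatter(Y_test, lin_model_predictions)
lims = [min(Y_test.min(), lin_model_predictions.min()),
        max(Y_test.max(), lin_model_predictions.max())]
plt.plot(lims, lims, linestyle='--', color='red')
plt.xlabel('Actual sale price ($)')
plt.ylabel('Predicted sale price ($)')
plt.tight_layout()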