We'll now do the same exercise with LOOCV by using LeaveOneOut from sklearn.model_selection:
- We'll read our data once again and split it into the features and response sets:
# Import the libraries we need
import numpy as np
import pandas as pd

# Let's read our data
df_autodata = pd.read_csv("autompg.csv")
# Fill NAs in horsepower with the median value
df_autodata['horsepower'].fillna(df_autodata['horsepower'].median(), inplace=True)
# Drop the carname variable
df_autodata.drop(['carname'], axis=1, inplace=True)
X = df_autodata.iloc[:, 1:8]
Y = df_autodata.iloc[:, 0]
# LeaveOneOut yields index arrays, so convert the DataFrames to arrays
X = np.array(X)
Y = np.array(Y)
- We use LOOCV to build our models:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut

loocv = LeaveOneOut()
loo_ytests = []
loo_predictedvalues = []
for train_index, test_index in loocv.split(X):
    # loocv.split() yields index arrays, which is why we converted
    # the DataFrames to arrays above
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    model = LinearRegression()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # There is only one y-test and one y-pred per iteration of loocv.split(),
    # so we append them to the respective lists
    loo_ytests += list(Y_test)
    loo_predictedvalues += list(Y_pred)
- We can look at our coefficient of determination using r2_score() and the mean squared error using mean_squared_error(), computed once over all of the held-out predictions. Because every observation is held out exactly once, this single MSE already is the cross-validated error, so there is no further averaging to do:
mse = mean_squared_error(loo_ytests, loo_predictedvalues)
r2score = r2_score(loo_ytests, loo_predictedvalues)
print("R^2: {:.2f}, MSE: {:.2f}".format(r2score, mse))
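As a cross-check, the entire LOOCV loop can be collapsed into a single cross_val_score call. The sketch below uses synthetic regression data as a stand-in for the autompg dataset, which isn't bundled here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical synthetic data standing in for the autompg features/response
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# cross_val_score negates the MSE (higher is better by convention),
# so flip the sign to recover the error
scores = cross_val_score(LinearRegression(), X, Y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')
print("LOOCV MSE: {:.4f}".format(-scores.mean()))
```

Here scores holds one (negated) squared error per held-out observation, so averaging them gives the same leave-one-out error as the manual loop.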
- We can plot the predicted values against the actual values of the response variable:
## Let us plot the model
import matplotlib.pyplot as plt
plt.scatter(loo_ytests, loo_predictedvalues)
plt.xlabel('Reported mpg')
plt.ylabel('Predicted mpg')
plt.show()
The plot that is generated by the preceding code gives us the following output:
In LOOCV, there is no randomness in the splitting method, so it'll always provide you with the same result.
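We can see this determinism directly: LeaveOneOut takes no random_state parameter, so repeated passes over the same data always yield identical splits. A minimal check on toy data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Any small array will do; only the row count matters to the splitter
X = np.arange(10).reshape(5, 2)
loo = LeaveOneOut()

# Collect the splits from two independent passes
first = [(train.tolist(), test.tolist()) for train, test in loo.split(X)]
second = [(train.tolist(), test.tolist()) for train, test in loo.split(X)]
print(first == second)  # -> True
```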
The stratified k-fold CV method is often used in classification problems. It is a variation of the k-fold CV method that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete dataset. StratifiedShuffleSplit is a variation of shuffle split that creates splits while maintaining the same percentage for every target class.
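The class-ratio guarantee can be verified on an imbalanced toy dataset (the 80/20 labels below are hypothetical, chosen to make the stratification easy to see):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Hypothetical imbalanced labels: 80% class 0, 20% class 1
X = np.zeros((100, 2))
y = np.array([0] * 80 + [1] * 20)

# Every test fold keeps the 80/20 ratio of the full dataset
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # -> [16  4] in each fold

# StratifiedShuffleSplit draws random splits but preserves the same ratio
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=1)
for train_idx, test_idx in sss.split(X, y):
    print(np.bincount(y[test_idx]))  # -> [20  5] in each split
```

Note that both splitters take y as a second argument to split(), since they need the labels to stratify on; plain KFold and LeaveOneOut ignore the labels entirely.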