There's more...

We'll now do the same exercise with LOOCV by using LeaveOneOut from sklearn.model_selection:

  1. We'll read our data once again and split it into the features and response sets:
import pandas as pd

# Let's read our data.
df_autodata = pd.read_csv("autompg.csv")

# Fill NAs with the median value (assigning back avoids chained in-place modification)
df_autodata['horsepower'] = df_autodata['horsepower'].fillna(df_autodata['horsepower'].median())

# Drop carname variable
df_autodata.drop(['carname'], axis=1, inplace=True)

X = df_autodata.iloc[:,1:8]
Y = df_autodata.iloc[:,0]
  1. We use LOOCV to build our models:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut

loocv = LeaveOneOut()

loo_ytests = []
loo_predictedvalues = []
mean_mse = 0.0

for train_index, test_index in loocv.split(X):
    # split() yields positional indices, so we select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

    model = LinearRegression().fit(X_train, Y_train)
    Y_pred = model.predict(X_test)

    # there is only one y-test and y-pred per iteration over loocv.split,
    # so we append them to the respective lists.
    loo_ytests += list(Y_test)
    loo_predictedvalues += list(Y_pred)

    # accumulate the squared error for the single held-out observation
    mean_mse += mean_squared_error(Y_test, Y_pred)

# R^2 is undefined for a single observation, so we score the pooled predictions
mse = mean_squared_error(loo_ytests, loo_predictedvalues)
r2score = r2_score(loo_ytests, loo_predictedvalues)
print("R^2: {:.2f}, MSE: {:.2f}".format(r2score, mse))
  1. We can look at our coefficient of determination using r2_score() and the mean squared error using mean_squared_error(), and then print the average cross-validation score:
print("Average CV Score :" ,mean_mse/X.shape[0]) 

We can take a look at the coefficient of determination, and the mean squared error for the LOOCV results:

  1. We can plot the predicted values against the actual values of the response variable:
import matplotlib.pyplot as plt

# Plot the LOOCV predictions against the reported values
plt.scatter(loo_ytests, loo_predictedvalues)
plt.xlabel('Reported mpg')
plt.ylabel('Predicted mpg')
plt.show()

The plot that is generated by the preceding code gives us the following output:

In LOOCV, there is no randomness in the splitting method: each observation is held out exactly once, so repeated runs on the same data always give the same result.
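The deterministic behavior of LOOCV can be checked with a minimal sketch. The data here is synthetic (X_demo and y_demo are hypothetical stand-ins, not the autompg dataset), and cross_val_score is used as a shorter alternative to the manual loop rather than as the method used in this recipe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not autompg)
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 3)
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(20)

loocv = LeaveOneOut()

# Splitting twice yields the same held-out indices in the same order
first_pass = [int(test[0]) for _, test in loocv.split(X_demo)]
second_pass = [int(test[0]) for _, test in loocv.split(X_demo)]
splits_identical = first_pass == second_pass
n_splits = loocv.get_n_splits(X_demo)

# One score per held-out observation; scikit-learn reports negated MSE,
# so we flip the sign to recover the mean squared error
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=loocv, scoring="neg_mean_squared_error")
loocv_mse = -scores.mean()

print("Number of splits:", n_splits)
print("Identical across passes:", splits_identical)
print("LOOCV MSE: {:.4f}".format(loocv_mse))
```

Because LeaveOneOut takes no random_state, there is nothing to seed: the number of splits always equals the number of observations, which is also why LOOCV can become expensive on large datasets.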

The stratified k-fold CV method is often used in classification problems. It is a variation of the k-fold CV method that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the original dataset. StratifiedShuffleSplit is a variation of shuffle split that creates randomized splits while preserving the percentage of samples for every target class.
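As a rough illustration of stratification, the following sketch applies StratifiedKFold and StratifiedShuffleSplit to a toy imbalanced label vector. The names X_toy and y_toy, the 80/20 class mix, and the split settings are assumptions chosen for the example, not part of the recipe:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Toy imbalanced labels (hypothetical): 40 samples of class 0, 10 of class 1
y_toy = np.array([0] * 40 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratified k-fold: every test fold keeps the original class mix
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_ratios = [float(y_toy[test_idx].mean())
              for _, test_idx in skf.split(X_toy, y_toy)]

# Stratified shuffle split: random test sets that may overlap across splits,
# each still preserving the class proportions
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
sss_ratios = [float(y_toy[test_idx].mean())
              for _, test_idx in sss.split(X_toy, y_toy)]

print("Share of class 1 in the full data:", float(y_toy.mean()))
print("Share of class 1 per stratified fold:", skf_ratios)
print("Share of class 1 per stratified shuffle split:", sss_ratios)
```

Note that both splitters need the labels passed to split(), since the class membership is what drives the assignment of samples to folds.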

