We'll now do the same exercise with LOOCV by using LeaveOneOut from sklearn.model_selection:
- We'll read our data once again and split it into the features and response sets:
# Import the libraries we need
import numpy as np
import pandas as pd

# Let's read our data
df_autodata = pd.read_csv("autompg.csv")
# Fill NAs in horsepower with the median value
df_autodata['horsepower'].fillna(df_autodata['horsepower'].median(), inplace=True)
# Drop the carname variable
df_autodata.drop(['carname'], axis=1, inplace=True)
X = df_autodata.iloc[:, 1:8]
Y = df_autodata.iloc[:, 0]
# LeaveOneOut yields index arrays, so convert the DataFrames to arrays
X = np.array(X)
Y = np.array(Y)
- We use LOOCV to build our models:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut

loocv = LeaveOneOut()
loo_ytests = []
loo_predictedvalues = []
for train_index, test_index in loocv.split(X):
    # loocv.split() yields index arrays, which is why we converted
    # the DataFrames to arrays above
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    model = LinearRegression()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # There is only one y-test and one y-pred per iteration of loocv.split(),
    # so we append them to the respective lists
    loo_ytests += list(Y_test)
    loo_predictedvalues += list(Y_pred)
- We can look at our coefficient of determination using r2_score() and the mean squared error using mean_squared_error(), computed once over all of the held-out predictions. Because every observation is held out exactly once, this single MSE already is the cross-validated error, so there is no further averaging to do:
mse = mean_squared_error(loo_ytests, loo_predictedvalues)
r2score = r2_score(loo_ytests, loo_predictedvalues)
print("R^2: {:.2f}, MSE: {:.2f}".format(r2score, mse))
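As a cross-check, the entire LOOCV loop can be collapsed into a single cross_val_score call. The sketch below uses synthetic regression data as a stand-in for the autompg dataset, which isn't bundled here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical synthetic data standing in for the autompg features/response
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# cross_val_score negates the MSE (higher is better by convention),
# so flip the sign to recover the error
scores = cross_val_score(LinearRegression(), X, Y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')
print("LOOCV MSE: {:.4f}".format(-scores.mean()))
```

Here scores holds one (negated) squared error per held-out observation, so averaging them gives the same leave-one-out error as the manual loop.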
- We can plot the predicted values against the actual values of the response variable:
## Let us plot the model
import matplotlib.pyplot as plt
plt.scatter(loo_ytests, loo_predictedvalues)
plt.xlabel('Reported mpg')
plt.ylabel('Predicted mpg')
plt.show()
The plot that is generated by the preceding code gives us the following output:
In LOOCV, there is no randomness in the splitting method, so it'll always provide you with the same result.
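We can see this determinism directly: LeaveOneOut takes no random_state parameter, so repeated passes over the same data always yield identical splits. A minimal check on toy data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Any small array will do; only the row count matters to the splitter
X = np.arange(10).reshape(5, 2)
loo = LeaveOneOut()

# Collect the splits from two independent passes
first = [(train.tolist(), test.tolist()) for train, test in loo.split(X)]
second = [(train.tolist(), test.tolist()) for train, test in loo.split(X)]
print(first == second)  # -> True
```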
The stratified k-fold CV method is often used in classification problems. It is a variation of the k-fold CV method that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete dataset. StratifiedShuffleSplit is a variation of shuffle split that creates splits while maintaining the same percentage for every target class.
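The class-ratio guarantee can be verified on an imbalanced toy dataset (the 80/20 labels below are hypothetical, chosen to make the stratification easy to see):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Hypothetical imbalanced labels: 80% class 0, 20% class 1
X = np.zeros((100, 2))
y = np.array([0] * 80 + [1] * 20)

# Every test fold keeps the 80/20 ratio of the full dataset
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # -> [16  4] in each fold

# StratifiedShuffleSplit draws random splits but preserves the same ratio
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=1)
for train_idx, test_idx in sss.split(X, y):
    print(np.bincount(y[test_idx]))  # -> [20  5] in each split
```

Note that both splitters take y as a second argument to split(), since they need the labels to stratify on; plain KFold and LeaveOneOut ignore the labels entirely.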