Train test split

Once we have built our recommendation engine using collaborative filtering, we don't want to wait until the model is deployed in production to learn about its performance. We want to produce the best-performing recommendation engine. Therefore, during development, we split our data into a training set and a test set. We build our algorithm on the training set and test it against the test set to estimate how well our collaborative filtering method performs.

The recommenderlab package provides the necessary infrastructure to achieve this. The evaluationScheme function helps us create a train/test strategy. It takes a ratingMatrix as input and supports several schemes, including simple split, bootstrap sampling, and k-fold cross-validation.

Let us invoke the evaluationScheme with our input matrix:

> plan <- evaluationScheme(data, method="split", train=0.9, given=10, goodRating=2)
> plan
Evaluation scheme with 10 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.900
Good ratings: >=2.000000
Data set: 1500 x 100 rating matrix of class 'realRatingMatrix' with 107914 ratings.

Using the evaluationScheme method, we allocate 90% of our data to training. The given parameter works as follows: for each user in the test data, 10 jokes are given to the recommender as known ratings, and the rest of that user's ratings are withheld. The model uses these ten known jokes to predict ratings for the withheld jokes, and the predictions can then be compared to the actual values. That way we can evaluate the performance of our recommender system. Finally, the goodRating parameter defines the threshold for a good rating; here we say that any rating greater than or equal to 2 is a positive rating. This is critical for evaluating our classifier. Say we want to find the accuracy of our recommender system; we then need to find the number of ratings where our predictions matched the actuals. If both the actual and the predicted rating are greater than or equal to 2, they are considered a match. We choose 2 here by looking at the density plot of the z-score normalized data.
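
To see what given does in practice, we can count the known ratings kept for each test user with the getData function (described later in this section); with given=10, a quick check like the following should show exactly ten known ratings for every test user:

> # Count the retained (known) ratings per test user; each count should be 10
> table(rowCounts(getData(plan, "known")))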

The method parameter is very important, as it decides the evaluation scheme. In our example, we have used split, which randomly assigns users to the training or test set according to the given proportion.

Another important parameter we did not set is k, which decides the number of times the evaluation scheme is run. For the split method, k defaults to 1. If we had selected k-fold cross-validation as our method, k would default to 10: ten different models would be built on ten splits of the data, and the reported performance would be the average over all ten models.
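
Had we chosen cross-validation instead, the scheme could be created as in the following sketch (plan.cv is just an illustrative name; the given and goodRating values simply mirror our split example):

> # 10-fold cross-validation: every user lands in the test set exactly once across the folds
> plan.cv <- evaluationScheme(data, method="cross-validation", k=10, given=10, goodRating=2)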

The following figure compares the split and k-fold schemes:

As you can see, in a three-fold validation we have three splits of our data into train and test, and the model is evaluated against all three splits.

Refer to https://en.wikipedia.org/wiki/Cross-validation_(statistics) to find out more about cross-validation and, specifically, about k-fold cross-validation.

You can invoke R Help for the documentation:

> help("evaluationScheme")

Now that we have built our scheme, we can use the scheme to extract our test and train dataset.

Extract the train and test data as follows:

> set.seed(100)
> data = normalize(data, method ="Z-score")

> train <- getData(plan, type = "train")
> train
1350 x 100 rating matrix of class 'realRatingMatrix' with 96678 ratings.

> test <- getData(plan, type = "unknown")
> test
150 x 100 rating matrix of class 'realRatingMatrix' with 9736 ratings.
>
> test.known <- getData(plan, "known")
> test.known
150 x 100 rating matrix of class 'realRatingMatrix' with 1500 ratings.
>

The getData function is used to extract the training and test datasets. The type parameter controls which dataset is returned: train returns the training dataset, known returns the ratings of the test users that are given to the recommender, and unknown returns the withheld test ratings used for evaluation.
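
As a quick sanity check, the three pieces should account for every rating in the original matrix; using recommenderlab's nratings function, the following sum should equal the 107,914 ratings reported earlier:

> # Training ratings + known test ratings + withheld test ratings
> nratings(train) + nratings(test.known) + nratings(test)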

This is a standard technique in any machine learning development. The train dataset is used to train the model and the test dataset is used to test its performance. If the model performs poorly, we go back to the train dataset and tune the model. Tuning may involve changing the approach altogether (for example, moving from user-based models to latent factor models and evaluating the performance) or changing some of the parameters of the existing approach, such as the number of neighbors in user-based filtering.
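
For instance, a tuning pass for a user-based model might look like the following sketch (UBCF and the nn values are illustrative choices, not something prescribed by our scheme):

> # Fit a user-based model on the training data with 25 neighbors
> rec <- Recommender(train, method="UBCF", parameter=list(nn=25))
> # Predict ratings for the withheld jokes from the 10 known ratings per test user
> pred <- predict(rec, test.known, type="ratings")
> calcPredictionAccuracy(pred, test)
> # Re-tune with more neighbors and compare the error metrics
> rec50 <- Recommender(train, method="UBCF", parameter=list(nn=50))
> calcPredictionAccuracy(predict(rec50, test.known, type="ratings"), test)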

Our test data is further divided into test.known and test. The test.known dataset contains the 10 known ratings for each test user. Using those ratings, our recommender system produces its predictions, which can then be compared with the actual values in the test dataset.
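
If we want this comparison per user rather than as a single aggregate, calcPredictionAccuracy accepts a byUser argument (pred here refers to the prediction object from the earlier sketch):

> # Per-user RMSE/MSE/MAE of predicted versus withheld ratings
> head(calcPredictionAccuracy(pred, test, byUser=TRUE))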

Let us look at the dimensions of our train and test datasets:

> dim(train@data)
[1] 1350 100
> dim(test@data)
[1] 150 100

Out of a total of 1,500 users, 90% (that is, 1,350 users) are allocated to training. The remaining 150 users are reserved for testing.

 
