Building a recommendation system with an item-based collaborative filtering technique

The recommenderlab package in R offers item-based collaborative filtering (ITCF) as one option for building a recommendation system. The approach is straightforward: we call the Recommender function and supply it with the necessary parameters. These parameters generally have a strong influence on the performance of the model; therefore, testing each parameter combination is the key to obtaining the best model for recommendations. The following parameters can be passed to the Recommender function:

  • Data normalization: Normalizing the ratings matrix is a key step in preparing the data for the recommendation engine. Normalization removes each user's rating bias from the matrix, for example by subtracting that user's mean rating from all of their ratings. The possible values for this parameter are NULL, center, and Z-score.
  • Distance: This is the similarity metric used within the model. The possible values for this parameter are Cosine similarity, Euclidean distance, and Pearson's correlation.
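
The effect of normalization can be inspected directly with recommenderlab's normalize function. The following is a minimal sketch, assuming the Jester5k dataset that ships with the package:

```r
library(recommenderlab)
data(Jester5k)

# center: subtract each user's mean rating, removing the rating bias
r_center <- normalize(Jester5k, method = "center")

# Z-score: additionally divide by each user's rating standard deviation
r_zscore <- normalize(Jester5k, method = "Z-score")

# after centering, every user's mean rating is (close to) zero
summary(rowMeans(r_center))
```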

With these parameter combinations, we can build and test 3 x 3 = 9 ITCF models. The basic intuition behind ITCF is that if a person likes item A, there is a good probability that they will also like item B, as long as items A and B are similar. Note that similar here does not mean similarity based on the items' attributes, but rather similarity in user preferences: for example, the group of people that liked item A also liked item B. The following diagram shows the working principle of ITCF:

Example showing the working of item based collaborative filtering

Let's explore the diagram in a little more detail. In ITCF, the watermelon and the grapes form a similar-items neighborhood, meaning that, irrespective of users, items that are rated alike form a neighborhood. So when user X likes the watermelon, the other item from the same neighborhood, the grapes, is recommended by the item-based collaborative filtering recommender.
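
Once an ITCF model is fitted, the learned item neighborhoods can be inspected directly. The following is a minimal sketch, again assuming the Jester5k data from recommenderlab; the neighborhood size k = 5 is an illustrative choice:

```r
library(recommenderlab)
data(Jester5k)

# fit an IBCF model that keeps the k = 5 most similar items per item
m <- Recommender(Jester5k[1:1000], method = "IBCF",
                 parameter = list(method = "Cosine", k = 5))

# the model stores a sparse item-by-item similarity matrix;
# the non-zero entries in a row are that item's neighborhood
sim <- getModel(m)$sim
dim(sim)
```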

ITCF involves the following three steps:

  1. Computing the item-to-item similarities through a distance measure: This involves computing the distance between the items. The distance may be computed with one of many distance measures, such as Cosine similarity, Euclidean distance, Manhattan distance, or the Jaccard index. The output of this step is a similarity matrix in which each cell holds the similarity between the item on the cell's row and the item on its column.
  2. Predicting the target item's rating for a specific user: The rating is computed as the weighted sum of the user's ratings for the items most similar to the target item.
  3. Recommending the top N items: Once the ratings of all the unrated items are predicted, we recommend the top N items.
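
The three steps above can be sketched in base R on a toy ratings matrix; the data and the cosine helper below are purely illustrative:

```r
# toy ratings matrix: 4 users x 3 items, NA = not yet rated
r <- matrix(c(5, 4, NA,
              4, 5, 1,
              1, 2, 5,
              2, 1, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), c("A", "B", "C")))

# step 1: item-to-item cosine similarity over co-rated users
cosine <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  sum(x[ok] * y[ok]) / (sqrt(sum(x[ok]^2)) * sqrt(sum(y[ok]^2)))
}
sim <- outer(colnames(r), colnames(r),
             Vectorize(function(i, j) cosine(r[, i], r[, j])))
dimnames(sim) <- list(colnames(r), colnames(r))

# step 2: predict u1's missing rating for item C as the
# similarity-weighted sum of u1's ratings of the other items
w <- sim["C", c("A", "B")]
pred_u1_C <- sum(w * r["u1", c("A", "B")]) / sum(abs(w))

# step 3: rank all unrated items by predicted rating and keep the top N
pred_u1_C
```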

Now, let's build each one of the ITCF models and measure the performance against the test dataset. The following code trains the ITCF models with several parameter combinations:

type = "IBCF"
## train ITCF cosine similarity models
# non-normalized
ITCF_N_C <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = NULL, method = "Cosine"))
# centered
ITCF_C_C <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "center", method = "Cosine"))
# Z-score normalization
ITCF_Z_C <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "Z-score", method = "Cosine"))
## train ITCF Euclidean distance models
# non-normalized
ITCF_N_E <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = NULL, method = "Euclidean"))
# centered
ITCF_C_E <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "center", method = "Euclidean"))
# Z-score normalization
ITCF_Z_E <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "Z-score", method = "Euclidean"))
## train ITCF Pearson correlation models
# non-normalized
ITCF_N_P <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = NULL, method = "pearson"))
# centered
ITCF_C_P <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "center", method = "pearson"))
# Z-score normalization
ITCF_Z_P <- Recommender(getData(Jester5k_es, "train"), type,
                        param = list(normalize = "Z-score", method = "pearson"))

We now have the ITCF models, so let's compute the performance of each of them on the test data. The objective is to identify the best-performing ITCF model for this dataset. The following code obtains the performance measurements for all nine models on the test dataset:

# compute predicted ratings from each of the 9 models on the test dataset
pred1 <- predict(ITCF_N_C, getData(Jester5k_es, "known"), type="ratings")
pred2 <- predict(ITCF_C_C, getData(Jester5k_es, "known"), type="ratings")
pred3 <- predict(ITCF_Z_C, getData(Jester5k_es, "known"), type="ratings")
pred4 <- predict(ITCF_N_E, getData(Jester5k_es, "known"), type="ratings")
pred5 <- predict(ITCF_C_E, getData(Jester5k_es, "known"), type="ratings")
pred6 <- predict(ITCF_Z_E, getData(Jester5k_es, "known"), type="ratings")
pred7 <- predict(ITCF_N_P, getData(Jester5k_es, "known"), type="ratings")
pred8 <- predict(ITCF_C_P, getData(Jester5k_es, "known"), type="ratings")
pred9 <- predict(ITCF_Z_P, getData(Jester5k_es, "known"), type="ratings")
# set all predictions that fall outside the valid range to the boundary values
pred1@data@x[pred1@data@x[] < -10] <- -10
pred1@data@x[pred1@data@x[] > 10] <- 10
pred2@data@x[pred2@data@x[] < -10] <- -10
pred2@data@x[pred2@data@x[] > 10] <- 10
pred3@data@x[pred3@data@x[] < -10] <- -10
pred3@data@x[pred3@data@x[] > 10] <- 10
pred4@data@x[pred4@data@x[] < -10] <- -10
pred4@data@x[pred4@data@x[] > 10] <- 10
pred5@data@x[pred5@data@x[] < -10] <- -10
pred5@data@x[pred5@data@x[] > 10] <- 10
pred6@data@x[pred6@data@x[] < -10] <- -10
pred6@data@x[pred6@data@x[] > 10] <- 10
pred7@data@x[pred7@data@x[] < -10] <- -10
pred7@data@x[pred7@data@x[] > 10] <- 10
pred8@data@x[pred8@data@x[] < -10] <- -10
pred8@data@x[pred8@data@x[] > 10] <- 10
pred9@data@x[pred9@data@x[] < -10] <- -10
pred9@data@x[pred9@data@x[] > 10] <- 10
# aggregate the performance measurements obtained from all the models
error_ITCF <- rbind(
ITCF_N_C = calcPredictionAccuracy(pred1, getData(Jester5k_es, "unknown")),
ITCF_C_C = calcPredictionAccuracy(pred2, getData(Jester5k_es, "unknown")),
ITCF_Z_C = calcPredictionAccuracy(pred3, getData(Jester5k_es, "unknown")),
ITCF_N_E = calcPredictionAccuracy(pred4, getData(Jester5k_es, "unknown")),
ITCF_C_E = calcPredictionAccuracy(pred5, getData(Jester5k_es, "unknown")),
ITCF_Z_E = calcPredictionAccuracy(pred6, getData(Jester5k_es, "unknown")),
ITCF_N_P = calcPredictionAccuracy(pred7, getData(Jester5k_es, "unknown")),
ITCF_C_P = calcPredictionAccuracy(pred8, getData(Jester5k_es, "unknown")),
ITCF_Z_P = calcPredictionAccuracy(pred9, getData(Jester5k_es, "unknown"))
)
library(knitr)
kable(error_ITCF)

This will result in the following output:

|         |     RMSE|      MSE|      MAE|
|:--------|--------:|--------:|--------:|
|ITCF_N_C | 4.533455| 20.55221| 3.460860|
|ITCF_C_C | 5.082643| 25.83326| 4.012391|
|ITCF_Z_C | 5.089552| 25.90354| 4.021435|
|ITCF_N_E | 4.520893| 20.43848| 3.462490|
|ITCF_C_E | 4.519783| 20.42844| 3.462271|
|ITCF_Z_E | 4.527953| 20.50236| 3.472080|
|ITCF_N_P | 4.582121| 20.99583| 3.522113|
|ITCF_C_P | 4.545966| 20.66581| 3.510830|
|ITCF_Z_P | 4.569294| 20.87845| 3.536400|

From the output, we see that the ITCF models using the Euclidean distance yielded the best performance measurements: the model trained on centered data (ITCF_C_E) has the lowest RMSE and MSE, with the non-normalized Euclidean model close behind.
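Rather than reading the table by eye, the winner can also be picked programmatically from the error_ITCF matrix. A minimal sketch, recreating just two of the rows from the table above for illustration:

```r
# two rows of accuracy measures, taken from the table above
error_ITCF <- rbind(
  ITCF_N_C = c(RMSE = 4.533455, MSE = 20.55221, MAE = 3.460860),
  ITCF_C_E = c(RMSE = 4.519783, MSE = 20.42844, MAE = 3.462271)
)

# name of the model with the lowest RMSE
best <- rownames(error_ITCF)[which.min(error_ITCF[, "RMSE"])]
best
```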
