User-based models

We discussed the user-based model in the previous sections; let us do a quick recap here. The user-based model for collaborative filtering tries to mimic the word-of-mouth approach in marketing. It is a memory-based model. The premise of this algorithm is that similar users will have similar tastes in jokes and will therefore rate jokes in a more or less similar manner. It is a two-step process. In the first step, for a given user, the algorithm finds his neighbors, using a similarity measure such as the Pearson coefficient or cosine distance. In the second step, for an item not rated by the user, we look at whether the user's neighbors have rated that item; if they have, an average of the neighbors' ratings is taken as the predicted rating for this user.
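The two steps above can be sketched directly in base R. This is a toy example with a made-up ratings matrix (the user names, ratings, and neighborhood size are illustrative and not taken from Jester5k):

```r
# Toy ratings matrix: 4 users x 5 jokes, NA = joke not rated
ratings <- rbind(
  u1 = c( 5,  3, NA,  4,  2),
  u2 = c( 4,  2,  1,  4,  1),
  u3 = c(-3, -4,  2, NA, -5),
  u4 = c( 5,  4, NA,  5, NA)
)

# Step 1: cosine similarity between the active user and every other
# user, computed over the jokes that both of them have rated
cosine <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
}
active <- "u1"
sims <- sapply(setdiff(rownames(ratings), active),
               function(u) cosine(ratings[active, ], ratings[u, ]))

# Step 2: predict an unrated item as the similarity-weighted average
# of the nn nearest neighbors' ratings for that item
predict_item <- function(item, nn = 2) {
  nbrs <- names(sort(sims, decreasing = TRUE))[1:nn]
  r <- ratings[nbrs, item]
  ok <- !is.na(r)
  sum(sims[nbrs][ok] * r[ok]) / sum(abs(sims[nbrs][ok]))
}
predict_item(3)  # predicted rating of joke 3 for user u1
```

Here `u2` and `u4` turn out to be `u1`'s nearest neighbors, so the prediction for joke 3 is built only from their ratings of it.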

Let us prepare the data:

# Load recommenderlab and the Jester joke ratings
library(recommenderlab)
data(Jester5k)

set.seed(100)
# Sample 1500 users; split them 90/10 into train and test, treating
# 10 ratings of each test user as known
data <- sample(Jester5k, 1500)
plan <- evaluationScheme(data, method = "split", train = 0.9, given = 10, goodRating = 1)
train <- getData(plan, "train")
test <- getData(plan, "unknown")
test.known <- getData(plan, "known")

With our input data prepared, let us proceed to build the model.

Building a user-based model looks like this:

> plan <- evaluationScheme(data, method="cross", train=0.9, given = 10, goodRating=5)
> results <- evaluate(plan, method = "UBCF", type = "topNList", n = c(5,10,15) )
UBCF run fold/sample [model time/prediction time]
1 [0.017sec/0.268sec]
2 [0.016sec/0.267sec]
3 [0.01sec/0.284sec]
4 [0.01sec/0.273sec]
5 [0.009sec/0.273sec]
6 [0.009sec/0.272sec]
7 [0.009sec/0.508sec]
8 [0.009sec/0.236sec]
9 [0.009sec/0.268sec]
10 [0.01sec/0.262sec]
> avg(results)
         TP       FP       FN       TN precision    recall       TPR        FPR
5  2.024000 2.976000 14.40600 70.59400 0.4048000 0.1586955 0.1586955 0.03877853
10 3.838667 6.161333 12.59133 67.40867 0.3838667 0.2888018 0.2888018 0.08048999
15 5.448000 9.552000 10.98200 64.01800 0.3632000 0.3987303 0.3987303 0.12502479
>

Using the framework defined in the previous section, we can create a user-based recommendation system, as shown in the preceding code, and evaluate its performance. Here we have used a cross-validation scheme (which defaults to 10 folds) to evaluate the model's performance.

We call the evaluate method with our cross-validation scheme and select the user-based model through the method parameter; UBCF stands for user-based collaborative filtering. Once again we are interested only in the top-N recommendations, and N is now a vector of three values: 5, 10, and 15. We want to evaluate the model for all three values of N, so we pass the whole vector. Finally, when we inspect the model's performance through the results object, the metrics are averaged across the 10 folds, for each of the three values of N.

An alternative way to evaluate the model is to have it predict ratings for all the unknown items in the test data, and then compare the predicted and actual ratings using a metric such as the root mean square error (RMSE), mean squared error (MSE), or mean absolute error (MAE).

Predict all the ratings:

> results.1 <- evaluate(plan, method ="UBCF", type ="ratings")
UBCF run fold/sample [model time/prediction time]
1 [0.01sec/0.395sec]
2 [0.011sec/0.223sec]
3 [0.01sec/0.227sec]
4 [0.011sec/0.247sec]
5 [0.01sec/0.221sec]
6 [0.009sec/0.213sec]
7 [0.013sec/0.247sec]
8 [0.009sec/0.401sec]
9 [0.011sec/0.242sec]
10 [0.009sec/0.243sec]
> avg(results.1)
        RMSE      MSE      MAE
res 4.559954 20.80655 3.573544
>
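The same error metrics can also be computed by hand, using the split scheme we prepared earlier: train a single model on the training data, predict ratings for the test users from their known ratings, and compare against the held-out ratings. This is a sketch using recommenderlab's Recommender, predict, and calcPredictionAccuracy functions; the exact numbers will differ from the cross-validated averages above:

```r
# Train one UBCF model on the training split
rec <- Recommender(train, method = "UBCF")

# Predict ratings for the test users from their known ratings
pred <- predict(rec, test.known, type = "ratings")

# Compare predictions against the held-out (unknown) ratings
calcPredictionAccuracy(pred, test)
```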

We can further improve the model by changing the parameters:

param = list(normalize = "center", method = "Pearson", nn = 10)

Here we are saying that, for our user-based model, we want to normalize the data. Since the rating scale is the same across users (from -10 to +10), we only need to center the data to a zero mean. Further, we want to use the Pearson coefficient as our similarity measure. Finally, we want to use 10 neighbors to compute the recommendations.

Also, since we have 100 jokes, we can afford to raise given to around 30 known jokes per test user in test.known.

Let us make these changes and evaluate our model:

> plan <- evaluationScheme(data, method="cross", train=0.9, given = 30, goodRating=5)
> results.1 <- evaluate(plan, method ="UBCF", param=param, type ="ratings")
UBCF run fold/sample [model time/prediction time]
1 [0.01sec/0.223sec]
2 [0.014sec/0.23sec]
3 [0.008sec/0.402sec]
4 [0.009sec/0.24sec]
5 [0.011sec/0.245sec]
6 [0.01sec/0.233sec]
7 [0.009sec/0.227sec]
8 [0.009sec/0.232sec]
9 [0.009sec/0.209sec]
10 [0.014sec/0.218sec]
> avg(results.1)
        RMSE      MSE      MAE
res 4.427301 19.61291 3.503708

We can see that our RMSE has gone down. Similarly, we can make further changes and continue tuning the model.
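One simple way to continue tuning is to sweep the neighborhood size and compare the averaged error metrics for each setting. This is a sketch following the same conventions as the code above; the candidate values of nn are arbitrary:

```r
# Try several neighborhood sizes and print the averaged error metrics
for (k in c(5, 10, 20, 30)) {
  p <- list(normalize = "center", method = "Pearson", nn = k)
  res <- evaluate(plan, method = "UBCF", param = p, type = "ratings")
  print(avg(res))
}
```

The same loop could be extended to sweep the similarity measure or the normalization method as well.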
