Use case and data

The Jester5k dataset is what we will use to build our recommender system with collaborative filtering. It contains user ratings on a scale of -10 to 10 for several jokes. In this chapter, we will use these ratings as input and predict ratings for jokes that the users have not yet seen or rated. For more information about the Jester5k dataset, visit: http://www.ieor.berkeley.edu/%7Egoldberg/jester-data/

Fortunately, a sample of this data is available with the recommenderlab package.

Let us quickly look at this data:

> library(recommenderlab, quietly = TRUE)
> data("Jester5k")

> str(Jester5k)
Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:362106] 0 1 2 3 4 5 6 7 8 9 ...
.. .. ..@ p : int [1:101] 0 3314 6962 10300 13442 18440 22513 27512 32512 35685 ...
.. .. ..@ Dim : int [1:2] 5000 100
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:5000] "u2841" "u15547" "u15221" "u15573" ...
.. .. .. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
.. .. ..@ x : num [1:362106] 7.91 -3.2 -1.7 -7.38 0.1 0.83 2.91 -2.77 -3.35 -1.99 ...
.. .. ..@ factors : list()
..@ normalize: NULL

> head(Jester5k@data[1:5,1:5])
5 x 5 sparse Matrix of class "dgCMatrix"
j1 j2 j3 j4 j5
u2841 7.91 9.17 5.34 8.16 -8.74
u15547 -3.20 -3.50 -9.56 -8.74 -6.36
u15221 -1.70 1.21 1.55 2.77 5.58
u15573 -7.38 -8.93 -3.88 -7.23 -4.90
u21505 0.10 4.17 4.90 1.55 5.53

The data is stored as a realRatingMatrix. We can examine the slots of this object by calling the str() function. Looking at the data slot, we see that it is a sparse matrix with 5,000 rows and 100 columns. The rows represent users and the columns represent jokes. If all users had rated all jokes, we would have a total of 500,000 ratings. However, we have only 362,106 ratings, so roughly 72% of the matrix is filled.
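The arithmetic behind these counts can be checked directly. The following is a minimal sketch that uses only numbers copied from the str() output above; the variable names are illustrative. In a column-compressed matrix such as dgCMatrix, the p slot holds column pointers, so the differences between consecutive values give the number of stored entries per column (here, per joke).

```r
# Numbers copied from the str(Jester5k) output; variable names are illustrative.
n.users  <- 5000
n.jokes  <- 100
n.stored <- 362106

# Fraction of the user-joke grid that holds a stored rating
fill.ratio <- n.stored / (n.users * n.jokes)
print(fill.ratio)

# First six column pointers from the p slot
p.head <- c(0, 3314, 6962, 10300, 13442, 18440)

# Stored entries for jokes j1 to j5
entries.per.joke <- diff(p.head)
names(entries.per.joke) <- paste0("j", 1:5)
print(entries.per.joke)
```

The first joke, for example, has 3,314 stored entries out of a possible 5,000.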

As the name of the dataset suggests, Jester5k holds the ratings of 5,000 users for 100 jokes.

S3 is one of R's object-oriented systems. Refer to http://adv-r.had.co.nz/S3.html for more details about the S3 object system. Note that realRatingMatrix, being a formal class with slots, is an S4 class.

The head() output shown earlier gives a quick look at the ratings in the first five rows and columns of the matrix.

Let us look at the ratings provided by two random users, 1 and 100:

> Jester5k@data[1,]
j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 j11 j12 j13 j14 j15 j16 j17 j18
7.91 9.17 5.34 8.16 -8.74 7.14 8.88 -8.25 5.87 6.21 7.72 6.12 -0.73 7.77 -5.83 -8.88 8.98 -9.32
j19 j20 j21 j22 j23 j24 j25 j26 j27 j28 j29 j30 j31 j32 j33 j34 j35 j36
-9.08 -9.13 7.77 8.59 5.29 8.25 6.02 5.24 7.82 7.96 -8.88 8.25 3.64 -0.73 8.25 5.34 -7.77 -9.76
j37 j38 j39 j40 j41 j42 j43 j44 j45 j46 j47 j48 j49 j50 j51 j52 j53 j54
7.04 5.78 8.06 7.23 8.45 9.08 6.75 5.87 8.45 -9.42 5.15 8.74 6.41 8.64 8.45 9.13 -8.79 6.17
j55 j56 j57 j58 j59 j60 j61 j62 j63 j64 j65 j66 j67 j68 j69 j70 j71 j72
8.25 6.89 5.73 5.73 8.20 6.46 8.64 3.59 7.28 8.25 4.81 -8.20 5.73 7.04 4.56 8.79 0.00 0.00
j73 j74 j75 j76 j77 j78 j79 j80 j81 j82 j83 j84 j85 j86 j87 j88 j89 j90
0.00 0.00 0.00 0.00 -9.71 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
j91 j92 j93 j94 j95 j96 j97 j98 j99 j100
7.57 -9.42 -9.27 7.62 7.77 8.20 6.60 7.33 9.17 8.88

> Jester5k@data[100,]
j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 j11 j12 j13 j14 j15 j16 j17 j18
-2.48 3.93 2.72 -2.67 1.75 3.35 0.73 -0.53 -0.58 3.88 3.16 1.17 0.53 1.65 1.26 -4.08 -0.49 -3.79
j19 j20 j21 j22 j23 j24 j25 j26 j27 j28 j29 j30 j31 j32 j33 j34 j35 j36
-3.06 -2.33 3.59 0.58 0.39 0.53 2.38 -0.05 2.43 -0.34 3.35 2.04 2.33 3.54 -0.19 -0.24 2.62 3.83
j37 j38 j39 j40 j41 j42 j43 j44 j45 j46 j47 j48 j49 j50 j51 j52 j53 j54
-2.52 5.19 1.75 0.00 0.39 1.75 -3.64 -2.28 2.33 3.16 -2.48 0.19 2.82 4.22 -0.19 3.30 -0.53 3.45
j55 j56 j57 j58 j59 j60 j61 j62 j63 j64 j65 j66 j67 j68 j69 j70 j71 j72
-0.53 0.97 -2.91 -8.25 -0.29 2.52 4.66 3.50 -0.24 3.64 -0.05 1.21 -3.25 1.17 -2.57 -2.18 -5.44 2.67
j73 j74 j75 j76 j77 j78 j79 j80 j81 j82 j83 j84 j85 j86 j87 j88 j89 j90
2.57 -4.03 2.96 3.40 1.12 1.36 -3.01 2.96 2.04 -3.25 1.94 -3.40 -3.50 -3.45 -3.06 2.04 3.20 3.06
j91 j92 j93 j94 j95 j96 j97 j98 j99 j100
2.86 -5.15 3.01 0.83 -6.21 -6.60 -6.31 3.69 -4.22 0.97
>

As we can see in the data, users have not rated all the jokes. An unrated joke is stored as a zero, and both users shown above have zeros for some of the jokes. To get the number of jokes a user has rated, we count the non-zero entries:

length(Jester5k@data[100,][Jester5k@data[100,] != 0]) # answer = 99; only j40 is zero in the output above
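The choice of comparison matters when counting rated jokes, because ratings can be negative. The following is a small sketch on a made-up ratings vector (the user and values are hypothetical) that contrasts counting non-zero entries with counting positive entries.

```r
# Toy ratings for one hypothetical user; 0 marks an unrated joke, as in the text.
ratings <- c(j1 = 7.91, j2 = -3.20, j3 = 0, j4 = -8.74, j5 = 0.10, j6 = 0)

n.rated    <- sum(ratings != 0)  # rated jokes, whatever the sign of the rating
n.positive <- sum(ratings > 0)   # only jokes rated above zero

print(c(rated = n.rated, positive = n.positive))
```

Here four jokes are rated but only two of them positively, so a `> 0` test would undercount the rated jokes.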

Let us dig a little deeper and look at the distribution of these zero ratings:

> zero.ratings <- rowSums(Jester5k@data == 0)
> zero.ratings.df <- data.frame("user" = names(zero.ratings), "count" = zero.ratings)
> head(zero.ratings.df)
user count
u2841 u2841 19
u15547 u15547 29
u15221 u15221 0
u15573 u15573 0
u21505 u21505 28
u15994 u15994 1

> head(zero.ratings.df[order(-zero.ratings.df$count),], 10)
user count
u3228 u3228 66
u5768 u5768 65
u10701 u10701 65
u7533 u7533 65
u19356 u19356 65
u7155 u7155 65
u7786 u7786 65
u7161 u7161 65
u15037 u15037 65
u7904 u7904 64
>

We want the count of unrated jokes per user. The comparison Jester5k@data == 0 marks each zero entry, and rowSums adds up these marks for each row, that is, for each user. We then create a dataframe, zero.ratings.df, with two columns: the first is the user and the second is the number of zero entries they have in the ratings matrix, that is, the number of jokes they did not rate. Finally, we sort the dataframe in descending order of the count. We can see that user u3228 has not rated 66 jokes.
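The same steps can be traced on a matrix small enough to check by eye. This is a sketch with made-up users and ratings, using a plain base R matrix rather than the sparse matrix from the package.

```r
# Toy ratings: 3 hypothetical users x 4 jokes; 0 marks an unrated joke.
m <- matrix(c( 7.91, 9.17, 0.00, 0.00,
              -3.20, 0.00, 0.00, 0.00,
              -1.70, 1.21, 1.55, 2.77),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("uA", "uB", "uC"), paste0("j", 1:4)))

# TRUE where the entry is zero; rowSums counts the TRUEs per user
zero.counts <- rowSums(m == 0)

toy.df <- data.frame(user = names(zero.counts), count = zero.counts)
toy.df <- toy.df[order(-toy.df$count), ]  # most unrated jokes first
print(toy.df)
```

User uB, with three unrated jokes, ends up at the top, followed by uA with two and uC with none.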

Let us use this data to make a histogram to see the underlying distribution:

> hist(zero.ratings.df$count, main ="Distribution of zero rated jokes")

The histogram shows the distribution of zero-rated jokes:

The histogram shows the count of zeros (that is, unrated jokes) per user, so a low value on the x-axis means that the user has rated more jokes. The 0-5 bin, for instance, holds the users who left at most five of the 100 jokes unrated. Overall, the distribution appears to have three modes.

A density plot may show this more clearly:

> zero.density <- density(zero.ratings.df$count)
> plot(zero.density)

The three modes are evident from the density plot. We have three groups of users in our dataset: those who have left very few jokes unrated, a second group with around 25 jokes unrated, and a final group with around 65 jokes unrated.

It's a good practice to have an overview of the underlying distribution of the data. In this case, it may be that we want to build three different models based on which group the user falls in.

We can verify these groups empirically with a clustering algorithm. Let us use k-means:

> model <- kmeans(zero.ratings.df$count,3 )
> model$centers
[,1]
1 54.845633
2 1.358769
3 29.366702
> model$size
[1] 1477 1625 1898
> model.df <- data.frame(centers = model$centers, size = model$size, perc = (model$size / 5000) * 100)
> head(model.df)
centers size perc
1 54.845633 1477 29.54
2 1.358769 1625 32.50
3 29.366702 1898 37.96

We can check our user groups by running the k-means algorithm on the counts. We set the parameter k to the number of clusters, 3, and collect the results in a dataframe named model.df. The cluster centers reflect the average number of jokes not rated by the users in each cluster: about 1, 29, and 55. Because k-means starts from randomly chosen centers, the cluster numbering may differ between runs. More information about R k-means can be found at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
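To get repeatable output from k-means, one option is to fix the random seed and sort the clusters by their centers. The following is a sketch on simulated counts, not on the Jester5k data; the three Poisson groups, the seed, and the names toy.model and toy.summary are assumptions made for illustration.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatable output

# Simulated "unrated joke" counts for three hypothetical user groups
counts <- c(rpois(500, lambda = 1),
            rpois(500, lambda = 29),
            rpois(500, lambda = 55))

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit
toy.model <- kmeans(counts, centers = 3, nstart = 10)

# Cluster labels are arbitrary, so sort by center before reporting
ord <- order(toy.model$centers[, 1])
toy.summary <- data.frame(center = toy.model$centers[ord, 1],
                          size   = toy.model$size[ord],
                          perc   = 100 * toy.model$size[ord] / length(counts))
print(toy.summary)
```

Sorting by center makes the summary comparable across runs even when the cluster numbers change.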

Now that we've looked at the user distribution, let us proceed to look at the ratings of individual jokes:

> Jester5k@data[,1]
u2841 u15547 u15221 u15573 u21505 u15994 u238 u5809 u16636 u12843 u17322 u13610 u7061 u23059 u7299
7.91 -3.20 -1.70 -7.38 0.10 0.83 2.91 -2.77 -3.35 -1.99 -0.68 9.17 -9.71 -3.16 5.58
u20906 u7147 u6662 u4662 u5798 u7904 u7556 u3970 u999 u5462 u20231 u13120 u22827 u20747 u1143
9.08 0.00 -6.70 0.00 1.02 0.00 -3.01 5.87 -7.33 0.00 7.48 0.00 -9.71 0.00 0.00
u11381 u6617 u7602 u12658 u4519 u18953 u5021 u6457 u24750 u20139 u13802 u16123 u7778 u15509 u8225
0.00 6.55 0.00 0.00 0.78 -0.10 0.00 0.00 0.00 -6.65 2.28 1.02 0.00 -8.35 5.53
u12519 u16885 u12094 u6083 u19086 u1840 u7722 u17883 u12579 u3815 u12563 u12313 u18725 u4354 u21146
....................
[ reached getOption("max.print") -- omitted 4000 entries ]
>
> par(mfrow=c(2,2))
> joke.density <- density(Jester5k@data[,1][Jester5k@data[,1]!=0])
> plot(joke.density)

> joke.density <- density(Jester5k@data[,25][Jester5k@data[,25]!=0])
> plot(joke.density)

> joke.density <- density(Jester5k@data[,75][Jester5k@data[,75]!=0])
> plot(joke.density)

> joke.density <- density(Jester5k@data[,100][Jester5k@data[,100]!=0])
> plot(joke.density)

We look at the first joke, Jester5k@data[,1], for its scoring values; the output shown in the preceding code is truncated.

Further, we draw density plots for four arbitrarily selected jokes (1, 25, 75, and 100), leaving out the zero entries, and look at the distribution of scores.
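The four near-identical calls can also be written as a loop. This is a sketch on a randomly generated stand-in matrix (the matrix toy, its size, and the share of zeros are assumptions), so the resulting curves will not resemble the real joke distributions.

```r
set.seed(1)

# Toy stand-in for the ratings matrix: 200 hypothetical users x 100 jokes,
# with roughly 30% of the entries set to 0 to mimic unrated jokes.
toy <- matrix(round(runif(200 * 100, min = -10, max = 10), 2), nrow = 200,
              dimnames = list(NULL, paste0("j", 1:100)))
toy[runif(length(toy)) < 0.3] <- 0

jokes <- c(1, 25, 75, 100)

# One density estimate per joke, computed on non-zero ratings only
joke.densities <- lapply(jokes, function(j) {
  scores <- toy[, j]
  density(scores[scores != 0])
})
names(joke.densities) <- colnames(toy)[jokes]

par(mfrow = c(2, 2))  # 2 x 2 grid, one panel per joke
for (name in names(joke.densities)) {
  plot(joke.densities[[name]], main = paste("Joke", name))
}
```

Keeping the estimates in a named list makes it easy to add or swap jokes without repeating the filtering expression.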

The distribution graph is illustrated in the following figure:

For all four jokes, most of the ratings are above zero. The recommenderlab package provides a function, getRatings, which works on a realRatingMatrix object to retrieve the ratings.

Let us use getRatings to look at the distribution of all the ratings in our dataset:

hist(getRatings(Jester5k), main="Distribution of ratings")

The ratings distribution plot is shown in the following figure:

Let us now move on to see if we can find the most popular joke.

The R snippet to find the most popular joke is shown here:

> ratings.binary <- binarize(Jester5k, minRating =0)
> ratings.binary
5000 x 100 rating matrix of class 'binaryRatingMatrix' with 215798 ratings.
> ratings.sum <- colSums(ratings.binary)
> ratings.sum.df <- data.frame(joke = names(ratings.sum), pratings = ratings.sum)
> head( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
joke pratings
j50 j50 4081
j36 j36 4021
j32 j32 3914
j35 j35 3853
j27 j27 3846
j53 j53 3843
j29 j29 3820
j62 j62 3814
j49 j49 3762
j68 j68 3713
>
> tail( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
joke pratings
j80 j80 1072
j90 j90 1057
j73 j73 1041
j77 j77 1012
j86 j86 994
j79 j79 934
j75 j75 895
j71 j71 796
j58 j58 695
j74 j74 689
>

We begin by binarizing our ratings matrix. In the new matrix, ratings.binary, ratings of 0 or more are treated as positive ratings and all the others as negative ratings. We create a dataframe, ratings.sum.df, with two columns: the joke name and the number of positive ratings received by the joke. Since we have binarized the matrix, this count serves as our measure of a joke's popularity. Sorting the dataframe in descending order of the count, we see the most popular jokes with head() and the least popular jokes with tail().
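The idea behind the popularity count can be illustrated in base R on a tiny matrix. This sketch does not call binarize; it mimics the thresholding by hand, and it assumes a toy layout in which NA (rather than a sparse zero) marks an unrated joke. All names and values are made up.

```r
# Toy ratings: rows are hypothetical users, columns are jokes; NA = not rated.
toy <- matrix(c( 5.2, -3.1,   NA,
                 0.0,  4.4, -7.0,
                  NA,  8.9, -1.2,
                 2.3, -1.0,   NA),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("u", 1:4), c("jA", "jB", "jC")))

min.rating <- 0

# TRUE where a rating exists and is at least min.rating
toy.binary <- !is.na(toy) & toy >= min.rating

# Number of positive ratings per joke, most popular first
popularity <- sort(colSums(toy.binary), decreasing = TRUE)
print(popularity)
```

Joke jA collects three positive ratings, jB two, and jC none, so jA is the most popular joke under this definition.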

Finally, we are going to sample 1,500 users from the dataset and use that sample as our dataset for the rest of the chapter. Sampling is often used during exploration: by taking a smaller subset of the data, you can explore the data and produce initial models more quickly. Then, when the time comes, you can apply the best approach to the entire dataset.

Sampling the dataset looks like this:

data <- sample(Jester5k, 1500)
hist(getRatings(data), main="Distribution of ratings for 1500 users")
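Because the sample is random, each run picks a different set of users unless the seed is fixed. The following sketch shows the same idea in base R on a placeholder matrix, assuming, as the text states, that sampling keeps a random subset of users (rows). The seed value and the names toy, keep, and toy.sample are illustrative.

```r
set.seed(123)  # make the random sample repeatable

# Toy stand-in: 5000 hypothetical users x 100 jokes (all entries are placeholders)
toy <- matrix(0, nrow = 5000, ncol = 100,
              dimnames = list(paste0("u", 1:5000), paste0("j", 1:100)))

# Draw 1500 distinct row indices and keep those users
keep <- sample(nrow(toy), 1500)
toy.sample <- toy[keep, ]

print(dim(toy.sample))
```

Calling set.seed before sampling means later results in an analysis can be traced back to the same subset of users.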

The following is the image for the distribution of ratings for 1500 users:

Hopefully, the data exploration we performed in this section has given you a good overview of the underlying dataset. Let us proceed now to build our joke recommendation system.
