Use case and data

We will build our recommender system with collaborative filtering on the Jester5k dataset. It contains user ratings, on a scale of -10 to 10, for 100 jokes. In this chapter, we will use these ratings as input and predict ratings for jokes that users have not yet seen or rated. For more information about the Jester5k dataset, visit: http://www.ieor.berkeley.edu/%7Egoldberg/jester-data/

Fortunately, a sample of this data is available with the recommenderlab package.

Let us quickly look at this data:

> library(recommenderlab, quietly = TRUE)
> data("Jester5k")

> str(Jester5k)
Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
  ..@ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:362106] 0 1 2 3 4 5 6 7 8 9 ...
  .. .. ..@ p       : int [1:101] 0 3314 6962 10300 13442 18440 22513 27512 32512 35685 ...
  .. .. ..@ Dim     : int [1:2] 5000 100
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:5000] "u2841" "u15547" "u15221" "u15573" ...
  .. .. .. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
  .. .. ..@ x       : num [1:362106] 7.91 -3.2 -1.7 -7.38 0.1 0.83 2.91 -2.77 -3.35 -1.99 ...
  .. .. ..@ factors : list()
  ..@ normalize: NULL

> head(Jester5k@data[1:5,1:5])
5 x 5 sparse Matrix of class "dgCMatrix"
          j1    j2    j3    j4    j5
u2841   7.91  9.17  5.34  8.16 -8.74
u15547 -3.20 -3.50 -9.56 -8.74 -6.36
u15221 -1.70  1.21  1.55  2.77  5.58
u15573 -7.38 -8.93 -3.88 -7.23 -4.90
u21505  0.10  4.17  4.90  1.55  5.53

The data is stored as a realRatingMatrix. We can examine the slots of this object by calling the str() function. Looking at the data slot, we see that it is a sparse matrix with 5,000 rows and 100 columns. The rows represent users and the columns represent jokes. If all users had rated all jokes, we would have a total of 500,000 ratings. However, we have only 362,106 ratings.
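As a quick check of that arithmetic, the sketch below computes how full the matrix is, using only the figures reported by str() above. The variable names are ours, chosen for illustration.

```r
# Figures taken from the str() output above
n.users   <- 5000
n.jokes   <- 100
n.ratings <- 362106

possible <- n.users * n.jokes      # ratings if every user rated every joke
density  <- n.ratings / possible   # fraction of cells that hold a rating
missing  <- possible - n.ratings   # cells with no rating

print(missing)
print(round(density * 100, 1))     # percentage of cells that are filled
```

Roughly 72 percent of the cells hold a rating, so a little over a quarter of the user-joke pairs are unrated.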

As the name of the dataset suggests, Jester5k has 5,000 users and their ratings for 100 jokes.

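realRatingMatrix is a formal (S4) class, as the str() output shows, which is why we access its parts as slots with the @ operator. S4 is one of R's object-oriented systems, alongside the simpler S3. Refer to http://adv-r.had.co.nz/S3.html to get more details about the S3 object system.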

The head() call shown earlier displays the first five rows and columns of the realRatingMatrix, giving a quick look at the ratings.

Let us look at the ratings provided by two random users, 1 and 100:

> Jester5k@data[1,]
   j1    j2    j3    j4    j5    j6    j7    j8    j9   j10   j11   j12   j13   j14   j15   j16   j17   j18 
 7.91  9.17  5.34  8.16 -8.74  7.14  8.88 -8.25  5.87  6.21  7.72  6.12 -0.73  7.77 -5.83 -8.88  8.98 -9.32 
  j19   j20   j21   j22   j23   j24   j25   j26   j27   j28   j29   j30   j31   j32   j33   j34   j35   j36 
-9.08 -9.13  7.77  8.59  5.29  8.25  6.02  5.24  7.82  7.96 -8.88  8.25  3.64 -0.73  8.25  5.34 -7.77 -9.76 
  j37   j38   j39   j40   j41   j42   j43   j44   j45   j46   j47   j48   j49   j50   j51   j52   j53   j54 
 7.04  5.78  8.06  7.23  8.45  9.08  6.75  5.87  8.45 -9.42  5.15  8.74  6.41  8.64  8.45  9.13 -8.79  6.17 
  j55   j56   j57   j58   j59   j60   j61   j62   j63   j64   j65   j66   j67   j68   j69   j70   j71   j72 
 8.25  6.89  5.73  5.73  8.20  6.46  8.64  3.59  7.28  8.25  4.81 -8.20  5.73  7.04  4.56  8.79  0.00  0.00 
  j73   j74   j75   j76   j77   j78   j79   j80   j81   j82   j83   j84   j85   j86   j87   j88   j89   j90 
 0.00  0.00  0.00  0.00 -9.71  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 
  j91   j92   j93   j94   j95   j96   j97   j98   j99  j100 
 7.57 -9.42 -9.27  7.62  7.77  8.20  6.60  7.33  9.17  8.88 

> Jester5k@data[100,]
   j1    j2    j3    j4    j5    j6    j7    j8    j9   j10   j11   j12   j13   j14   j15   j16   j17   j18 
-2.48  3.93  2.72 -2.67  1.75  3.35  0.73 -0.53 -0.58  3.88  3.16  1.17  0.53  1.65  1.26 -4.08 -0.49 -3.79 
  j19   j20   j21   j22   j23   j24   j25   j26   j27   j28   j29   j30   j31   j32   j33   j34   j35   j36 
-3.06 -2.33  3.59  0.58  0.39  0.53  2.38 -0.05  2.43 -0.34  3.35  2.04  2.33  3.54 -0.19 -0.24  2.62  3.83 
  j37   j38   j39   j40   j41   j42   j43   j44   j45   j46   j47   j48   j49   j50   j51   j52   j53   j54 
-2.52  5.19  1.75  0.00  0.39  1.75 -3.64 -2.28  2.33  3.16 -2.48  0.19  2.82  4.22 -0.19  3.30 -0.53  3.45 
  j55   j56   j57   j58   j59   j60   j61   j62   j63   j64   j65   j66   j67   j68   j69   j70   j71   j72 
-0.53  0.97 -2.91 -8.25 -0.29  2.52  4.66  3.50 -0.24  3.64 -0.05  1.21 -3.25  1.17 -2.57 -2.18 -5.44  2.67 
  j73   j74   j75   j76   j77   j78   j79   j80   j81   j82   j83   j84   j85   j86   j87   j88   j89   j90 
 2.57 -4.03  2.96  3.40  1.12  1.36 -3.01  2.96  2.04 -3.25  1.94 -3.40 -3.50 -3.45 -3.06  2.04  3.20  3.06 
  j91   j92   j93   j94   j95   j96   j97   j98   j99  j100 
 2.86 -5.15  3.01  0.83 -6.21 -6.60 -6.31  3.69 -4.22  0.97 
>

As we can see in the data, users have not rated all the jokes. An unrated joke appears as a zero, and both users shown above have zeros for some of the jokes. To get the number of jokes a user rated, we can run:

sum(Jester5k@data[100,] != 0) # answer = 99; only j40 is zero in the output above
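The choice of comparison matters here, because ratings can be negative. A minimal sketch on a made-up vector (the values and names are illustrative, not taken from Jester5k) shows the difference between counting positive ratings and counting rated jokes:

```r
# Hypothetical ratings for one user; 0 marks a joke that was not rated
user.ratings <- c(j1 = 7.91, j2 = -3.20, j3 = 0, j4 = -8.74, j5 = 0.10, j6 = 0)

n.positive <- sum(user.ratings > 0)    # positive ratings only
n.rated    <- sum(user.ratings != 0)   # every rated joke, positive or negative
n.unrated  <- sum(user.ratings == 0)   # jokes left unrated

print(c(positive = n.positive, rated = n.rated, unrated = n.unrated))
```

Comparing with > 0 would miss the two negative ratings in this vector, so != 0 is the safer test when the goal is to count rated jokes.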

Let us dig a little deeper to see the distribution of these zero ratings:

> zero.ratings <- rowSums(Jester5k@data == 0)
> zero.ratings.df <- data.frame("user" = names(zero.ratings), "count" = zero.ratings)
> head(zero.ratings.df)
         user count
u2841   u2841    19
u15547 u15547    29
u15221 u15221     0
u15573 u15573     0
u21505 u21505    28
u15994 u15994     1

> head(zero.ratings.df[order(-zero.ratings.df$count),], 10)
         user count
u3228   u3228    66
u5768   u5768    65
u10701 u10701    65
u7533   u7533    65
u19356 u19356    65
u7155   u7155    65
u7786   u7786    65
u7161   u7161    65
u15037 u15037    65
u7904   u7904    64
>

We want to see, for each user, the number of jokes they have not rated. We get this by comparing the ratings matrix with zero and summing the result across each row. We then create a dataframe, zero.ratings.df, with two columns: the first column is the user and the second column is the number of zero entries they have in the ratings matrix, that is, the number of jokes they did not rate. Further, we order the dataframe in descending order of the count. We can see that user u3228 has not rated 66 jokes.
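The same steps can be traced on a small made-up matrix, where the counts are easy to verify by eye. The matrix toy and its values are assumptions for illustration only.

```r
# Toy ratings matrix: 4 hypothetical users x 5 jokes, 0 = not rated
toy <- matrix(c( 7.9, 0.0, -3.2, 0.0,  1.5,
                 0.0, 0.0,  0.0, 4.2, -6.1,
                 2.1, 3.3, -0.5, 8.8,  9.0,
                -1.0, 0.0,  5.5, 0.0,  0.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("u", 1:4), paste0("j", 1:5)))

# toy == 0 is a logical matrix; rowSums counts the TRUE values in each row
toy.zero <- rowSums(toy == 0)
toy.zero.df <- data.frame(user = names(toy.zero), count = toy.zero)

# Users with the most unrated jokes come first
print(toy.zero.df[order(-toy.zero.df$count), ])
```

Users u2 and u4 each have three unrated jokes, u1 has two, and u3 has rated everything.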

Let us use this data to make a histogram to see the underlying distribution:

> hist(zero.ratings.df$count, main ="Distribution of zero rated jokes")

The histogram shows the distribution of zero-rated jokes:

The histogram shows the count of zeros (that is, unrated jokes) per user, so a low value on the x-axis means that the user has rated more jokes. The 0-5 bin illustrates this: users in it left at most 5 of the 100 jokes unrated. The distribution also appears to have three modes.

A density plot may illustrate this more clearly:

> zero.density <- density(zero.ratings.df$count)
> plot(zero.density)

The three modes are evident from the density plot. We have three groups of users in our data: those who have few unrated jokes, a group with around 25 unrated jokes, and a final group with around 65 unrated jokes.

It's a good practice to have an overview of the underlying distribution of the data. In this case, it may be that we want to build three different models based on which group the user falls in.

We can verify this empirically with a clustering algorithm. Let us use k-means:

> model <- kmeans(zero.ratings.df$count,3 )
> model$centers
       [,1]
1 54.845633
2  1.358769
3 29.366702
> model$size
[1] 1477 1625 1898
> model.df <- data.frame(centers = model$centers, size = model$size, perc = (model$size / 5000) * 100)
> head(model.df)
    centers size  perc
1 54.845633 1477 29.54
2  1.358769 1625 32.50
3 29.366702 1898 37.96

We check our user groups by running the k-means algorithm on the counts of unrated jokes. We set the number of clusters to 3 and collect the results in a dataframe named model.df. The cluster centers reflect the average number of unrated jokes in each cluster. Because k-means starts from randomly chosen centers, the cluster numbering may differ from run to run. More information about R k-means can be found at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
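Since k-means depends on a random start, it can help to fix the seed and sort the result by center. The sketch below does this on a small made-up vector of counts; the data, the seed value, and the names toy.model and toy.df are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the run repeatable

# Hypothetical unrated-joke counts for 15 users in three loose groups
counts <- c(0, 1, 2, 1, 0, 28, 30, 29, 31, 27, 64, 65, 66, 65, 63)

# nstart = 10 tries ten random starts and keeps the best solution
toy.model <- kmeans(counts, centers = 3, nstart = 10)

toy.df <- data.frame(centers = as.numeric(toy.model$centers),
                     size    = toy.model$size,
                     perc    = toy.model$size / length(counts) * 100)

# Sort by center so the row order does not depend on the random start
toy.df <- toy.df[order(toy.df$centers), ]
print(toy.df)
```

Sorting by center makes the table comparable across runs, which is convenient when the cluster labels themselves carry no meaning.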

Now that we've looked at the user distribution, let us proceed to look at the ratings of the jokes:

> Jester5k@data[,1]
 u2841 u15547 u15221 u15573 u21505 u15994   u238  u5809 u16636 u12843 u17322 u13610  u7061 u23059  u7299 
  7.91  -3.20  -1.70  -7.38   0.10   0.83   2.91  -2.77  -3.35  -1.99  -0.68   9.17  -9.71  -3.16   5.58 
u20906  u7147  u6662  u4662  u5798  u7904  u7556  u3970   u999  u5462 u20231 u13120 u22827 u20747  u1143 
  9.08   0.00  -6.70   0.00   1.02   0.00  -3.01   5.87  -7.33   0.00   7.48   0.00  -9.71   0.00   0.00 
u11381  u6617  u7602 u12658  u4519 u18953  u5021  u6457 u24750 u20139 u13802 u16123  u7778 u15509  u8225 
  0.00   6.55   0.00   0.00   0.78  -0.10   0.00   0.00   0.00  -6.65   2.28   1.02   0.00  -8.35   5.53 
u12519 u16885 u12094  u6083 u19086  u1840  u7722 u17883 u12579  u3815 u12563 u12313 u18725  u4354 u21146 
  ....................
 [ reached getOption("max.print") -- omitted 4000 entries ]
> 
> par(mfrow=c(2,2))
> joke.density <- density(Jester5k@data[,1][Jester5k@data[,1]!=0])
> plot(joke.density)

> joke.density <- density(Jester5k@data[,25][Jester5k@data[,25]!=0])
> plot(joke.density)

> joke.density <- density(Jester5k@data[,75][Jester5k@data[,75]!=0])
> plot(joke.density)

> joke.density <- density(Jester5k@data[,100][Jester5k@data[,100]!=0])
> plot(joke.density)

We look at the first joke, Jester5k@data[,1], for its scoring values; the output shown in the preceding code is truncated.

Further, we draw density plots for four randomly selected jokes (1, 25, 75, and 100), leaving out the zeros, and look at the distribution of their scores.

The distribution graph is illustrated in the following figure:

For all four jokes, we see that the ratings lean above zero. The recommenderlab package provides a function, getRatings, which works on a realRatingMatrix object and returns its ratings as a vector.

Let us use getRatings to look at the distribution of all the ratings in our dataset:

hist(getRatings(Jester5k), main="Distribution of ratings")

The ratings distribution plot is shown in the following figure:

Let us now move on to see if we can find the most popular joke.

The R snippet to find the most popular joke is shown here:

> ratings.binary <- binarize(Jester5k, minRating =0)
> ratings.binary
5000 x 100 rating matrix of class 'binaryRatingMatrix' with 215798 ratings.
> ratings.sum <- colSums(ratings.binary)
> ratings.sum.df <- data.frame(joke = names(ratings.sum), pratings = ratings.sum)
> head( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
    joke pratings
j50  j50     4081
j36  j36     4021
j32  j32     3914
j35  j35     3853
j27  j27     3846
j53  j53     3843
j29  j29     3820
j62  j62     3814
j49  j49     3762
j68  j68     3713
> 
> tail( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
    joke pratings
j80  j80     1072
j90  j90     1057
j73  j73     1041
j77  j77     1012
j86  j86      994
j79  j79      934
j75  j75      895
j71  j71      796
j58  j58      695
j74  j74      689
>

We begin by binarizing our ratings matrix. In the new matrix, ratings.binary, ratings of 0 or more are treated as positive ratings and all the others as negative ratings. We create a dataframe, ratings.sum.df, with two columns: the joke name and the number of positive ratings the joke received. Since we have binarized the matrix, this sum serves as a measure of the joke's popularity. Displaying the dataframe in descending order of the sum, we see the most popular jokes at the head and the least popular jokes at the tail.
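The idea behind binarizing can be sketched in base R on a small made-up matrix. This only illustrates the thresholding logic and is not recommenderlab's implementation; the matrix toy, the threshold min.rating, and the use of NA for unrated jokes are assumptions made for the example.

```r
# Toy ratings: 4 hypothetical users x 3 jokes, NA = not rated
toy <- matrix(c( 5.2, -1.0,   NA,
                 0.0,  3.3, -7.5,
                  NA,  8.1,  2.4,
                -4.4,  6.0,   NA),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("u", 1:4), paste0("j", 1:3)))

# TRUE where a rating exists and is at least the threshold
min.rating <- 0
toy.binary <- !is.na(toy) & toy >= min.rating

# Column sums of the logical matrix count the positive ratings per joke
popularity <- colSums(toy.binary)
print(sort(popularity, decreasing = TRUE))
```

In this toy data, j2 comes out as the most popular joke with three positive ratings, followed by j1 with two and j3 with one.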

Finally, we are going to sample 1,500 users from the dataset and use that sample as our dataset for the rest of the chapter. Sampling is often used during exploration: by taking a smaller subset of the data, you can explore it and produce initial models more quickly. Then, when the time comes, you can apply the best approach to the entire dataset.
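Because sampling is random, each run picks a different set of users unless the random seed is fixed first. A minimal base R sketch of reproducible row sampling follows; the toy matrix and the seed value are assumptions for illustration, and we would expect calling set.seed() before sampling Jester5k to have the same effect.

```r
set.seed(123)  # arbitrary seed; any fixed value makes the draw repeatable

# Toy ratings matrix: 6 hypothetical users x 4 jokes
toy <- matrix(round(runif(24, min = -10, max = 10), 2),
              nrow = 6,
              dimnames = list(paste0("u", 1:6), paste0("j", 1:4)))

# Draw 3 distinct row indices, then keep only those users
idx <- sample(nrow(toy), 3)
toy.sample <- toy[idx, ]

print(dim(toy.sample))
print(rownames(toy.sample))
```

Rerunning the block from the set.seed() call selects the same three users each time, which keeps later results comparable.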

Sampling the dataset looks like this:

data <- sample(Jester5k, 1500)
hist(getRatings(data), main="Distribution of ratings for 1500 users")

The following is the image for the distribution of ratings for 1500 users:

Hopefully, the data exploration we performed in this section has given you a good overview of the underlying dataset. Let us proceed now to build our joke recommendation system.
