Introducing content-based recommendation

To understand the inner workings of a content-based recommendation system, let's look at a simple example. We will use the wine dataset from https://archive.ics.uci.edu/ml/datasets/wine

This dataset is the result of the chemical analysis of wines grown in the same region of Italy. We have data from three different cultivars (a cultivar is an assemblage of plants selected for desirable characteristics; see Wikipedia: https://en.wikipedia.org/wiki/Cultivar).

Let's extract the data from UCI machine learning repository:

> library(data.table)
> wine.data <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data')

> head(wine.data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1: 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
2: 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
3: 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
4: 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
5: 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
6: 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450

We have a total of 14 columns.

The first column, named V1, represents the cultivar:

> table(wine.data$V1)

1 2 3
59 71 48
>

This shows the distribution of wines across the three cultivars: 59, 71, and 48 wines, respectively.

Let's remove the cultivar and only retain the chemical properties of the wine:

> wine.type <- wine.data[,1]
> wine.features <- wine.data[,-1]

wine.features now has all the chemical properties, without the cultivar column.

Let's scale wine.features and convert it to a matrix:

> wine.features.scaled <- data.frame(scale(wine.features))
> wine.mat <- data.matrix(wine.features.scaled)

We have converted our data frame to a matrix.
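If you are curious about what scale actually did, the following quick check (using only the wine.features and wine.features.scaled objects created above) reproduces the first scaled column by hand. scale centers each column on its mean and divides by its standard deviation:

> # Reproduce the scaling of column V2 manually: subtract the mean, divide by the sd
> manual.v2 <- (wine.features$V2 - mean(wine.features$V2)) / sd(wine.features$V2)
> all.equal(manual.v2, wine.features.scaled$V2)
[1] TRUE

This puts all the chemical properties on a common footing, regardless of their original units.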

Let's add row names, giving each wine an integer number:

> rownames(wine.mat) <- seq(1:dim(wine.features.scaled)[1])
> wine.mat[1:2,]
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1.5143408 -0.5606682 0.2313998 -1.166303 1.90852151 0.8067217 1.0319081 -0.6577078 1.2214385 0.2510088
2 0.2455968 -0.4980086 -0.8256672 -2.483841 0.01809398 0.5670481 0.7315653 -0.8184106 -0.5431887 -0.2924962
V12 V13 V14
1 0.3611585 1.842721 1.0101594
2 0.4049085 1.110317 0.9625263
>

With our matrix ready, let's find the similarity between the wines.

We have numbered our rows representing the wine. The columns represent the properties of the wine.

We are going to use the Pearson correlation coefficient to find the similarities.

The Pearson coefficient measures the linear correlation between two variables x and y:

cor(x, y) = cov(x, y) / (sd(x) * sd(y))

Here cov() is the covariance, and it is divided by the product of the standard deviations of x and y.
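As a quick sanity check of this formula, you can compute the coefficient from its definition for any two numeric vectors and compare it with R's built-in cor function; x and y below are made-up example vectors, not part of the wine data:

> x <- c(1.2, 3.4, 2.2, 5.0)
> y <- c(0.8, 2.9, 2.5, 4.7)
> cov(x, y) / (sd(x) * sd(y))   # Pearson coefficient from its definition
> cor(x, y)                     # the same value from the built-in function

Both calls print the same number.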

In our case, we want to find the Pearson coefficient between the rows, that is, the similarity between two wines. Since cor computes correlations between the columns of a matrix, we will transpose our matrix before invoking the cor function.

Let's find the similarity matrix:

> wine.mat <- t(wine.mat)

> cor.matrix <- cor(wine.mat, use = "pairwise.complete.obs", method = "pearson")

> dim(cor.matrix)
[1] 178 178

> cor.matrix[1:5,1:5]
1 2 3 4 5
1 1.0000000 0.7494842 0.5066551 0.7244043066 0.1850897291
2 0.7494842 1.0000000 0.4041662 0.6896539740 -0.1066822182
3 0.5066551 0.4041662 1.0000000 0.5985843958 0.1520360593
4 0.7244043 0.6896540 0.5985844 1.0000000000 -0.0003942683
5 0.1850897 -0.1066822 0.1520361 -0.0003942683 1.0000000000

We transpose the wine.mat matrix and pass it to the cor function. Since each wine is now a column of the transposed matrix, the output is the similarity between the different wines.

The cor.matrix matrix is the similarity matrix, which shows how closely related the items are. The values range from -1, a perfect negative correlation where the attributes of two items move in opposite directions, to +1, a perfect positive correlation where the attributes move in the same direction. For example, in row 1 we see that wine 1 is more similar to wine 2 (0.749) than to wine 3 (0.507). The diagonal values are all +1, as each wine is being compared to itself.
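If you want to convince yourself that an entry of cor.matrix really is the correlation between two wines, you can recompute one entry directly. Remember that wine.mat has already been transposed, so each column is one wine:

> # Correlation between wine 1 and wine 2, computed directly from their feature columns
> cor(wine.mat[, 1], wine.mat[, 2])
[1] 0.7494842
> cor.matrix[1, 2]
[1] 0.7494842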

Let's do a small recommendation test:

> user.view <- wine.features.scaled[3,]

> user.view
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
3 0.1963252 0.02117152 1.106214 -0.2679823 0.08810981 0.8067217 1.212114 -0.497005 2.129959 0.2682629
V12 V13 V14
3 0.3174085 0.7863692 1.391224

Let's say a particular user is tasting or looking at the properties of wine 3. We want to recommend wines similar to wine 3.

Let's do the recommendation:

> sim.items <- cor.matrix[3,]

> sim.items
1 2 3 4 5 6 7 8 9
0.50665507 0.40416617 1.00000000 0.59858440 0.15203606 0.54025182 0.57579895 0.18210803 0.42398729
10 11 12 13 14 15 16 17 18
0.55472235 0.66895949 0.40555308 0.61365843 0.57899194 0.73254986 0.36166695 0.44423273 0.28583467
19 20 21 22 23 24 25 26 27
0.49034236 0.44071794 0.37793495 0.45685238 0.48065399 0.52503055 0.41103595 0.04497370 0.56095748
28 29 30 31 32 33 34 35 36
0.38265553 0.36399501 0.53896624 0.70081585 0.61082768 0.37118102 -0.08388356 0.41537403 0.57819928
37 38 39 40 41 42 43 44 45
0.33457904 0.50516170 0.34839907 0.34398394 0.52878458 0.17497055 0.63598084 0.10647749 0.54740222
46 47 48 49 50 51 52 53 54
-0.02744663 0.48876356 0.59627672 0.68698418 0.48261764 0.76062564 0.77192733 0.50767052 0.41555689.....

We look at the third row in our similarity matrix. We know that the similarity matrix has stored all the item similarities. So the third row gives us the similarity score between wine 3 and all the other wines. The preceding results are truncated.

We want to find the closest match:

> sim.items.sorted <- sort(sim.items, decreasing = TRUE)

> sim.items.sorted[1:5]
3 52 51 85 15
1.0000000 0.7719273 0.7606256 0.7475886 0.7325499
>

First, we sort row 3 in decreasing order, so the items closest to wine 3 come to the front, and then we pull out the top five matches. The first element is wine 3 itself, with a similarity score of 1.0, so we ignore it and recommend wines 52, 51, 85, and 15 to this user.
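If you want to reuse this lookup-and-sort logic for any wine, you could wrap it in a small helper function. recommend.wines below is a hypothetical name, not part of the original code, and it assumes cor.matrix has been built as shown earlier:

> # Hypothetical helper: return the n wines most similar to a given wine,
> # excluding the wine itself (which always has a similarity of 1.0)
> recommend.wines <- function(wine.id, n = 5) {
+   sim.scores <- cor.matrix[wine.id, ]       # similarity of wine.id to every wine
+   sim.scores <- sim.scores[-wine.id]        # drop the item we are searching for
+   sort(sim.scores, decreasing = TRUE)[1:n]  # top n closest matches
+ }
> recommend.wines(3, n = 4)
       52        51        85        15 
0.7719273 0.7606256 0.7475886 0.7325499 

The result matches the top matches we found manually, with wine 3 itself excluded.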

Let's look at the properties of wine 3 and the top five matches to confirm our recommendation:

> rbind(wine.data[3,]
+ ,wine.data[52,]
+ ,wine.data[51,]
+ ,wine.data[85,]
+ ,wine.data[15,]
+ )
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1: 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
2: 1 13.83 1.65 2.60 17.2 94 2.45 2.99 0.22 2.29 5.60 1.24 3.37 1265
3: 1 13.05 1.73 2.04 12.4 92 2.72 3.27 0.17 2.91 7.20 1.12 2.91 1150
4: 2 11.84 0.89 2.58 18.0 94 2.20 2.21 0.22 2.35 3.05 0.79 3.08 520
5: 1 14.38 1.87 2.38 12.0 102 3.30 3.64 0.29 2.96 7.50 1.20 3.00 1547
>

Great—you can see that the wine properties in our recommendation are close to the properties of wine 3.

Hopefully, this explains the concept of content-based recommendation. Without any other information about the user, we were able to make recommendations based solely on the item they were browsing.
