Building a recommender engine

As discussed in the previous section, collaborative filtering is a simple yet very effective approach for predicting and recommending items to users. If we look closely, the algorithms work on input data, which is nothing but a matrix representation of the user ratings for different products.

Bringing a mathematical perspective into the picture, matrix factorization is a technique to manipulate matrices and identify latent or hidden features from the data represented in the matrix. Building on the same concept, let us use matrix factorization as the basis for predicting ratings for items which the user has not yet rated.

Matrix factorization

Matrix factorization refers to the identification of two or more matrices such that when these matrices are multiplied we get the original matrix. Matrix factorization, as mentioned earlier, can be used to discover latent features between two different kinds of entities. We will understand and use the concepts of matrix factorization as we go along preparing our recommender engine for our e-commerce platform.
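As a toy illustration of the idea, a matrix can sometimes be rebuilt exactly from two much smaller factors:

$$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} \begin{pmatrix} 1 & 2 \end{pmatrix}$$

Here a 2 × 2 matrix is recovered by multiplying a 2 × 1 factor with a 1 × 2 factor; this is exactly the property we will exploit to fill in missing ratings.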

As our aim for the current project is to personalize the shopping experience and recommend product ratings for an e-commerce platform, our input data contains user ratings for various products on the website. We process the input data and transform it into a matrix representation for analyzing it using matrix factorization. The input data looks like this:

[Figure: User ratings matrix]

As you can see, the input data is a matrix with each row representing a particular user's ratings for the different items represented in the columns. For the current case, the columns represent different mobile phones such as the iPhone 4, iPhone 5s, Nexus 5, and so on, and each of the eight rows contains one user's ratings for these phones. The ratings range from 1 to 5, with 1 being the lowest and 5 being the highest; a rating of 0 represents an unrated item or a missing rating.

The task of our recommender engine will be to predict the correct rating for the missing ones in the input matrix. We could then use the predicted ratings to recommend items most desired by the user.

The premise here is that two users would rate a product similarly if they like similar features of the product or item. Since our current data is related to user ratings for different mobile phones, people might rate the phones based on their hardware configuration, price, OS, and so on. Hence, matrix factorization tries to identify these latent features to predict ratings for a certain user and a certain product.

While trying to identify these latent features, we proceed on the basic assumption that the number of such features is smaller than the total number of items in consideration. This assumption makes sense because if it did not hold, each user could have a latent feature associated exclusively with him/her (and similarly, each product could have its own exclusive feature). This would in turn make recommendations futile, as no user would be interested in items rated by other users (which is not usually the case).

Now let us get into the mathematical details of matrix factorization and our recommender engine.

Since we are dealing with user ratings for different products, let us assume $U$ represents the set of users and, similarly, $P$ represents the set of products for which we have ratings. The ratings matrix $R$ will then be of size $|U| \times |P|$.

Assuming the process helps us identify K latent features, our aim is to find two matrices X and Y such that their product (matrix multiplication) approximates R.

$$R \approx X \times Y^{T} = \hat{R}$$

where $X$ is a user-related matrix of size $|U| \times K$, representing the associations between the users and the latent features. $Y$, on the other hand, is a product-related matrix of size $|P| \times K$, representing the associations between the products and the latent features.

The task of predicting the rating $\hat{r}_{ij}$ of a product $p_j$ by a user $u_i$ is done by calculating the dot product of the vectors corresponding to $u_i$ (a row of $X$, that is, the user) and $p_j$ (a row of $Y$, that is, the product):

$$\hat{r}_{ij} = x_i^{T} y_j = \sum_{k=1}^{K} x_{ik}\, y_{kj}$$
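As a quick sketch of this in R, with toy dimensions (3 users, 4 products, K = 2 latent features) and randomly initialized factors rather than the real dataset:

set.seed(42)

# toy user-feature matrix, |U| x K
X <- matrix(runif(3 * 2), nrow = 3)

# toy feature-product matrix, K x |P| (that is, already transposed)
Y <- matrix(runif(2 * 4), nrow = 2)

# predicted rating of product 4 by user 1 is a dot product
r_hat_14 <- as.numeric(X[1, ] %*% Y[, 4])

# the full predicted ratings matrix
R_hat <- X %*% Y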

Now, to find the matrices X and Y, we utilize a technique called gradient descent. Gradient descent, in simple terms, tries to find the local minimum of a function; it is an optimization technique. We use gradient descent in the current context to iteratively minimize the difference between the predicted ratings and the actual ratings. To begin with, we randomly initialize the matrices X and Y and then calculate how different their product is from the actual ratings matrix R.

The difference between the predicted and the actual values is what is termed the error. For our problem, we will consider the squared error, which is calculated as:

$$e_{ij}^{2} = \left(r_{ij} - \hat{r}_{ij}\right)^{2} = \left(r_{ij} - \sum_{k=1}^{K} x_{ik}\, y_{kj}\right)^{2}$$

where $r_{ij}$ is the actual rating by user $i$ for product $j$, and $\hat{r}_{ij}$ is the predicted value of the same.

To minimize the error, we need to find the correct direction, or gradient, in which to change our values. To obtain the gradient for each of the variables $x_{ik}$ and $y_{kj}$, we differentiate the squared error with respect to each of them separately:

$$\frac{\partial e_{ij}^{2}}{\partial x_{ik}} = -2\, e_{ij}\, y_{kj}, \qquad \frac{\partial e_{ij}^{2}}{\partial y_{kj}} = -2\, e_{ij}\, x_{ik}$$

Hence, the update equations for $x_{ik}$ and $y_{kj}$ can be given as:

$$x'_{ik} = x_{ik} + 2\alpha\, e_{ij}\, y_{kj}, \qquad y'_{kj} = y_{kj} + 2\alpha\, e_{ij}\, x_{ik}$$

where $\alpha$ is a constant denoting the rate of descent, that is, the rate of approaching the minima (also known as the learning rate). The value of $\alpha$ defines the size of the steps we take in either direction to reach the minima. Large values may lead to oscillations, as we may overshoot the minima on every step. The usual practice is to select very small values for $\alpha$, of the order of $10^{-4}$. $x'_{ik}$ and $y'_{kj}$ are the updated values of $x_{ik}$ and $y_{kj}$ after each iteration of gradient descent.
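To get a feel for the step size, take some made-up illustrative values: $x_{ik} = 0.40$, $e_{ij} = 1.5$, $y_{kj} = 0.80$, and $\alpha = 0.0002$. Then:

$$x'_{ik} = 0.40 + 2 \times 0.0002 \times 1.5 \times 0.80 = 0.40048$$

Each iteration nudges the value only slightly toward the minimum, which is why gradient descent needs many iterations to converge.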

As seen in Chapter 2, Let's Help Machines Learn, machine learning algorithms can suffer from overfitting. To avoid overfitting, along with controlling extreme or large values in the matrices X and Y, we introduce the concept of regularization. Formally, regularization refers to the process of introducing additional information in order to prevent overfitting. Regularization penalizes models with extreme values.

To prevent overfitting in our case, we introduce the regularization constant called β. With the introduction of β, the equations are updated as follows:

$$e_{ij}^{2} = \left(r_{ij} - \sum_{k=1}^{K} x_{ik}\, y_{kj}\right)^{2} + \frac{\beta}{2} \sum_{k=1}^{K} \left(x_{ik}^{2} + y_{kj}^{2}\right)$$

Also, the update equations become:

$$x'_{ik} = x_{ik} + \alpha \left(2\, e_{ij}\, y_{kj} - \beta\, x_{ik}\right)$$
$$y'_{kj} = y_{kj} + \alpha \left(2\, e_{ij}\, x_{ik} - \beta\, y_{kj}\right)$$

As we already have the ratings matrix R, and we use it to determine how far our predicted values are from the actual ones, matrix factorization turns into a supervised learning problem. For this supervised problem, just as we saw in Chapter 2, Let's Help Machines Learn, we use some of the rows as our training samples. Let $S$ be our training set, with elements being tuples of the form $(u_i, p_j, r_{ij})$. Our task is thus to minimize the error $e_{ij}$ for every tuple $(u_i, p_j, r_{ij}) \in S$.

The overall error (say E) can be calculated as:

$$E = \sum_{(u_i, p_j, r_{ij}) \in S} e_{ij}^{2} = \sum_{(u_i, p_j, r_{ij}) \in S} \left[ \left(r_{ij} - \sum_{k=1}^{K} x_{ik}\, y_{kj}\right)^{2} + \frac{\beta}{2} \sum_{k=1}^{K} \left(x_{ik}^{2} + y_{kj}^{2}\right) \right]$$

Implementation

Now that we have looked into the mathematics of matrix factorization, let us convert the algorithm into code and prepare a recommender engine for the mobile phone ratings input data set discussed earlier.

As shown in the Matrix factorization section, the input dataset is a matrix with each row representing a user's rating for the products mentioned as columns. The ratings range from 1 to 5 with 0 representing the missing values.

To transform our algorithm into working code, we need to complete the following tasks:

  • Load the input data and transform it into ratings matrix representation
  • Prepare a matrix factorization based recommendation model
  • Predict and recommend products to the users
  • Interpret and evaluate the model

Loading and transforming the input data into a matrix representation is simple. As seen earlier, R provides us with easy-to-use utility functions for the same.

# load raw ratings from csv
raw_ratings <- read.csv(<file_name>)

# convert columnar data to sparse ratings matrix
ratings_matrix <- data.matrix(raw_ratings)
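As a quick sanity check (assuming the product_ratings.csv dataset used in the sample run later in this section), we can verify that the matrix has the expected shape:

# quick sanity check on the ratings matrix
dim(ratings_matrix)    # should be 8 (users) x 7 (phones)
head(ratings_matrix)   # peek at the first few users' ratings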

Now that we have our data loaded into an R matrix, we proceed and prepare the user-latent features matrix X and item-latent features matrix Y. We initialize both from uniform distributions using the runif function.

# number of rows in ratings
rows <- nrow(ratings_matrix)

# number of columns in ratings matrix
columns <- ncol(ratings_matrix)

# latent features
K <- 2

# User-Feature Matrix
X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)

# Item-Feature Matrix
Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)

The major component is the matrix factorization function itself. Let us split the task into two parts: the calculation of the gradient and, subsequently, the overall error.

The calculation of the gradient involves the ratings matrix R and the two factor matrices X and Y, along with the constants α and β. Since we are dealing with matrix manipulations (specifically, multiplication), we transpose Y before we begin with any further calculations, as sketched below. The loops that follow convert the algorithm discussed previously into R syntax; all variables follow a naming convention similar to the algorithm's for ease of understanding.
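A one-line sketch of that transposition; after it, Y is a K x |P| matrix, so the Y[k, j] indexing in the loops below lines up with the algorithm:

# transpose the item-feature matrix: rows now correspond to latent features
Y <- t(Y)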

for (i in seq(nrow(ratings_matrix))){

  for (j in seq(length(ratings_matrix[i, ]))){

    if (ratings_matrix[i, j] > 0){

      # prediction error for the current user-item pair
      eij = ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j])

      # gradient-based updates for each latent feature
      for (k in seq(K)){
        X[i, k] = X[i, k] + alpha * (2 * eij * Y[k, j]
                                     - beta * X[i, k])

        Y[k, j] = Y[k, j] + alpha * (2 * eij * X[i, k]
                                     - beta * Y[k, j])
      }
    }
  }
}

The next part of the algorithm is to calculate the overall error; we again use similar variable names for consistency:

# Overall Squared Error Calculation

e = 0

for (i in seq(nrow(ratings_matrix))){

  for (j in seq(length(ratings_matrix[i, ]))){

    if (ratings_matrix[i, j] > 0){

      # squared difference between actual and predicted rating
      e = e + (ratings_matrix[i, j] -
               as.numeric(X[i, ] %*% Y[, j]))^2

      # regularization penalty on the latent feature values
      for (k in seq(K)){
        e = e + (beta/2) * (X[i, k]^2 + Y[k, j]^2)
      }
    }
  }
}

As a final piece, we iterate over these calculations multiple times so that the estimates converge in spite of the sparsity of the ratings matrix. We term the variable controlling the number of iterations epoch. We also terminate the calculations early once the overall error drops below a certain threshold.

Moreover, as we had initialized X and Y from uniform distributions, the predicted values would be real numbers. We round the final output before returning the predicted matrix.

Note that this is a very simplistic implementation and a lot of complexity has been kept out for ease of understanding. Hence, this may result in the predicted matrix containing values greater than 5. For the current scenario, it is safe to assume the values above the max scale of 5 are equivalent to 5 (and similarly for values less than 0). We encourage the reader to fine tune the code to handle such cases.
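Putting the pieces together, the following is one possible sketch of the mf_based_ucf() function invoked in the sample run below. The signature matches that call; the defaults for alpha, beta, and the error threshold, as well as the exact loop structure, are assumptions for illustration rather than the book's definitive implementation:

mf_based_ucf <- function(ratings_matrix, X, Y, K, epoch = 10000,
                         alpha = 0.0002, beta = 0.02,
                         threshold = 0.001) {

  # transpose the item-feature matrix for Y[k, j] style indexing
  Y <- t(Y)

  for (step in seq(epoch)) {

    # gradient descent updates, as derived earlier
    for (i in seq(nrow(ratings_matrix))) {
      for (j in seq(length(ratings_matrix[i, ]))) {
        if (ratings_matrix[i, j] > 0) {
          eij <- ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j])
          for (k in seq(K)) {
            X[i, k] <- X[i, k] + alpha * (2 * eij * Y[k, j]
                                          - beta * X[i, k])
            Y[k, j] <- Y[k, j] + alpha * (2 * eij * X[i, k]
                                          - beta * Y[k, j])
          }
        }
      }
    }

    # overall regularized squared error, as shown earlier
    e <- 0
    for (i in seq(nrow(ratings_matrix))) {
      for (j in seq(length(ratings_matrix[i, ]))) {
        if (ratings_matrix[i, j] > 0) {
          e <- e + (ratings_matrix[i, j] -
                    as.numeric(X[i, ] %*% Y[, j]))^2
          for (k in seq(K)) {
            e <- e + (beta / 2) * (X[i, k]^2 + Y[k, j]^2)
          }
        }
      }
    }

    # stop early once the overall error is small enough
    if (e < threshold) {
      break
    }
  }

  # round the real-valued predictions before returning
  round(X %*% Y)
}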

Setting α to 0.0002, β to 0.02, K (that is, the number of latent features) to 2, and epoch to 10,000, let us see a sample run of our code with the overall error threshold set to 0.001:

# load raw ratings from csv
raw_ratings <- read.csv("product_ratings.csv")

# convert columnar data to sparse ratings matrix
ratings_matrix <- data.matrix(raw_ratings)


# number of rows in ratings
rows <- nrow(ratings_matrix)

# number of columns in ratings matrix
columns <- ncol(ratings_matrix)

# latent features
K <- 2

# User-Feature Matrix
X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)

# Item-Feature Matrix
Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)

# iterations
epoch <- 10000

# rate of descent
alpha <- 0.0002

# regularization constant
beta <- 0.02


pred.matrix <- mf_based_ucf(ratings_matrix, X, Y, K, epoch = epoch)

# setting column names
colnames(pred.matrix) <- c("iPhone.4", "iPhone.5s", "Nexus.5", "Moto.X",
                           "Moto.G", "Nexus.6", "One.Plus.One")

The preceding lines of code utilize the functions explained earlier to prepare the recommendation model. The predicted ratings or the output matrix looks like the following:

[Figure: Predicted ratings matrix]

Result interpretation

Let us do a quick visual inspection to see how good or bad our predictions have been. Consider users 1 and 3 as our training samples. From the input dataset, we can clearly see that user 1 has given high ratings to iPhones while user 3 has done the same for Android based phones. The following side by side comparison shows that our algorithm has predicted values close enough to the actual values:

[Figure: Ratings by user 1]

Let us see the ratings of user 3 in the following screenshot:

[Figure: Ratings by user 3]
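Since the screenshots are not reproduced here, a quick way to eyeball the same comparison in R, reusing the objects from the sample run:

# actual vs. predicted ratings, user 1
rbind(actual = ratings_matrix[1, ], predicted = pred.matrix[1, ])

# actual vs. predicted ratings, user 3
rbind(actual = ratings_matrix[3, ], predicted = pred.matrix[3, ])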

Now that we have our ratings matrix with updated values, we are ready to recommend products to users. It is common sense to show only the products which the user hasn't rated yet. The right set of recommendations will also enable the seller to pitch the products which have high probability of being purchased by the user.

The usual practice is to return a list of the top N items from the unrated list of products for each user. The user in consideration is usually termed the active user. Let us consider user 6 as our active user. This user has only rated the Nexus 6, One Plus One, Nexus 5, and iPhone 4, in that order of rating; that is, the Nexus 6 was rated highest and the iPhone 4 the lowest. Getting a list of the top two recommended phones for such a customer using our algorithm, as sketched below, would result in Moto X and Moto G (very rightly indeed; do you see why?).
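A small sketch of how such a top-N lookup might work, reusing the objects from the sample run; the helper name and its exact logic are illustrative assumptions, not part of the original code:

# hypothetical helper: top N as-yet-unrated items for an active user
top_n_recommendations <- function(pred.matrix, ratings_matrix,
                                  active_user, N = 2) {
  # consider only the items the user has not rated yet (rating == 0)
  unrated <- which(ratings_matrix[active_user, ] == 0)
  predictions <- pred.matrix[active_user, unrated]

  # return the names of the N items with the highest predicted ratings
  names(sort(predictions, decreasing = TRUE))[seq_len(min(N, length(predictions)))]
}

# Top 2 recommendations for our active user (user 6)
top_n_recommendations(pred.matrix, ratings_matrix, active_user = 6)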

Thus, we built a recommender engine smart enough to recommend the right mobile phones to an Android fanboy and saved the world from yet another catastrophe!

Data to the rescue!

This simple implementation of a recommender engine using matrix factorization gave us a flavor of how such a system actually works. Next, let us get into some real world action using recommender engines.
