Step 3 - Computing similarity

Using item-based collaborative filtering, we can compute how similar two movies are to each other. We follow these steps:

For every pair of movies (A, B), we find all the users who rated both A and B
Now, using the preceding ratings, we compute a Movie A vector, say X, and a Movie B vector, say Y
Then we calculate the correlation between X and Y
If a user watches movie C, we can then recommend the most correlated movies with it

We then compute the various vector metrics for each ratings vector X and Y, such as size, dot product, norm, and so on. We will use these metrics to compute the various similarity metrics between pairs of movies, that is, (A, B). For each movie pair (A, B), we then compute several measures such as cosine similarity, Jaccard similarity correlation, and regularized correlation. Let's get started. The first two steps are as follows:

// get num raters per movie, keyed on movie id 
val numRatersPerMovie = ratings 
    .groupBy(tup => tup._2) 
    .map(grouped => (grouped._1, grouped._2.size)) 
 
// join ratings with num raters on movie id 
val ratingsWithSize = ratings 
    .groupBy(tup => tup._2) 
    .join(numRatersPerMovie) 
    .flatMap(joined => { 
      joined._2._1.map(f => (f._1, f._2, f._3, joined._2._2)) 
    })

The ratingsWithSize variable now contains the following fields: user, movie, rating, and numRaters. The next step is to make a dummy copy of ratings for self-join. Technically, we join to userid and filter movie pairs so that we do not double-count and exclude self-pairs:

val ratings2 = ratingsWithSize.keyBy(tup => tup._1) 
val ratingPairs = 
    ratingsWithSize 
      .keyBy(tup => tup._1) 
      .join(ratings2) 
      .filter(f => f._2._1._2 < f._2._2._2)

Now let's compute the raw inputs to similarity metrics for each movie pair:

val vectorCalcs = ratingPairs 
      .map(data => { 
        val key = (data._2._1._2, data._2._2._2) 
        val stats = 
          (data._2._1._3 * data._2._2._3, // rating 1 * rating 2 
            data._2._1._3, // rating movie 1 
            data._2._2._3, // rating movie 2 
            math.pow(data._2._1._3, 2), // square of rating movie 1 
            math.pow(data._2._2._3, 2), // square of rating movie 2 
            data._2._1._4, // number of raters movie 1 
            data._2._2._4) // number of raters movie 2 
        (key, stats) 
      }) 
.groupByKey() 
.map(data => { 
    val key = data._1 
    val vals = data._2 
    val size = vals.size 
    val dotProduct = vals.map(f => f._1).sum 
    val ratingSum = vals.map(f => f._2).sum 
    val rating2Sum = vals.map(f => f._3).sum 
    val ratingSq = vals.map(f => f._4).sum 
    val rating2Sq = vals.map(f => f._5).sum 
    val numRaters = vals.map(f => f._6).max 
    val numRaters2 = vals.map(f => f._7).max 
        (key, (size, dotProduct, ratingSum, rating2Sum, ratingSq, rating2Sq, numRaters, numRaters2))})

Here are the third and the fourth steps for computing the similarity. We compute similarity metrics for each movie pair:

  val similarities = 
    vectorCalcs 
      .map(fields => { 
        val key = fields._1 
        val (size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq, numRaters, numRaters2) = fields._2 
        val corr = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq) 
        val regCorr = regularizedCorrelation(size, dotProduct, ratingSum, rating2Sum,ratingNormSq, rating2NormSq, PRIOR_COUNT, PRIOR_CORRELATION) 
        val cosSim = cosineSimilarity(dotProduct, scala.math.sqrt(ratingNormSq), scala.math.sqrt(rating2NormSq)) 
        val jaccard = jaccardSimilarity(size, numRaters, numRaters2) 
        (key, (corr, regCorr, cosSim, jaccard))})

Next is the implementation of the methods we just used. We start with the correlation() method for computing the correlation between the two vectors (A, B) as cov(A, B)/(stdDev(A) * stdDev(B)):

def correlation(size: Double, dotProduct: Double, ratingSum: Double, 
    rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double) = { 
    val numerator = size * dotProduct - ratingSum * rating2Sum 
    val denominator = scala.math.sqrt(size * ratingNormSq - ratingSum * ratingSum)  
                        scala.math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum) 
    numerator / denominator}

Now, the correlation is regularized by adding virtual pseudocounts over a prior, RegularizedCorrelation = w * ActualCorrelation + (1 - w) * PriorCorrelation where w = # actualPairs / (# actualPairs + # virtualPairs):

def regularizedCorrelation(size: Double, dotProduct: Double, ratingSum: Double, 
    rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double, 
    virtualCount: Double, priorCorrelation: Double) = { 
    val unregularizedCorrelation = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq) 
    val w = size / (size + virtualCount) 
    w * unregularizedCorrelation + (1 - w) * priorCorrelation 
  }

The cosine similarity between the two vectors A, B is dotProduct(A, B) / (norm(A) * norm(B)):

def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double) = { 
    dotProduct / (ratingNorm * rating2Norm) 
  }

Finally, the Jaccard Similarity between the two sets A, B is |Intersection (A, B)| / |Union (A, B)|:

def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double) = { 
    val union = totalUsers1 + totalUsers2 - usersInCommon 
    usersInCommon / union 
    }

Table of Contents for Step 3 - Computing similarity

Create new playlist

Sign In

Sign Up

Table of Contents for
Step 3 - Computing similarity