Designing the content-based recommendation engine

To restate our customer requirement in plain English: when a customer browses a particular article, what other articles should we suggest to them?

Let's quickly recap how a content-based recommendation engine works. When a user is browsing a product or item, we need to provide recommendations to the user in the form of other products or items from our catalog. We can use the properties of the items to come up with the recommendations. Let's translate this to our use case.

In our case, the items are news articles.

The properties of a news article are as follows:

Its content, stored in a text column
The publisher (who published the article)
The category to which the article belongs

So when a user is browsing a particular news article, we need to recommend other news articles, based on:

The text content of the article they are currently reading
The publisher of this article
The category to which this article belongs

We are going to introduce another feature. It is a calculated feature from the text field:

Polarity of the document. Subjective text tends to have an opinion about a topic. The opinion can be positive, negative, or neutral. A polarity score is a real number which captures these opinions. Polarity identification algorithms use text mining to get the document opinion. We are going to use one such algorithm to get the polarity of our texts.
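To make the idea of a polarity score concrete, here is a minimal sketch in base R. It is a toy word-counting scheme, not the polarity algorithm we will use later: the two word lists and the `polarity_score` function are made up for illustration, and the scaling to the range -1 to 1 is an assumption.

```r
# Tiny made-up lexicon (an assumption for illustration only);
# a real polarity algorithm relies on a much larger one.
positive_words <- c("good", "great", "gain", "win")
negative_words <- c("bad", "poor", "loss", "fail")

# Returns a score in [-1, 1]: positive for favorable text,
# negative for unfavorable text, and 0 when no opinion words occur.
polarity_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words != ""]
  pos <- sum(words %in% positive_words)
  neg <- sum(words %in% negative_words)
  if (pos + neg == 0) return(0)  # no opinion words: treat as neutral
  (pos - neg) / (pos + neg)
}

score_mixed    <- polarity_score("Great quarter: a big gain despite one bad week")
score_negative <- polarity_score("The match ended in a loss")
score_neutral  <- polarity_score("Markets opened today")
```

The mixed sentence contains two positive words and one negative word, so it scores above zero, while the sentence with no listed words is treated as neutral.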

For the wine example in the last section, we used the Pearson coefficient as the similarity measure. Unlike the wine example, this use case needs multiple similarity measures:

Cosine distance/similarity for comparing words in two documents
For the polarity, a Manhattan distance measure
For the publisher and category, Jaccard's distance
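As a quick preview of the three measures listed above, the following base R sketch defines each one using its standard formula. The function names, the toy term counts, and the placeholder publisher and category labels are assumptions made for illustration; they are not taken from our dataset.

```r
# Cosine similarity between two numeric term-count vectors:
# the dot product divided by the product of the vector lengths.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Manhattan distance: the sum of absolute differences.
# For two scalar polarity scores this is just |a - b|.
manhattan_distance <- function(a, b) {
  sum(abs(a - b))
}

# Jaccard distance between two sets: 1 - |intersection| / |union|.
jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Toy term counts over a shared four-word vocabulary (made-up values).
doc_a <- c(2, 1, 0, 1)
doc_b <- c(1, 1, 1, 0)

cos_ab <- cosine_similarity(doc_a, doc_b)
pol_ab <- manhattan_distance(0.6, -0.1)
jac_ab <- jaccard_distance(c("pub_a", "business"), c("pub_a", "tech"))
```

Note that cosine is a similarity (higher means closer), whereas Manhattan and Jaccard are distances (lower means closer), so they have to be read in opposite directions when we combine them.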

We will explain these distance measures as we program them in R. Hopefully this gave you a good overview of our problem statement. Let's move on to the design of our content-based recommendation engine.

We are going to design our content-based recommendation engine in three steps, as shown in the following diagram:

In step 1, we will create a similarity index. Think of this index as a matrix whose rows and columns are the articles and whose cell values store the similarity between the corresponding pair of articles. The similarity score is a value between zero and one. The diagonal values of this matrix will be one, since every article is identical to itself. A cell value of 1.0 indicates that, as far as the similarity measure can tell, the two articles are an exact replica of each other.

Let's look at an example matrix:

Article ID / Article ID	1	2	3	...	N
1	1	0.2	0	...	0.8
2	0.4	1	0	...	0
3	0.1	0	1	...	0.1
...	...	...	...	...	...
N	...	...	...	...	...

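Look at the first row: article 1 is closer to article N (similarity 0.8) than it is to article 2 (similarity 0.2). In our use case, the cell value will be the cosine similarity between two documents. The values above are purely illustrative; because cosine similarity is symmetric, the cell for articles (i, j) in our actual matrix will equal the cell for (j, i).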
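To see how such a matrix can be produced, here is a minimal base R sketch that builds a cosine similarity index from a small term-count matrix. The three articles, the four terms, and the counts are made up for illustration; the actual index will be built from our news articles in step 1.

```r
# Rows are articles, columns are terms (toy counts, made up).
term_counts <- matrix(
  c(2, 1, 0, 1,
    1, 1, 1, 0,
    0, 0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), c("market", "stock", "game", "team"))
)

# Scale every row to unit length. Dividing a matrix by a vector whose
# length equals the number of rows divides row i by element i.
row_norms <- sqrt(rowSums(term_counts^2))
unit_rows <- term_counts / row_norms

# The product of the scaled rows with their transpose gives the cosine
# similarity for every pair of articles.
similarity_index <- unit_rows %*% t(unit_rows)
```

Since term counts are never negative, every cell falls between zero and one, and the diagonal is one, as described above.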

In step 2, we have a simple search engine. Given an article, this engine first retrieves the top matches, in our case the top 30, that are closest to the given article according to the similarity matrix developed in the previous step. Say we are searching for article number 2; then row 2 of this matrix will be accessed. After sorting the contents of the row, the top 30 articles are returned as the matches. For those 30 articles, we then calculate one more feature: the polarity of each article. After that, we calculate the Manhattan distance between the polarity value of the given article and that of each article in our search results. Finally, we find the Jaccard's distance between the article we are searching for and each article in the search results, based on the publisher and category.
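The search step described above can be sketched in base R as follows. Everything here is hypothetical: the five-article similarity index, the `articles` data frame with its polarity, publisher, and category values, and the `search_articles` function. The sketch also assumes that the article being searched for is excluded from its own results and that article IDs match row positions.

```r
# Hypothetical similarity index for five articles (symmetric, diagonal of 1).
similarity_index <- matrix(
  c(1.0, 0.2, 0.0, 0.5, 0.8,
    0.2, 1.0, 0.3, 0.6, 0.1,
    0.0, 0.3, 1.0, 0.4, 0.7,
    0.5, 0.6, 0.4, 1.0, 0.9,
    0.8, 0.1, 0.7, 0.9, 1.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), as.character(1:5))
)

# Hypothetical article features; row i describes article i.
articles <- data.frame(
  id = 1:5,
  polarity = c(0.5, 0.1, -0.3, 0.2, 0.4),
  publisher = c("pub_a", "pub_b", "pub_a", "pub_b", "pub_c"),
  category = c("business", "business", "tech", "tech", "business"),
  stringsAsFactors = FALSE
)

jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

search_articles <- function(query_id, top_n = 30) {
  # Read the query's row, leaving out the query article itself.
  scores <- similarity_index[query_id, -query_id]
  top <- sort(scores, decreasing = TRUE)
  top <- top[seq_len(min(top_n, length(top)))]
  ids <- as.integer(names(top))

  query_meta <- c(articles$publisher[query_id], articles$category[query_id])
  data.frame(
    id = ids,
    cosine = unname(top),
    # Manhattan distance between two scalar polarity scores.
    polarity_diff = abs(articles$polarity[ids] - articles$polarity[query_id]),
    # Jaccard distance over the set {publisher, category}.
    jaccard = mapply(
      function(pub, ctg) jaccard_distance(query_meta, c(pub, ctg)),
      articles$publisher[ids], articles$category[ids],
      USE.NAMES = FALSE
    )
  )
}

# Search for article 2, keeping the top 3 instead of 30 for this toy data.
matches <- search_articles(2, top_n = 3)
```

The result holds one row per match, carrying the three scores that the ranking step needs.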

In step 3, we implement a fuzzy ranking engine. Using the similarity score from step 1, Jaccard's score, and the polarity scores, we use a fuzzy engine to rank the top 30 matches. The results are presented to the user in this ranked order.
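To give a feel for what a fuzzy ranking does, here is a deliberately simplified base R sketch. It is not the fuzzy engine we will build in step 3: the candidate scores, the membership functions and their break points, and the single rule are all assumptions chosen for illustration. Each score is turned into a degree of membership between zero and one, and the degrees are combined with the minimum, a common choice for the fuzzy AND.

```r
# Hypothetical candidates, as they might come out of the search step.
candidates <- data.frame(
  id = c(4, 3, 1),
  cosine = c(0.6, 0.3, 0.2),
  polarity_diff = c(0.1, 0.4, 0.4),
  jaccard = c(2/3, 1, 2/3)
)

# Linear ramp membership: 0 at or below `lo`, 1 at or above `hi`.
ramp_up <- function(x, lo, hi) pmin(pmax((x - lo) / (hi - lo), 0), 1)

# Degrees of membership in three fuzzy sets (break points are assumptions).
mu_similar   <- ramp_up(candidates$cosine, 0, 1)             # "text is similar"
mu_same_tone <- 1 - ramp_up(candidates$polarity_diff, 0, 1)  # "polarity is close"
mu_same_meta <- 1 - candidates$jaccard                       # "publisher/category match"

# Single rule: relevant IF similar AND same tone AND same metadata,
# with the minimum acting as the fuzzy AND.
candidates$relevance <- pmin(mu_similar, mu_same_tone, mu_same_meta)

# Present the matches in decreasing order of relevance.
ranked <- candidates[order(candidates$relevance, decreasing = TRUE), ]
```

With these toy numbers, article 1 moves ahead of article 3 even though its cosine similarity is lower, because article 3 shares neither publisher nor category with the article being searched for.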

Let's proceed to step 1, building a similarity index/matrix.
