As we mentioned earlier, content-based filtering systems provide users with recommendations based on their past behavior as well as the characteristics of items that are positively rated or liked by the given user. We can also take into account the items that were disliked by the given user. An item is generally represented by several discrete attributes. These attributes are analogous to the input variables or features of a classification or linear regression based machine learning model.
For example, suppose we want to build a recommendation system that uses content-based filtering to recommend online products to its users. Each product can be characterized and identified by several known characteristics, and users can provide a rating for each characteristic of every product. The feature values of the products can have values between the 0 and 10, and the ratings provided by users for the products will have values within the range of 0 and 5. We can visualize the sample data for this recommendation system in a tabular representation, as follows:
In the preceding table, the system has products and users. Each product is defined by features, each of which will have a value in the range of 0 and 10, and each product is also rated by a user. Let the rating of each product by a user be represented as . Using the input values , or rather the input vector , and the rating of a user , we can estimate a parameter vector that we can use to to predict a user's rating. Thus, content-based filtering in fact applies a copy of linear regression to each user's rating and each product's feature values to estimate a regression model that can in turn be used to estimate the users rating for some unrated products. In effect, we learn the parameter using the independent variables and the dependent variable and for all the users the system. Using the estimated parameter and some given values for the independent variables, we can predict the value of the dependent variable for any given user. The optimization problem for content-based filtering can thus be expressed as follows:
The optimization problem defined in the preceding equation can be applied to all users of the system to produce the following optimization problem for U users:
In simple terms, the parameter vector tries to scale or transform the input variables to match the output variable of the model. The second term that is added is for regularization. Interestingly, the optimization problem defined in the receding equation is analogous to that of linear regression, and thus content-based filtering can be considered as an extension of linear regression.
The key issue with content-based filtering is whether a given recommendation system can learn from a user's preferences or ratings. Direct feedback can be used by asking for the rating of items in the system that they like, although these ratings can also be implied from a user's past behavior. Also, a content-based filtering system that is trained for a set of users and a specific category of items cannot be used to predict the same user's ratings for a different category of items. For example, it's a difficult problem to use a user's preference for news to predict the user's liking for online shopping products.