Explicit vs. Implicit ratings
There are two ways to gather user preference data for recommending items:

- Explicit ratings: ask the user to rate items on a rating scale (such as rating a movie from one to five stars). The drawback is that this puts the burden of data collection on the user, who may not want to take the time to enter ratings.
- Implicit feedback: signals such as clicks, purchases, or viewing time are easy to collect in large quantities without any extra effort on the user's part. Unfortunately, implicit data is much more challenging to work with.
The two main recommendation approaches are collaborative filtering (CF) and content-based filtering (CBF). The primary difference between them is that CF looks for similar users to recommend items, while CBF looks for similar content to recommend items.
Content-Based Filtering
- Recommend items similar to the items previously liked by the user
The rating of an item is predicted from a model of the user, which can be built in two different ways:

1. Parametric models such as linear regression (for multi-level ratings) or logistic regression (for binary ratings), trained on the user's previous ratings.
2. Similarity-based techniques that use distance measures over item features to find items similar to those the user has liked (see the sketch below).
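To make the similarity-based option concrete, here is a minimal sketch using cosine similarity over item feature vectors. The feature matrix, the liked-item indices, and the `recommend_content_based` helper are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between vector a and each row of matrix b."""
    a_norm = a / (np.linalg.norm(a) + 1e-12)
    b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return b_norm @ a_norm

def recommend_content_based(item_features, liked_items, top_n=5):
    """Score every item by its similarity to a profile built from the liked items."""
    profile = item_features[liked_items].mean(axis=0)   # mean feature vector of liked items
    scores = cosine_sim(profile, item_features)
    scores[liked_items] = -np.inf                       # never re-recommend known items
    return np.argsort(scores)[::-1][:top_n]

# Toy data: 6 items described by 4 binary genre features (hypothetical).
item_features = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)
print(recommend_content_based(item_features, liked_items=[0, 1], top_n=3))
```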
Advantages and Disadvantages
CB can be applied even before a strong user base has been built, since it depends only on item metadata (features), and therefore it does not suffer from the item cold-start problem.
However, it is computationally intensive, because the similarity between every user's profile and every item must be computed.
Since recommendations are based on similarity to items the user already knows about, there is little room for serendipity, which leads to over-specialisation.
CB also ignores an item's popularity and other users' feedback.
Collaborative filtering (CF)
Collaborative filtering aggregates the past behaviour of all users.
It recommends items to a user based on the items liked by another set of users whose likes (and dislikes) are similar to those of the user under consideration. This approach is also called user-user CF.
Item-item CF became popular later: to recommend an item to a user, the similarity between the items liked by that user and other items is calculated.
Both user-user CF and item-item CF can be implemented in two different ways: memory-based (the neighbourhood approach) and model-based (the latent factor approach).
1. The memory-based approach
Neighbourhood approaches are most effective at detecting very localised relationships (neighbours), ignoring other users.
The downsides are that, first, rating data is typically sparse, which hinders scalability, and second, they perform worse in terms of RMSE (root-mean-squared error) than more complex model-based methods.
User-based Filtering and Item-based Filtering are the two ways to approach memory-based collaborative filtering.
**User-based Filtering:**
- First, a set of users whose likes and dislikes are similar to user u1 is found using a similarity metric, which captures the intuition that sim(u1, u2) > sim(u1, u3) when users u1 and u2 are similar and users u1 and u3 are dissimilar. This set of similar users is called the neighbourhood of user u1. A rating for an item is then predicted from the ratings the neighbours gave that item (see the sketch below).
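A minimal sketch of this user-based prediction, assuming a small dense user x item rating matrix in which 0 means "not rated"; the matrix values, the neighbourhood size k, and the `predict_user_based` helper are illustrative assumptions.

```python
import numpy as np

def predict_user_based(ratings, user, item, k=2):
    """Predict ratings[user, item] from the k most similar users who rated the item."""
    target = ratings[user]
    sims = []
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        # Cosine similarity over the items both users have rated.
        both = (target > 0) & (ratings[other] > 0)
        if not both.any():
            continue
        a, b = target[both], ratings[other][both]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        sims.append((sim, ratings[other, item]))
    # The neighbourhood: the k raters of this item most similar to the target user.
    neighbours = sorted(sims, reverse=True)[:k]
    if not neighbours:
        return None
    weights = np.array([s for s, _ in neighbours])
    values = np.array([r for _, r in neighbours])
    return float(weights @ values / weights.sum())   # similarity-weighted average

ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
print(predict_user_based(ratings, user=0, item=2))   # predicted rating for user 0, item 2
```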
**Item-based Filtering:**
* To recommend items to user u1 in the item-item neighbourhood approach, the similarity between the items u1 has liked and the other items is calculated, and the most similar unseen items are recommended (a sketch follows below).
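For comparison, a minimal item-based sketch over the same kind of rating matrix: unseen items are scored by their similarity to the items the user has already rated, weighted by those ratings. The matrix and the `recommend_item_based` helper are again illustrative assumptions.

```python
import numpy as np

def recommend_item_based(ratings, user, top_n=2):
    """Rank unrated items for `user` by similarity to the items they rated."""
    # Item-item cosine similarity computed from the user x item rating matrix.
    norms = np.linalg.norm(ratings, axis=0, keepdims=True) + 1e-12
    item_sim = (ratings / norms).T @ (ratings / norms)
    rated = ratings[user] > 0
    # Weighted sum of similarities to the user's rated items.
    scores = item_sim[:, rated] @ ratings[user, rated]
    scores[rated] = -np.inf                 # exclude already-rated items
    return np.argsort(scores)[::-1][:top_n]

ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
print(recommend_item_based(ratings, user=1))
```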
2. The model-based approach
Latent factor model-based collaborative filtering learns (latent) user and item profiles, both of dimension K, through matrix factorisation, by minimising the RMSE (root-mean-squared error) between the available ratings y and their predicted values ŷ.
Here each item i is associated with a latent (feature) vector x_i, each user u is associated with a latent (profile) vector θ_u, and the predicted rating is the inner product of the two: ŷ_ui = θ_uᵀ x_i.
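Below is a minimal sketch of this factorisation via stochastic gradient descent on the squared error, reusing the same 0-means-missing rating matrix as in the earlier sketches; the dimensionality K, the learning rate, and the regularisation strength are illustrative assumptions.

```python
import numpy as np

def factorise(ratings, K=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user profiles theta (n_users x K) and item vectors x (n_items x K) by SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    theta = rng.normal(scale=0.1, size=(n_users, K))   # latent user profiles
    x = rng.normal(scale=0.1, size=(n_items, K))       # latent item feature vectors
    observed = [(u, i) for u in range(n_users) for i in range(n_items) if ratings[u, i] > 0]
    for _ in range(steps):
        for u, i in observed:
            err = ratings[u, i] - theta[u] @ x[i]       # y - y_hat for this observed rating
            grad_theta = err * x[i] - reg * theta[u]    # gradient step with L2 regularisation
            grad_x = err * theta[u] - reg * x[i]
            theta[u] += lr * grad_theta
            x[i] += lr * grad_x
    return theta, x

ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
theta, x = factorise(ratings)
pred = theta @ x.T                                      # y_hat for every (user, item) pair
mask = ratings > 0
print("RMSE on observed ratings:", np.sqrt(np.mean((ratings[mask] - pred[mask]) ** 2)))
```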
Advantages and Disadvantages
Latent factor methods deliver prediction accuracy superior to other published CF techniques.
They also address the sparsity issue that neighbourhood-based CF models struggle with.
Their memory efficiency and ease of implementation via gradient-based matrix factorisation (commonly referred to as SVD in this context) made them the method of choice during the Netflix Prize competition.
However, latent factor models are effective at estimating overall structure that relates all items simultaneously, but they are poor at detecting strong associations among small sets of closely related items, which is precisely where neighbourhood methods do well.