Matrix factorization pretty much means splitting a matrix into two parts that, when multiplied, give back the original matrix or a close approximation of it. Given a user-item matrix, we get a user-factor matrix and an item-factor matrix. Factors are a common set of features that characterize both the users and the items. Why do we want this for rating prediction or recommendation? The trick is in the word "similar". The resulting matrices, when multiplied, produce a matrix very similar to the original. So there is a transformation of the "training data" (our ratings matrix for a set of users and items) that lets us estimate a rating for entries that were previously zero. Interestingly, Singular Value Decomposition (SVD) captures the global structure of the training set. It's quite magical!
Notice that we are assuming that unrated items are zero. This (very strong) assumption is called imputation. Imputation is the filling of missing values (ratings) in the matrix so that we can then run the SVD, which is not mathematically defined when entries are missing. By sparse we mean many missing values, not necessarily many zeros, but, most of the time, our zeros do represent the missing values, so we've come to think of sparse as mostly zeros with only a few nonzero values. Going back: imputation can also mean replacing missing values with the row mean, the column mean, or both sequentially; it is not always replacement by zero. Why don't we use the means instead of zeros, then? Simply because computing the SVD of a dense matrix is far more expensive than computing a truncated SVD of a sparse one, and mean imputation makes the matrix dense. What we can do is use mean imputation as a baseline for evaluation, but more on that in k-fold cross-validation.
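To make the zero-imputation point concrete, here's a minimal sketch using SciPy's `svds` (truncated SVD), which works directly on a sparse matrix. The toy ratings matrix and the choice of `k` are my own illustrative assumptions, not anything canonical:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy user-item ratings; zeros stand in for missing ratings (zero imputation).
M = csr_matrix(np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
]))

# Truncated SVD with k latent factors; svds operates on the sparse matrix
# directly, which is why zero imputation is cheaper than mean imputation.
k = 2
U, s, Vt = svds(M, k=k)  # note: svds returns singular values in ascending order
print(U.shape, s.shape, Vt.shape)
```

Note that `svds` requires `k` to be smaller than both matrix dimensions, which is fine here since we only want a few latent factors anyway.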
We've been talking about factorizing a matrix into two new matrices that, multiplied, give you back a very similar matrix. However, in SVD, you could argue that there are in fact three matrices. The third matrix (the one in the middle of the formula, $\Sigma$) represents the relevance or weight of each feature in describing the two dimensions of the original matrix. By two dimensions we mean the users and the items, but in other contexts, such as information retrieval, users could be terms and items could be documents, applying the same idea to what is called Latent Semantic Indexing (LSI). In fact, if you have any doubts about latent factor models using SVD as your matrix factorization method, go check the information retrieval literature on LSI, namely Berry et al. (1995). They have been doing this since forever and have it really well explained (that's how I understood many details). Oh man, now I'm giving away all my secrets!
We got the two matrices and that third one in the middle. So, now what? Well, you just multiply the three matrices and get predicted ratings for all user-item combinations. With this, you can simply fill in the missing ratings (the imputed zeros) in the original matrix and do whatever you want with it. Maybe you want to recommend, right? Use the predicted ratings as scores and rank the results by them. Then just return the top-n unrated items. That's it, the whole basic idea behind rating prediction and recommendation using latent factor models!
Koren et al. (2009), the winners of the Netflix Prize, did a really good job of explaining in detail how to use latent factor models for rating prediction and recommendation. In fact, they have two figures that perfectly convey the difference between neighborhood methods and latent factor models. If you look at their Figure 2, it becomes quite clear that latent factor models position users and items in an n-dimensional space (in their example, n=2 dimensions), where we can easily recommend the items nearest to each user. This is not what I described above, but it also works. Why?
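One way to read that shared-space picture in code: split $\Sigma$ between the two sides so users and items live in the same k-dimensional space, then recommend the items whose vectors score highest against a user's vector. The matrix and the square-root split are my own illustrative choices, not taken from the paper:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

M = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
U, s, Vt = svds(csr_matrix(M), k=2)

# Split Sigma between the two sides so users and items share one latent space.
user_vecs = U * np.sqrt(s)     # one row per user
item_vecs = Vt.T * np.sqrt(s)  # one row per item

def nearest_items(user, n=2):
    # Dot products in the shared space equal the predicted ratings U Sigma V^T.
    scores = item_vecs @ user_vecs[user]
    return np.argsort(-scores)[:n]
```

With this split, the dot product between a user vector and an item vector is exactly the predicted rating, so "nearby items" and "highest-scored items" coincide, which is why both views of the model work.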
Latent means hidden, and the factors are the features that characterize users and items. So yes, we are characterizing users and items using the same set of features! The problem is that they are latent: even though we know their values, we don't know what they mean. If we were talking about music, for example, we could think of it the following way. A user likes songs that have [speed=0.8, genre=metal, sentiment=+0.6, popularity=0.9, ...]. At the same time, a song can have [speed=0.7, genre=metal, sentiment=+0.8, popularity=0.85, ...]. However, to find nearest neighbors, or to calculate probable ratings, we don't need to know what the features represent. What we lose in interpretability, we gain in global context. We could select a set of features and try to manually or automatically classify users and songs based on them, but we wouldn't know whether those are the best features to characterize our data. In our opinion, and for our taste or what we perceive of global taste, they are, but that's only the opinion of one. What latent factor models do is uncover these unknown features, which might well represent abstract concepts we cannot grasp, but which are representative of the factorized matrix and thus of the data as a whole. We can think of latent factors as something like [0.8, 1, +0.8, 0.9, ...], equivalent to the user we characterized before, but now the features are unlabeled and unknown, and "metal" is encoded as the value 1. As I said, remember that these features might not even map to anything we can easily recognize in reality.
Now, let's talk about the elephant in the room: scalability. Imagine your original matrix is huge and that factorizing it takes a long time and a lot of resources. First, do you want to wait that long for the factorization? And then, do you want to multiply two smaller, yet still quite large, matrices again? Well, you're gonna, but you might want to take a break while you're waiting. And I don't mean a coffee break, I mean a vacation or something!
Instead of factorizing the whole matrix, however, you can build your model incrementally (if by model you understood the three matrices resulting from the SVD, you got it right). But, José, you ask, how can I do this? Well, dear reader who is still here after many attempted jokes... You factorize the matrix for a subset of the users (whatever fits in memory, for example) and then iteratively "fold in" new users (or blocks of new users) by projecting them into the latent factor space and appending the result to the $U$ matrix. Note that the users in the first factorized matrix already had ratings (or imputed zeros for missing values) for all items in the collection, so iteratively added users cannot introduce new items (we are only adding user row vectors of matrix $M$ incrementally). With a similar approach, you can also add new items, but it must be done with a complete item vector containing the ratings of all existing users for that item (many of those ratings might be zero, or even all of them).
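The fold-in step can be sketched like this, assuming a small toy matrix of my own invention; the projection used is the standard LSI fold-in, $\hat{u} = u V \Sigma^{-1}$:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Factorize an initial block of users; all items must already be present.
M = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])
U, s, Vt = svds(csr_matrix(M), k=2)

# Fold-in: project a new user's rating vector (zeros for missing values)
# into the latent factor space and append it to U.
new_user = np.array([0.0, 1.0, 5.0, 4.0])
u_hat = new_user @ Vt.T @ np.diag(1.0 / s)  # u_hat = u V Sigma^-1
U = np.vstack([U, u_hat])
print(U.shape)  # one extra user row in U
```

Folding in a block of users is the same operation with a matrix of rating rows instead of a single vector.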
Below we provide a table that aggregates the most important formulas to implement SVD. One thing you have to remember is that after users and items are mapped into the latent factors space, any calculations that involve the factorized matrices must be done after projecting our original data into this new vector space.
In order to calculate the projection of a user vector into the latent factor space, we can use formula 2 from the table below. Then, you can either append this vector to matrix $U$, to build your model incrementally, or, with formula 4, use the projected user vector to query your model and obtain the predicted ratings for that user. In Latent Semantic Indexing, with terms and documents instead of users and items, and for a given query, we would predict the weight of all terms in relation to this query, thus giving a context to our query and enabling us to also search for the most related words in the universe of the collection.
| # | Description | Formula |
|---|---|---|
| 1 | Singular Value Decomposition (SVD) of a user-item matrix $M$. | $M = U \Sigma V^T$ |
| 2 | Projecting a vector $u$ of user ratings (of all items) into the latent factor space. | $\hat{u} = u V \Sigma^{-1}$ |
| 3 | Projecting a vector $i$ of item ratings (by all users) into the latent factor space. | $\hat{i} = i^T U \Sigma^{-1}$ |
| 4 | Querying the model to predict the ratings of a projected user $\hat{u}$. | $\hat{r} = \hat{u} \Sigma V^T$ |
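Formulas 2 and 4 from the table can be sketched numerically; the toy zero-imputed matrix below is my own illustrative assumption:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

M = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
U, s, Vt = svds(csr_matrix(M), k=2)  # formula 1: M ~= U Sigma V^T

# Formula 2: project a vector of user ratings into the latent factor space.
u = M[0]
u_hat = u @ Vt.T @ np.diag(1.0 / s)

# Formula 4: query the model with the projected user to predict its ratings.
r_hat = u_hat @ np.diag(s) @ Vt
print(np.round(r_hat, 2))
```

For a user already in the factorized matrix, this round trip simply reproduces that user's row of the rank-k reconstruction, which is a handy sanity check for an implementation.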