Data science is hot. Really hot ^{1}.

What can you do with data science? This post deals with one particular application, called collaborative filtering. It’s called collaborative filtering because it is a way of *filtering*, or processing, data from multiple sources (which are collaborating together) to do something useful. Doesn’t make sense to you? Me neither. For now, just think of collaborative filtering as a way of generating recommendations for users.

The concept of a recommender system was first brought to widespread attention by the announcement of the Netflix Prize. In 2006, Netflix announced a competition to produce a better movie rating prediction system than their current one, at the time called Cinematch. The idea is that Netflix would like to predict how well its users might like individual movies, so that it could recommend movies to them. The more accurately they would be able to predict users ratings of movies, the better for their business. The grand prize was $1 million.

Data science competitions have become quite popular now. They are a way for people to learn about and demonstrate (sometimes) their proficiency in the field. The website http://kaggle.com is a prime example of this.

Back to the Netflix Prize. Each movie that is rated by a user is given a 1 to 5 star rating, 5 being the best. One way to measure the accuracy of a movie rating prediction is to compute the Root-Mean-Square-Error, or RMSE for predictions. That means you add up the square of the errors of all its predictions, and take the square root. This is a common *metric, *or yardstick (meter-stick?*) *for errors. Cinematch had an RMSE of 0.9514 on the test data, which means that, on average, it predicted within 0.9514 stars of a users actual rating. The contest was to see (and award $1 million) to a team that could improve that accuracy by at least 10%, meaning an RMSE of 0.8572 or better.

The actual contest data looked like this. There are 480,189 users. There are 17,770 movies. If everyone rated every movie, there would be 480,189 x 17,770 ratings, but of course, not everyone watched or rated every movie. They give you 100,480,507 ratings (1-5 each), which is about 1% of all the possible ratings. The idea is that they use about 1.4 million ratings (the test data) that have NOT been provided to you to determine the winners – you have to provide your estimates for those 1.4 million entries, and the RMSE of your estimates provides your final score. There were some other details, like they had a separate 1.4 million ratings (also not provided, the qualifying data) that people could use to submit interim estimates and get scores and be posted on the leaderboard. This was to prevent people from gaming the system by constantly getting a score on the qualifying data to determine what some of the true answers were.

How would you go about solving this problem? At first glance, it might seem impossible. How could you predict how much someone would like a movie that they have never seen or rated before?

However, let’s discuss some ways that we could say things about how one viewer, let’s call her Mary, would rate a particular movie, say The *Call of the Wild*. That is, what should the (Mary, *Call of the Wild*) pair be rated?

First, some notation. Let \(A\) be the matrix of ratings, with 0 for no rating, otherwise 1-5 for the number of stars. Each row represents a single user, and each column represents a single movie. Let \(a_{ij}\) be the entry for user \(i\) and movie \(j\). Let there be \(m\) users and \(n\) movies. Also, it will be convenient to use an indicator variable \(\delta_{ij}\) to indicate if a (user, movie) pair has a rating:

\begin{align}

A=&\left[ a_{ij} \right] \\

\delta_{ij} =& \begin{cases} 1 & \text{user i rated movie j} \\

0 & \text{otherwise}

\end{cases}

\end{align}

Let’s start with the simplest possible system. If we didn’t know anything, what should we try to guess for that pair? We might guess the average rating that all users gave to all movies.

That would be

\[

\text{average rating}=\frac{\sum\limits_{i=1}^m \sum\limits_{j=1}^n a_{ij}} {\sum\limits_{i=1}^m \sum\limits_{j=1}^n \delta_{ij} }

\]

But maybe we can do better. If Mary rated every movie she saw as 5 stars, we might predict the (Mary, Call of the Wild) pair as 5 stars. Likewise, if everyone who saw the the Call of the Wild rated it 5 stars, we might predict that pair as 5 stars as well. The equation for predicting each movie’s rating as its average rating is:

\[

\text{average rating for movie k}=M_k=\frac{\sum\limits_{i=1}^m a_{ik}} {\sum\limits_{i=1}^m \delta_{ik} }

\]

It turns out that if you predict each rating as the average of each movie’s rating, you will get an RSME of 1.0540

But consider this. Suppose Mary liked movie A, B, and C and hated D. And Joe also liked A, B, and C, but hated D. Maybe Mary would like E the same as Joe would. In table form, suppose the data were:

A | B | C | D | E |

5 | 4 | 5 | 1 | |

5 | 4 | 5 | 1 | 2 |

2 | 2 | 2 | 5 |

Or, in our matrix format

\[

A=

\begin{bmatrix}

5 & 4 & 5 & 1 & 0\\

5 & 4 & 5 & 1 & 2\\

2 & 2 & 2 & 5 & 0\\

\end{bmatrix}

\]

where we have set unrated entries to 0.

This would allow us to get even better. There are a number of ad hoc ways that this type of reasoning can be performed, and we will visit some of those methods in a future post.

For now, we would like to see how a more general approach using mathematics could be developed.

Singular Value Decomposition is a method a decomposing, or describing, a matrix

\[A=USV’\]

Where \(U, V\) have orthonormal columns and S is diagonal with non-negative entries. The way to think of this decomposition is that the rows \(U\) tell us how much each user likes a particular “feature” and the columns of \(V\) tell us how each movie has each “feature”. The weights in the diagonal of \(S\) give us the overall importance of those features. What are these “features” you ask. We don’t know – it’s what the SVD has determined from the data. To get a feel for how this works, imagine that the features are action, good acting, famous actors, and so on. You can see how different movies would have different amounts of these features, and how different viewers would weight these characteristics differently. The beauty of this approach is we do not have to try to determine what the features are – the SVD does it for us.

A key fact that we will use: The best k-rank approximation to a matrix A can be found by

\[

A=U_aS_aV_a’

\]

where the a subscripted variables only include the k-largest singular values. This k-rank approximation is best in the Frobenius norm sense , which should remind us of an RMSE approximation.

However, we have a small problem. If we choose a model \(A_m=USV’\) for our ratings, and we had ratings at *every position,
*then having a model close to our A would be good. Unfortunately, there are a number of 0’s in the A matrix (for unrated entries), and we want to minimize the RMSE (we have left off a constant which divides the total squared error by the number of entries, but that does not affect the optimization)

\[

E=\sum_i^n\sum_j^m\left(A-USV’\right)_{ij}^2\delta_{ij}

\]

where \(\delta_{ij}\) is 1 if there is a rating at position \((i,j)\), and 0 otherwise.

But it seems we are close. See my post on how to solve complex problems [post ref here].

Let us define the SVD k-rank approximation algorithm as

\[

\operatorname{SVD}_k[A] \triangleq U_kS_kV_k’ \quad \text{where } S_k \text{ has been reduced to k-largest singular values}

\]

Then consider the following algorithm to generate a series of linear models \(X^{(k)}\)

\begin{align}

X^{(0)} =& \operatorname{SVD}_k \\

X^{(n+1)} =& \operatorname{SVD}_k \left[ \begin{cases} A & \text{where }a_{ij} > 0 \\

X^{(n)} & \text{elsewhere}

\end{cases}

\right]

\end{align}

Let’s analyze what is happening here. If one of the models \(X^{(n)}\) fit the data \(A\) exactly, then the next model \(X^{(n+1)}\) would be the same – the \(X^{(n)}\) would be a **fixed point** of the algorithm.

The convergence of this algorithm is not guaranteed theoretically, but in practice it converges.

**A Big Matrix**

One problem you will immediately run into if you try the above solution is that \(A\) has about 8,500,000,000 entries. Most SVD solvers won’t like this. However, all is not lost. There is a method, called the Arnoldi method, for computing SVD (eigenvalues, really, but that helps us do an SVD), that does not require the actual matrix A, but instead only requires computing \(Av_i\) for a (small) series of vectors \(v_i\). That is, the Arnoldi method gives you a \(v_i\) and you give it back \(Av_i\), never exposing the full matrix \(A\). Reliable code to do this is ARPACK and has been (and can be) easily integrated into various languages, like C/C++ or Python.

Notes:

- Harvard Business Review, Data Science: The Sexiest Job of the 21st Century, October 2012 ↩