Netflix Prize

I have registered for the Netflix Prize contest. This was my first act of 2007. There are already a number of teams competing and it looks like the Progress Award will be taken by somebody this year.

Ramblings

I've downloaded and unpacked the data file on my PowerBook G4 laptop. Yes, I'm using a low-powered, 3+ year old computer for this. The problem looks like something for a relational database; certainly that would be the best place to store the data. But I'm thinking of doing the entire thing in Lisp, just to reduce dependencies on other software. I could use PostgreSQL, since there is a Lisp package that makes direct calls into it. Maybe I'll change my mind in the future and use it.

I think my first step will be to just read in the data (or just read through it as there is so much of it) and look for relations. I'm thinking that people who think alike will like the same movies. Maybe I can find those groups and predict how someone in a group will rate a movie based on what other people in the group thought.
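To make the group idea concrete, here is a minimal sketch of just the prediction step, assuming the groups have already been found somehow. Everything in it is hypothetical: the function name, the hash table layout, and the toy ratings are made up for illustration and are not from the contest data or from my actual code.

```lisp
;; Hypothetical data layout: RATINGS maps a user id to an alist of
;; (movie-id . stars). None of these names come from the contest data.
(defun group-prediction (ratings group movie &optional (default 3.5))
  "Average the ratings that members of GROUP gave MOVIE.
Return DEFAULT when nobody in GROUP has rated MOVIE."
  (let ((sum 0)
        (count 0))
    (dolist (user group)
      ;; A user missing from RATINGS yields NIL, and ASSOC on NIL is NIL.
      (let ((entry (assoc movie (gethash user ratings))))
        (when entry
          (incf sum (cdr entry))
          (incf count))))
    (if (zerop count)
        default
        (float (/ sum count)))))

;; Made-up toy data: three users, two movies.
(defparameter *toy-ratings* (make-hash-table))
(setf (gethash 1 *toy-ratings*) '((10 . 4) (11 . 2))
      (gethash 2 *toy-ratings*) '((10 . 5))
      (gethash 3 *toy-ratings*) '((11 . 3)))

;; Users 1 and 2 both rated movie 10, so the guess is their average.
(group-prediction *toy-ratings* '(1 2 3) 10)
```

In a real run the user being predicted would have to be left out of the group, and the hard part, deciding who belongs in which group, is not shown here at all.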

After much doing of nothing, I finally got some code together. It doesn't do much. But I have read in the training set. I've also generated my first submission file. I gave every movie a rating of 3.5 stars. This is the response I got.

For David Steuber:
Your prediction file submitted 2007-01-19 09:51:23 has been decompressed and processed.
The computed RMSE for the quiz subset was 1.1421.

I'm also thinking of trying random guesses to see how that works out.

I've created a probe-data.txt file which contains the ratings from the probe.txt file. It would probably be a good idea to remove the probe data from the training set so I don't have to code around ignoring it. I also created an rmse-vs-probe-data function for testing against the probe data. To try it out, I guessed 3.5 stars against all the movies in the probe file and got an RMSE of 1.1407237.
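For reference, the RMSE is the square root of the mean of the squared differences between predicted and actual ratings. The sketch below shows that calculation on an in-memory list. It is not my rmse-vs-probe-data function, which reads from a file; the name rmse and the list-of-conses input are assumptions made only for this example.

```lisp
;; Hypothetical helper, not the rmse-vs-probe-data function itself.
;; PAIRS is a list of (predicted . actual) conses.
(defun rmse (pairs)
  "Return the root mean squared error over PAIRS of (predicted . actual)."
  (let ((sum 0.0d0)
        (count 0))
    (dolist (pair pairs)
      (let ((err (- (car pair) (cdr pair))))
        (incf sum (* err err))
        (incf count)))
    (if (zerop count)
        (error "RMSE is undefined for an empty list of pairs.")
        (sqrt (/ sum count)))))

;; Guessing 3.5 against three made-up actual ratings (3, 4 and 5).
;; The errors are 0.5, -0.5 and -1.5, so the mean squared error is 2.75/3.
(rmse '((3.5 . 3) (3.5 . 4) (3.5 . 5)))
```

On these three made-up ratings the result is roughly 0.957. Summing into a double-float is a small design choice: over more than a million pairs, a single-float accumulator could lose precision.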

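I'm never sure of my math, so I wanted to make absolutely sure that I was calculating the RMSE correctly (Netflix provides a Perl program as an example). As a check, I wrote out a test file in which every guess is the actual rating plus 0.5, which should give an RMSE of exactly 0.5.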

CL-USER> (time (apply-guessing-to-file *probe-file* 
                                       *probe-test*
                                       (lambda (a b) 
                                         (+ 0.5 (get-user-movie-rating a b)))))
                                                                         
Evaluation took:
  488.07 seconds of real time
  248.79 seconds of user run time
  56.82 seconds of system run time
  [Run times include 38.0 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  3,916,359,792 bytes consed.
NIL
CL-USER> (time (rmse-vs-probe-data *probe-test*))
1408395 pairs RMSE: 0.5
Evaluation took:
  108.995 seconds of real time
  70.95 seconds of user run time
  17.24 seconds of system run time
  [Run times include 18.98 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  1,967,183,808 bytes consed.
0.5

This transcript lets out a major secret: my code runs slowly on my machine. Well, I am not optimizing for speed yet. I'm still just casting about for the right formula to relate a user who is about to rate a movie to users who have already rated that movie.

Predicting Individual Human Behavior

Doesn't this seem impossible?