Netflix Prize Update

I finally got myself out of my tenth place rut by implementing kNN. I used BellKor's item-item kNN model as the basis. However, I couldn't get their non-negative optimizer to work stably enough (and their reference for the algorithm is non-free), so I wrote my own (not non-negative) from Numerical Recipes. I'm still having occasional stability issues with solving the system of linear equations (close to singular, perhaps), but it works well in more than 99.9% of the cases.

I gained about 0.1% (7.59% -> 7.70%) just by adding in a modest rendition of the BellKor removed-global-effects kNN (no factorization or neural network pre-processing yet). Hopefully, I'll get the others added in soon.

There are a few newcomers to the leaderboard not too far behind. It's nice to see the competition heating back up a little.

[Read more about my Netflix Prize work]

Iowa Caucus

Tonight, in the United States state of Iowa, a caucus is being held to select the party candidates that will run in the general election to be held nationally next November. Technically, the primary delegate votes in Iowa will be weighted in with the delegates from every other state at the party conventions, but in practicality, the results of the Iowa caucus (and to a slightly lesser extent the New Hampshire primary voting next Tuesday) will decide the candidates for the rest of the nation.

The population of Iowa is about three million, or 1% of the total population of the nation. Active participation in the caucuses is estimated around 130,000 or so, which comes to 0.04% of the nation. However, considerable momentum is asserted by the winners, mainly due to the notion that the greatest success of the party is garnered from an appearance of uniformity and consensus; also, if secondary candidates can be eliminated early, less money can be spent on duking it out in advertisements against same-party candidates.

The practical upshot for voters throughout the rest of the nation, however, is that their votes don't matter. By the time February 5th rolls around (the day I can vote in the primary in Massachusetts), there will probably be a de facto winner chosen, and my vote won't matter. The problem is not simply a matter of selecting the order that you count the ballots, where after counting a certain portion of the ballots, you get a good idea of who the winner will be. It is a matter of the early voters and caucus-goers indirectly influencing all the later voters to herd like sheep toward the preferred candidate.

Of course, maybe Iowa is like a mini United States, and the smaller pool of people lets lower-profile candidates get around and more easily get their messages across to the populace. However, a brief examination of some census figures quickly put that idea to shame. Iowa is only 2.5% black and 3.8% hispanic, compared with 13% black and 15% hispanic nationally.

And this doesn't end with the primaries. In the United States, we use a delegate system to choose presidents, with delegates weighted by population but also weighted more to smaller states. Most states, including my own, Massachusetts, will assign all the delegate votes to the candidate that wins a simple majority vote in the state. My state specifically has tended to vote very strongly for the the Democrat party for decades. The upshot is that candidates don't care about Massachusetts (since the vote will go Democrat, anyway), and, given that a certain subset of Mass. voters will vote in the defined way, everyone else's votes, the people that actually pay attention, don't matter.

The bottom line is that while I will go out and vote both times, my vote has a near 100% chance of not counting at all.

(note: I wrote this a little loosely, but from some exposure to formal analyses of voting systems, I understand that it is possible to define "voting power" rigorously, looking at what power each person's vote has in possibly changing the outcome. In some trivial systems, such as two wolves and a sheep using a simple majority vote to decide what's for dinner, there is a subset of voters that wield no power at all in affecting the outcome)

Update: The New York Times published an excellent article saying pretty much the same thing, with more detail about the caucus process: http://theboard.blogs.nytimes.com/2008/01/03/report-from-iowa-democracy-it-aint/. This is a little amusing, though, since the NYT website front page is currently all about Iowa, fanning the flames of the problem about which the linked article complains.

Happy New Year

The last week has been pretty crazy.

As perhaps an indication of my nerdiness, I was gifted (thank you!) a Playstation 3 for Christmas - not for gaming, but for programming. The PS3 packs a powerful computational punch, and it ranks very high in computational throughput per hardware dollar. Installation of PSUbuntu was pretty simple (I used a text-only server package), and the only command I had to type into the USB keyboard I got for it was "sudo apt-get install openssh-server." After that, I could disconnect the crappy TV (and keyboard) and just SSH in from a computer with a good display. (update: I installed Yellow Dog Linux instead, as it advertises wireless support... installation of that was also a breeze)

I've long written multi-threaded code with the intent of load-balancing between multiple processors, but only a few months ago did I get my first multi-CPU system, based on an Intel Quad Core Q6600. In October, I wrote some code for the NVidia G80 chipset, using two 8800GTS video cards in parallel to do fruitless calculations for the Netflix Prize. I haven't read extensively on the IBM Cell processor, the core of the PS3 system, but it seems that it combines some features of both approaches to create a nice hybrid design that seems like it might be more useful for Machine Learning applications than GPU.

The main issue at hand will be the very low memory ceiling of 192 MB. I have some ideas as to how to structure the Netflix data, but I'm not sure if it will actually be any benefit compared to my other systems. It will be fun learning about the new architecture, though!

Netflix Prize Update

I know that some people read my RSS feed here, which unfortunately doesn't reflect updates to static pages. So, I guess I'll try blogging a bit.

I'm very happy to now be in tenth place in the Netflix Prize. Prior to December 19th, each of my submissions contained a sizable bug in the blending program. An array was sized one element smaller than it needed to be, and the final value spilled over into another array. To make things interesting, the bug would only have a lot of influence with a certain combination of result sets.

Once I got rid of the bug, I started adding back in dozens of result sets I had previously abandoned (while trying to eliminate possible "bad" result sets that could be polluting the blend). Hence, the sudden jump in score around a week ago isn't really sustainable, meaning that if you try to fit a curve to my score progress, you would certainly err in the direction of being overly optimistic.

Nonetheless, I think there's still some progress to be made. I have yet to implement the kNN method. I've been reading up on that and hope to have something written in the next week, and I have plenty of residual sets to build from there. It seems that BellKor had a particularly good time with kNN using a variety of different residuals. It will be interesting to see how much these will add to my blend.

As for other models, I never got RBM to work (much gnashing of teeth here... I must be missing something subtle in the UToronto paper). Instead, I have substituted a directed neural network that seems to provide similar support to the blend.

The recent surprise that boosted my scores from ~7.33% to ~7.46% was the inclusion of my formerly-abandoned pure NNMF, meaning NNMF with the original, unrecentered ratings. I re-included one result set after reading a forum post by Yehuda Koren in which he said that NNMF without recentering was one of BellKor's strongest models. Wouldn't you know...

More to come in the future. I'll start to divulge some of my secrets as I feel I settle into a given score. My methods may or may not add to the other teams' scores above me (the oddest things can turn out to be redundant), but I'd like to climb the ladder before I make the ladder harder to climb.