This week I started to learn Pearson correlation and kNN for the Netflix Prize, and summarized all the related information below. My biggest question is about running time again (it was also a big issue for my previous SVD algorithm). Somebody said that all movie-to-movie Pearson correlations (17770 * 17770, or about 316 million correlations, though you only need the upper triangle of the correlation matrix) can be finished within 2 minutes. Two minutes :( Is that just for attracting people's eyes? My stupid brute-force program needs 7 hours to calculate only one movie (movie 30, "Something's Gotta Give": 2.3 MB of data, 118,413 total ratings), so honestly I don't 100% believe all of them can be calculated within 2 minutes.
Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours), Updated on June/11/2009
Fast way to compute correlations?
My kNN source code for Icefox's framework
Thursday, May 14, 2009
Sunday, May 10, 2009
Learning Curve of Netflix Prize
After a couple of days of experiments, I think my "Basic" SVD has hit its limit, but I still have some questions about the SVD-like model approach for the Netflix Prize.
1. Why, in my experiment, didn't 100 features get a better score than 32 features? Is it over-fitting (100 features), or were some of my parameters not set well for 100 features?
2. How do you decide how many features is best for a given problem, and how do you find the best prediction score across different parameter settings (features, learning rate, K)? Or do you just try, try and try?
Following PragmaticTheory's link, it is time to move on to the kNN algorithm.
My "Basic" SVD-like model's best score is RMSE = 0.9137 with 32 features, K = 0.015, Lrate = 0.01.
Thursday, May 7, 2009
SSE Basic Tutorial
Recently my Netflix program has had lousy running performance, so I spent a whole day learning how to program with SSE/SSE2/SIMD (it can add/multiply 4 floating-point values in one instruction; in other words, a 4-iteration loop can be reduced to 1, awesome!!) and found a lot of good stuff. Of course, I wrote my own code in my playground; it is a really cool technique for reducing running time.
If you don't know what SSE is:
SSE at wiki
Intel SSE
SIMD at wiki
Single instruction multiple data (SIMD)
Some SSE examples
SSE examples
Introduction to SSE Programming By Alex Fr
SSE(w/ nice graph for register)
Using Inline Assembly in C/C++
I tried the floating-point code; it is OK under Cygwin (gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)), but it can't run on Ubuntu(). Why?
Pointer and Array
Basic tutorial
But there is always a disadvantage. From the Wikipedia SIMD article:
Gathering data into SIMD registers and scattering it to the correct destination locations is tricky and can be inefficient.
Yes, it really is tricky to set and arrange the registers/pointers/destinations correctly :(
But finally, I set everything right.
Tuesday, May 5, 2009
Open Source Collections for Netflix Prize
C++/C
Linear Regression program of team Gravity
Icefox
Kadence's KNN (based on Icefox)
TD
Turnbeutel (SVD, based on TD)
symbol1
Source code for 0.9046 of netflix forum
Python - played around with it for a couple of weeks; very poor performance for SVD. People said you need to write an extension in C/C++ (ha ha, extend with C/C++?), but it is a very concise language :)
Pyflix by Deus ex silicon
Java - curious for java performance
jnetflix
And :)......
NAPOLEON DYNAMITE for Netflix Prize - Part2
By PragmaticTheory(No.1, May/5/2009)
Thread about Napoleon Dynamite at Netflix Prize Forum
Tons of great sharing on this issue. By the way, I like ADifferentName's post; he gathered a list of valuable information.