This week I started to learn Pearson correlation and kNN for the Netflix Prize, and summarized all the related information below. My biggest question is about running time again (it was also a big issue for my previous SVD algorithm). Somebody said that all movie-to-movie Pearson correlations (17770 * 17770, or about 316 million correlations, though you only need the upper triangle of the correlation matrix) can be finished within 2 minutes. Two minutes :( Is that just for attracting people's eyes? My stupid brute-force program needs 7 hours to calculate only one movie (movie 30, "Something's Gotta Give": 2.3 MB of data, 118,413 total ratings), so honestly I don't 100% believe all of them can be calculated within 2 minutes.
Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours), Updated on June/11/2009
Fast way to compute correlations?
My kNN source code for Icefox's framework
Thursday, May 14, 2009
Sunday, May 10, 2009
Learning Curve of Netflix Prize
After a couple of days of experiments, I think my "Basic" SVD has hit its limit, but I still have some questions about the SVD-like model approach for the Netflix Prize.
1. Why, in my experiment, didn't 100 features get a better score than 32 features? Is it over-fitting (100 features), or were some of my parameters not set well for 100 features?
2. How do you decide how many features is best for a given problem, and how do you find the best prediction score across different parameter settings (features, learning rate, K)? Or do you just try, try and try?
Following PragmaticTheory's link, it is time to move on to the kNN algorithm.
My "Basic" SVD-like model's best score is RMSE = 0.9137 with 32 features, K = 0.015, Lrate = 0.01.
Thursday, May 7, 2009
SSE Basic Tutorial
Recently my Netflix program has had lousy running performance, so I spent a whole day learning how to program with SSE/SSE2/SIMD (it can add/multiply 4 floating-point values in one instruction; in other words, a 4-iteration loop can be reduced to 1, awesome!!) and found a lot of good stuff. Of course, I wrote my own code in my playground; it is a really cool technique for reducing running time.
If you don't know what SSE is:
SSE at wiki
Intel SSE
SIMD at wiki
Single instruction multiple data (SIMD)
Some SSE examples
SSE examples
Introduction to SSE Programming By Alex Fr
SSE(w/ nice graph for register)
Using Inline Assembly in C/C++
I tried the floating-point code; it is OK under Cygwin (gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)), but it can't run on Ubuntu(). Why?
Pointer and Array
Basic tutorial
But there is always a disadvantage. From the Wikipedia SIMD article:
Gathering data into SIMD registers and scattering it to the correct destination locations is tricky and can be inefficient.
Yes, it really is tricky to set and arrange the registers/pointers/destinations correctly :(
But finally, I set everything right.
Tuesday, May 5, 2009
Open Source Collections for Netflix Prize
C++/C
Linear Regression program of team Gravity
Icefox
Kadence's KNN (based on Icefox)
TD
Turnbeutel (SVD, based on TD)
symbol1
Source code for 0.9046 of netflix forum
Python - played around with it for a couple of weeks; very poor performance for SVD. People said you need to write an extension in C/C++ (ha ha, extend with C/C++?), but it is a very concise language :)
Pyflix by Deus ex silicon
Java - curious for java performance
jnetflix
And :)......
NAPOLEON DYNAMITE for Netflix Prize - Part2
By PragmaticTheory(No.1, May/5/2009)
Thread about Napoleon Dynamite at Netflix Prize Forum
Tons of great sharing on this issue. By the way, I like ADifferentName's post; he gathered a list of valuable information.