Michael's Blog: 2009

Sunday, August 2, 2009

Engine Yard SHA-1 hash programming contest

Programming Contest! Win iPhone 3GS & $2,000 Cloud Credit on July 14, 2009
Contest Kick off post
Winner announce post
Winner's blog
Unofficial leaderboard
Winner's Source Code
But I can't find the check.c source file, ??

Tuesday, June 30, 2009

Before/After "BellKor's Pragmatic Chaos" appear

"BellKor's Pragmatic Chaos" submitted a prediction which has a 10% improvement than Cinematch on June/26/2009, that means they can grab $1M if nobody can beat them after 30days, so due day is July/26/2009, and the world become more interesting and energetic !!

Before
1.Michael Jackson died on June/25/2009 (my pop idol..)

After
2.Netflix Leaderboard only show top 100, but June/30/2009, it can show top 20,50,100,250,1000.
3."Grand Prize Team" pass "PragmaticTheory" on June/30/2009
4."xlvector" pass "Progress Prize 2008" and became No.6 on June/29/2009
5."Vandelay Industries !" became No.9 and go up 12 position on June/29/2009
6."Vandelay Industries !" became No.7 on July/01/2009
7."Vandelay Industries !" suddenly disappear on July/02/2009, does it violate the rule?

What a wonderful world :],
suddenly everyone become a super man or they already hide for a long time and need a push sign..

Thursday, May 14, 2009

Pearson Correlation of Netflix Prize

This week I start to learn "Pearson Correlation", "KNN", for Netflix prize, and summarized all related information as below. My biggest question is about running time(it happens again, in my previous SVD algorithm, running time was a big issue), some body said it can be finished all movie-to-movie Pearson Correlation (calculate 17770*17770, or about 316 million Pearson correlations) within 2 minutes(I know you can just calculate the upper triangle correlation matrix). Two minutes:( Is it just for attracting people's eyes? In my stupid brute force program to calculate only one(Movie 30(Something's Gotta Give) : size(2.3MB) : total ratings : 118413), needs 7 hours, so honestly, I don't 100% believe how can you calculate all within 2 minutes.

Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours)
Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours), Updated on June/11/2009

Fast way to compute correlations?
My kNN source code for Icefox's framework

Sunday, May 10, 2009

Learning Curve of Netflix Prize

After couple days experiments, I thought my "Basic" SVD hit the limitation, but I still have some questions for SVD-like model approach for Netflixe prize.
1.Why in my experiment, 100 features didn't get the better score than 32 features, is it over-fitting(100 features)? or some of my parameters didn't set well for 100 features?
2.How to decide how many feature is best for different problem, how to set the best prediction score of different parameter(feature, learning rate, K) setting? or just try try and try?

By PragmaticTheory's link, it is time to move on kNN algorithm.
My "Basic" SVD-like model's best score is RMSE=0.9137 w/ 32 features, K=0.015, Lrate=0.01

Thursday, May 7, 2009

SSE Basic Tutorial

Recently, my netflix program has lousy running performance, I spent a whole day to learn how to program by SSE/SSE2/SIMD(it can add/multiple 4 float point registers at one instruction, in other words, a 4 for loop can be reduced to 1 for loop, awesome!!) and find a lot of good stuffs, of course, I wrote my own at my play ground, really cool technique skill to reduce running time.

If you don't know what is SSE.
SSE at wiki
Intel SSE
SIMD at wiki
Single instruction multiple data (SIMD)

Some SSE examples
SSE examples
Introduction to SSE Programming By Alex Fr
SSE(w/ nice graph for register)
Using Inline Assembly in C/C++
I try the float point code, it is ok for cygwin(gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)), but can't run at Ubuntu(), why?

Pointer and Array
Basic tutorial

But always has disadvantage
By wiki SIMD:
Gathering data into SIMD registers and scattering it to the correct destination locations is tricky and can be inefficient.
Yes, it did, it is really tricky to correct/set/arrange the register/pointer/destination :(
but finally, i set everything right..

Tuesday, May 5, 2009

Open Source Collections for Netflix Prize

C++/C
Linear Regression program of team Gravity
Icefox
Kadence's KNN(base on Icefox)
TD
Turnbeutel(SVD, base on TD)
symbol1
Source code for 0.9046 of netflix forum

Python - played around for couple weeks, very poor performance for SVD, people said need to write an extension by C/C++, ha ha extend by C/C++?, but very concise language :)
Pyflix by Deus ex silicon

Java - curious for java performance
jnetflix

And :)......

NAPOLEON DYNAMITE for Netflix Prize - Part2

By PragmaticTheory(No.1, May/5/2009)

Thread about Napoleon Dynamite at Netflix Prize Forum

Tons of great sharing about this issue, by the way, I like the "ADifferentName"'s post, he gathered a list of valuable information.

Wednesday, April 29, 2009

NAPOLEON DYNAMITE for Netflix Prize

If You Liked This, You’re Sure to Love That
By CLIVE THOMPSON
Published: November 23, 2008, New York Times

The reason, Bertoni says, is that “Napoleon Dynamite” is very weird and very polarizing. It contains a lot of arch, ironic humor, including a famously kooky dance performed by the titular teenage character to help his hapless friend win a student-council election. It’s the type of quirky entertainment that tends to be either loved or despised. The movie has been rated more than two million times in the Netflix database, and the ratings are disproportionately one or five stars.

Mathematically speaking, “Napoleon Dynamite” is a very significant problem for the Netflix Prize.

I draw a matlab graph for this "Napoleon Dynamite" base on my 3 prediction files, it did is a polarizing movie. ha ha, amazing.By the way, in "qualificaiton.txt" file, we have 10551 prediction entries of "Napoleon Dynamite" need to be guessed. In other words, we have 10551 weird behavior kids need to be controlled well, Wow..
If we can’t understand our own preferences, can computers really be any better at it?

Matlab Command
plot(Small_x,svdMix3151(1:100),'g-o',Small_x,userAVG3151(1:100),'r-x',Small_x,movie3151AVG(1:100),'k:');
legend('svd','useraverage','movieaverage');
title('Movie3151(2004,Napoleon Dynamite) first 100 prediction comparison');
xlabel('User Id(not sequential)');
ylabel('Movie Rating - predict value');

Tuesday, April 28, 2009

Tweak SVD V1

today, i spent almost 6 hours to run my tweak SVD algorithm for Netflix prize problem, fortunately, it did a better job than my previous 20 hours SVD algorithm ^_^

MAX_FEATURES = 32
MIN_EPOCHS = 60
Total_Count = 1920
rmse = 0.785448

"qualifying.txt" RMSE = 0.9177
Rank = 952 (April/27/2009)

Thursday, April 23, 2009

Movie 1(2003,Dinosaur Planet) Predict comparisons

MATLAB:
plot(x,movie1_svd,'g',x,useraverage,'r--',x,movieaverage,'k:');
legend('svd','useraverage','movieaverage');
title('Movie1(2003,Dinosaur Planet) prediction comparison');
xlabel('User Id(not sequential)');
ylabel('Movie Rating - predict value');

I don't know why I have this idea to plot this diagram, just want to know the relation of SVD, useraverage, movieaverage.

Python vs C++

After 10 more days experiment, python's poor performance is the reason for my to change the language to C++/C, I found some information about how to write a C extension for python, but it is not easy as the description. But the advantage of python is the concise code style, tons of module to use, it is a double side sword, easy coding make poor running speed. If you need performance, use C, if you need easy coding, use python. It should be my wrong choice at beginning, at least I learn a new language and some experience.

By the way, thanks Turnbeutel and TD's sharing, I ran their code for 22 hours, time consuming project
120 = MIN_EPOCHS
200 = MAX_EPOCHS
64 = FEATURES
3000 counts RMSE = 0.783169(y value)
6000 counts RMSE = 0.7491227
Final RMSE = 0.735715

1 count make 7.5 seconds(By Simon Funk's blog, yes, it real cost around 7.5 seconds, amazing)
min epochs = 120, at least, it will cost 120*7.5 = 900 seconds = 15 minutes for 1 feature
features = 64, so we total need 64*15=960 minutes = 16 hours, but it is the minimal, my wee ibmx60s has 3G ram, 1.6 dual core cpu, spent almost 22 hours to finish the running.

Another method to calculate the total time
total count = 7680, each count spends 7.5s, 7680*7.5=57,600 second = 960 minutes = 16 hours

Tuesday, April 21, 2009

Install Boost at Ubuntu

3.1 Basic Procedure
1.Get Boost; see sections 1 and 2 [Unix/Linux, Windows] of the Boost Getting Started Guide.
2.Get the bjam build driver. See section 5 [Unix/Linux, Windows] of the Boost Getting Started Guide.
3.cd into the libs/python/example/quickstart/ directory of your Boost installation, which contains a small example project.
4.Invoke bjam. Replace the “stage“ argument from the example invocation from section 5 of the Getting Started Guide with “test,“ to build all the test targets. Also add the argument “--verbose-test” to see the output generated by the tests when they are run.

On Windows, your bjam invocation might look something like:
C:\boost_1_34_0\…\quickstart> bjam toolset=msvc --verbose-test test

and on Unix variants, perhaps,
~/boost_1_34_0/…/quickstart$ bjam toolset=gcc --verbose-test test

Monday, April 20, 2009

what is the "steroids" of python?

Psyco Home

Make Python run as fast as C with Psyco

Recently, I try to increase the python running speed, found Psyco module, but don't have the "steroids" effect, or should make a C extension to Python?

Extending and Embedding the Python Interpreter

The Python Tutorial

A Better-Looking Drive

Annika Sorenstam shows Bill Pennington how to drive the ball farther.

Sunday, April 19, 2009

Learning about top wine variety

1.Chardonnay
Chardonnay is a green-skinned grape variety used to make white wine. It is believed to have originated in the Burgundy wine region of eastern France but is now grown wherever wine is produced, from England to New Zealand. For new and developing wine regions, growing Chardonnay is seen as a "rite of passage" and an easy segue into the international wine market.

2.Pinot noir
Pinot noir (IPA: [pi.no.'nwaʁ]) is a red wine grape variety of the species Vitis vinifera. The name may also refer to wines produced predominantly from Pinot noir grapes. The name is derived from the French words for "pine" and "black" alluding to the varietals' tightly clustered dark purple pine cone-shaped bunches of fruit.

3.Merlot
Merlot is a red wine grape that is used as both a blending grape and for varietal wines. Merlot-based wines usually have medium body with hints of berry, plum, and currant. Its softness and "fleshiness", combined with its earlier ripening, makes Merlot a popular grape for blending with the sterner, later-ripening Cabernet Sauvignon, which tends to be higher in tannin.

4.Riesling
Riesling is a white grape variety which originates in the Rhine region of Germany. Riesling is an aromatic grape variety displaying flowery, almost perfumed, aromas as well as high acidity. It is used to make dry, semi-sweet, sweet and sparkling white wines.

5.Shiraz
Syrah is a dark-skinned grape grown throughout the world and used primarily to produce powerful red wines. Syrahs enjoy great popularity in the marketplace, relatively often under the name Shiraz.

6.Cabernet Sauvignon
Cabernet Sauvignon is one of the world's most widely recognized red wine grape varieties. It is grown in nearly every major wine producing country among a diverse spectrum of climates from Canada's Okanagan Valley to Lebanon's Beqaa Valley. Cabernet Sauvignon became internationally recognized through its prominence in Bordeaux wines where it is often blended with Merlot and Cabernet franc.

Saturday, April 18, 2009

Netflix contest Numbers

Total Movie Number = MovieIDs range from 1 to 17,770 sequentially ~ 17 thousands
Total Customer Number = CustomerIDs range from 1 to 2649429, with gaps. There are 480,189 users.
Rating Range = Ratings are on a five star (integral) scale from 1 to 5.

Total Rating Number = 17770*480189 = 8,532,958,530 ~ 8.5 billion entries
Total Training set Number = 100,480,507 ~ 100 million
[The movie rating files contain over 100 million ratings from Netlfix README]
we have 100M entries and 8.4B empty cells, 100millsion/8.5billion ~ 0.01, only have 1% entries of this ginormous 8.5 billion entries matrix

how many entries need to predict in "qualifying.txt"??
the total line of "qualifying.txt" is 2,834,601(needs to minus movieID) ~ 2.8 million

Thursday, April 16, 2009

AI Season 8 - Adam Lambert

Adam Lambert, a name you wouldn't forget for American Idol Season 8, honestly, from my personal perspective, he will be champion of this season, he is in different level comparing with other contester, the biggest difference between he and other competitor is he can sing any style of song, and sing perfectly to suit any type, unmatchable with others. He must be the winner this AI season.

Wednesday, April 15, 2009

SVD - singular value decomposition

thanks Simon Funk's blog, he makes SVD looks like so simple, awesome!!

you can combine pyflix and nullity's SVD code together to make a netflix prize prediction of SVD algorithm, cool, honestly, I think this SVD algorithms can beat a lot of recommendation system in the world.
nullity's SVD on pyflix

Tuesday, April 14, 2009

Netflix Prize Blogs

I summary some very interesting blogs about this recommendation contest..

The rank of Netflix Prize (04/14/2009)

Pragmatic Theory - No.1
We are two simple guys with no academic background in either machine learning or mathematics. We both have bachelors degrees in engineering (software and electrical). We work full time as computer engineers in a totally unrelated field (telecommunications).

Just a guy in a garage - No.19
not an academic or a mathematican. However, being an unemployed psychologist I do have a bit of time.

Simon Funk,Try This at Home - No.193
Simon Funk is a pen name, having obtained a BA from UCSD at the age of 18.

timelydevelopment - No.652
Timely Development is an Atlanta-based provider of software and services.

Monday, April 13, 2009

Pyflix - 04/13/2009

Basic algorithms
====================================================================================
By Bold Raved Tithe
The best I achieved on the probe with predictions removed is:
RMSE 0.9897

For the following equation:
R = -2.58797372 + 0.95682133*M + 0.7606549*U

Where:
R = predicted rating
M = movie average
U = user average
====================================================================================
pyflix
class Mix(MovieAverage,UserAverage):
'''Returns the average of L{MovieAverage} and L{UserAverage}.
This algorithm returns an RMSE score of ?? on the scrubbed dataset.
'''

def __call__(self, movie_id, user_id):
return (
-2.58797372 + 0.95682133*MovieAverage.__call__(self,movie_id,user_id) +
0.7606549*UserAverage.__call__(self,movie_id,user_id)
)
====================================================================================
Check this interesting article
This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize

Sunday, April 12, 2009

Pyflix - 04/12/2009

Pyflix
If you put your algorithm in a module under the algorithms package (e.g. pyflix.algorithms.myalg.py), you can test it on the probe set and get its RMSE by running:

python pyflix/algorithms/run.py myalg.MyFancyAlgorithm path/to/database

===========================================================================
Under my environment

E:\pyflix-0.1>python pyflix/algorithms/run.py average.DoubleAverage e:\pyflix-0.
1\database
E:\pyflix-0.1\pyflix\_path.py:37: DeprecationWarning: the md5 module is deprecat
ed; use hashlib instead
import sys, warnings, os, fnmatch, glob, shutil, codecs, md5
[********************************************************************] 100%
Computed RMSE in 41 minutes and 56.5 seconds
average.DoubleAverage: RMSE=1.0158

===========================================================================
how to generate a prediction file by DoubleAverage algorithm
modify run.py
(1) beginning
from pyflix.algorithms import Algorithm
from pyflix.algorithms.average import DoubleAverage

(2) main function
rmse = timeCall('Computed RMSE', probe_set.rmse, model, progressbar=True)
print '%s.%s: RMSE=%.4f' % (algorithm.__module__, algorithm.__name__, rmse)

a = DoubleAverage(RatedDataset(basedir/'training_set'))
a.writePredictions('E:/np/download/qualifying.txt','DoubleAverage_pre.txt')

(3) winxp cmd
E:\pyflix-0.1>python pyflix/algorithms/run.py average.DoubleAverage e:\pyflix-0.
1\database
E:\pyflix-0.1\pyflix\_path.py:37: DeprecationWarning: the md5 module is deprecat
ed; use hashlib instead
import sys, warnings, os, fnmatch, glob, shutil, codecs, md5
[********************************************************************] 100%
Computed RMSE in 3 minutes and 59.2 seconds
average.DoubleAverage: RMSE=1.0158
[********************************************************************] 100%
==================================================================================
Submit your prediction file
(1) gzip or zip your file
(2) check the MD5
(3) submit your result (my firefox can't submit, only IE can, weird??)

Saturday, April 11, 2009

Pyflix - 04/11/2009

(1) from pyflix.datasets import RatedDataset
you need to be the father directory (in my case, I am at /media/80G/np folder)
(2) then goto python command line
(3) the this "from pyflix.datasets import RatedDataset" will works
=> 奇怪的python directory convention

Basic usage

Pyflix comes with full documentation under docs/api/index.html, created with the excellent Epydoc package. Below is a sample interactive session to demonstrate the dataset API:
======================================================================================
Python 2.4.2 (#1, Mar 8 2006, 13:24:00)
[GCC 3.4.4 20050721 (Red Hat 3.4.4-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # the main module you'll need is pyflix.datasets
... from pyflix.datasets import RatedDataset

Pyflix - 04/10/2009

Pyflix
Error Message of pyflix
==================================================
Traceback (most recent call last):
File "pyflix/setup.py", line 13, in
from pyflix import timeCall
ImportError: No module named pyflix
==================================================

the better solution should understand the module loading method of python, then change the module name of pyflix

the module name needs to change, my temp solution is

(1) add a new file init.py (which was copied from __init__.py)
(2) change setup.py code
#Old
from pyflix import timeCall
from pyflix._path import path
from pyflix.records import *
from pyflix.progressbar import ProgressBar, progress

#New
from init import timeCall
from _path import path
from records import *
from progressbar import ProgressBar, progress

===========================================================
michael@michael-laptop:/media/80G/np/pyflix$ python setup.py -i /media/80G/np/download/ -o database
* Reading /media/80G/np/download/probe.txt...
[********************************************************************] 100%
* Building rating files...
[********************************************************************] 100%
4544779 training set entries rated 1
9995998 training set entries rated 2
28458811 training set entries rated 3
33288865 training set entries rated 4
22783659 training set entries rated 5
73211 probe set entries rated 1
136082 probe set entries rated 2
352436 probe set entries rated 3
462093 probe set entries rated 4
384573 probe set entries rated 5
Split files in 19 minutes and 49.5 seconds
* Sorting database/training_set/1.tmp, database/training_set/2.tmp, database/training_set/3.tmp, database/training_set/4.tmp, database/training_set/5.tmp...
Sorted files in 4 minutes and 21.5 seconds
* Merging sorted files from database/training_set/sorted_by_movie...
[********************************************************************] 100%
Merged movies in 16 minutes and 40.4 seconds
* Merging sorted files from database/training_set/sorted_by_user...
[********************************************************************] 100%
Merged users in 20 minutes and 18.1 seconds
* Sorting database/probe_set/1.tmp, database/probe_set/2.tmp, database/probe_set/3.tmp, database/probe_set/4.tmp, database/probe_set/5.tmp...
Sorted files in 0 minutes and 3.3 seconds
* Merging sorted files from database/probe_set/sorted_by_movie...
[********************************************************************] 100%
Merged movies in 0 minutes and 19.7 seconds
* Merging sorted files from database/probe_set/sorted_by_user...
[********************************************************************] 100%
Merged users in 3 minutes and 30.0 seconds
* Completed successfully in 65 minutes and 2.5 seconds

Thursday, April 9, 2009

City Data Website

City Data
Stats about all US cities - real estate, relocation info, house prices, home value estimator, recent sales, cost of living, crime, race, income, photos, education, maps, weather, houses, schools, neighborhoods, and more...

Netflix Prize Forum's Summary(4/25/07)

1. Netflix Recommender Framework
Icefox，免費提供c++ source code(我有試過,可是不能RUN,我ㄉ電腦有些它需要軟體好像沒有),而且最近一版ㄉCODE 裡,有提供SVD algorithm (singular value decomposition (SVD))

http://www.netflixprize.com/community/viewtopic.php?id=352
http://sifter.org/~simon/journal/20061211.html (這個是寫SVD 那個人ㄉ網頁)

2. Pyflix - The pythonic Netflix Prize
Deus ex silicon，免費提供python source code

http://www.netflixprize.com/community/viewtopic.php?id=432

3. Useful public RMSEs
一些基本ㄉ猜測ㄉRMSE(像是全猜1或2)
下面這個怪公式ㄉRMSE 可以低於1
For the following equation:
R = -2.58797372 + 0.95682133*M + 0.7606549*U

Where:
R = predicted rating
M = movie average
U = user average

http://www.netflixprize.com/community/viewtopic.php?id=435

4.Not sure how to calculate RMSE
如何計算RMSE

http://www.netflixprize.com/community/viewtopic.php?id=625

5.A better spreadsheet than Excel ?
這個人用EXCEL
http://www.netflixprize.com/community/viewtopic.php?id=483

6.Anyone help with Collaborative
算兩個userㄉ相關

http://www.netflixprize.com/community/viewtopic.php?pid=4924#p4924

7. Ramblings on the lower bound on the RMSE
http://www.netflixprize.com/community/viewtopic.php?id=78

RMSE-------perfect%----random%
0.2887-----100%--------0%
0.8563-----48%---------52%
0.9514-----34%---------66%
0.98--------30%---------70%
1.0540-----18%---------82%
1.1547-----0%---------100%

8.Why RMSE=1.79 for random ratings?
亂猜也有RMSE=1.79

http://www.netflixprize.com/community/viewtopic.php?id=224

Thursday, March 26, 2009

Just love this cute picture.

By Bill Mayer
By ANDREW ROSS SORKIN

Where’s the Plan, Wall Street?

Sunday, March 22, 2009

My younger brother wedding overtune

It has my and my bro little little picture at the beginning.
Both of us are so cute, thanks my papa and mama, they gave us the wonderful,termedous support.

Saturday, March 21, 2009

穿新娘禮服逃跑??

說來奇怪,應該是我本人怪吧
我超愛看有穿新娘禮服逃跑的日本連續劇(大和拜金女[松島菜菜子],求婚大作戰[長澤亞美])
又有新貨到啦
2009 春季新貨(她成功的理由, [堀北真希,黑木明紗]) 耶!!

My first driving range in US

Today we went to Alley Pond Golf driving range(Little Neck, Long Island) w/ my previous colleague Allen, golf enthusiast. Big bucket costs 10 dollar. US is the heaven of sports, I love it.

Friday, March 20, 2009

1 million Euro Car - Lamborghini Reventon

我也要

Watch Lamborghini Reventon vs. Tornado in Autos & Vehicles | View More Free Videos Online at Veoh.com

Wednesday, March 18, 2009

The most famous people in Teacher College, Columbia University

I worked for EdLab 5 months, Pranav(my college) for 4 months, all of us don't know who is the famous people in Teacher College, Columbia.

The most famous people is "John Dewey" 約翰杜威
There is a head stature at the entry 1st floor, we saw John's head stature every morning. How can we don't know who is "John Dewey". Haaaaaaa, so embarrassment.

The difference bettern Times Square and Upper West at Carmine's Restaurant

如果同一家餐廳同時在上東/西跟 Time Square area
請選 Upper West/East !!
我去過 Carmine Restauant 的上西跟 Time Square 的兩家店

Upper East Carmine : 大概是08年的聖誕節前夕, 我們整個 Gottesman Library 的 Stuff 去吃聖誕節前的大餐(吃中午),服務生好到不行,門口還有check coat, 爽度一百

Time Square Carmine : 也大概是08年的聖誕節前夕, 跟 Lynn的姐夫 Ronnie (Texas guy) 那個服務生居然感覺的到 Ronnie 不是 New Yorker(Time Square 太多觀光客), 就問他從哪來, 知道後, 居然還唸了小布希一頓, 也唸了Ronnie 一頓, 說你們德州佬爛之類的話,
臭小子妳真有種阿我可是客人ㄟ