Michael's Blog: April 2009

Wednesday, April 29, 2009

NAPOLEON DYNAMITE for Netflix Prize

If You Liked This, You’re Sure to Love That
By CLIVE THOMPSON
Published: November 23, 2008, New York Times

The reason, Bertoni says, is that “Napoleon Dynamite” is very weird and very polarizing. It contains a lot of arch, ironic humor, including a famously kooky dance performed by the titular teenage character to help his hapless friend win a student-council election. It’s the type of quirky entertainment that tends to be either loved or despised. The movie has been rated more than two million times in the Netflix database, and the ratings are disproportionately one or five stars.

Mathematically speaking, “Napoleon Dynamite” is a very significant problem for the Netflix Prize.

I draw a matlab graph for this "Napoleon Dynamite" base on my 3 prediction files, it did is a polarizing movie. ha ha, amazing.By the way, in "qualificaiton.txt" file, we have 10551 prediction entries of "Napoleon Dynamite" need to be guessed. In other words, we have 10551 weird behavior kids need to be controlled well, Wow..
If we can’t understand our own preferences, can computers really be any better at it?

Matlab Command
plot(Small_x,svdMix3151(1:100),'g-o',Small_x,userAVG3151(1:100),'r-x',Small_x,movie3151AVG(1:100),'k:');
legend('svd','useraverage','movieaverage');
title('Movie3151(2004,Napoleon Dynamite) first 100 prediction comparison');
xlabel('User Id(not sequential)');
ylabel('Movie Rating - predict value');

Tuesday, April 28, 2009

Tweak SVD V1

today, i spent almost 6 hours to run my tweak SVD algorithm for Netflix prize problem, fortunately, it did a better job than my previous 20 hours SVD algorithm ^_^

MAX_FEATURES = 32
MIN_EPOCHS = 60
Total_Count = 1920
rmse = 0.785448

"qualifying.txt" RMSE = 0.9177
Rank = 952 (April/27/2009)

Thursday, April 23, 2009

Movie 1(2003,Dinosaur Planet) Predict comparisons

MATLAB:
plot(x,movie1_svd,'g',x,useraverage,'r--',x,movieaverage,'k:');
legend('svd','useraverage','movieaverage');
title('Movie1(2003,Dinosaur Planet) prediction comparison');
xlabel('User Id(not sequential)');
ylabel('Movie Rating - predict value');

I don't know why I have this idea to plot this diagram, just want to know the relation of SVD, useraverage, movieaverage.

Python vs C++

After 10 more days experiment, python's poor performance is the reason for my to change the language to C++/C, I found some information about how to write a C extension for python, but it is not easy as the description. But the advantage of python is the concise code style, tons of module to use, it is a double side sword, easy coding make poor running speed. If you need performance, use C, if you need easy coding, use python. It should be my wrong choice at beginning, at least I learn a new language and some experience.

By the way, thanks Turnbeutel and TD's sharing, I ran their code for 22 hours, time consuming project
120 = MIN_EPOCHS
200 = MAX_EPOCHS
64 = FEATURES
3000 counts RMSE = 0.783169(y value)
6000 counts RMSE = 0.7491227
Final RMSE = 0.735715

1 count make 7.5 seconds(By Simon Funk's blog, yes, it real cost around 7.5 seconds, amazing)
min epochs = 120, at least, it will cost 120*7.5 = 900 seconds = 15 minutes for 1 feature
features = 64, so we total need 64*15=960 minutes = 16 hours, but it is the minimal, my wee ibmx60s has 3G ram, 1.6 dual core cpu, spent almost 22 hours to finish the running.

Another method to calculate the total time
total count = 7680, each count spends 7.5s, 7680*7.5=57,600 second = 960 minutes = 16 hours

Tuesday, April 21, 2009

Install Boost at Ubuntu

3.1 Basic Procedure
1.Get Boost; see sections 1 and 2 [Unix/Linux, Windows] of the Boost Getting Started Guide.
2.Get the bjam build driver. See section 5 [Unix/Linux, Windows] of the Boost Getting Started Guide.
3.cd into the libs/python/example/quickstart/ directory of your Boost installation, which contains a small example project.
4.Invoke bjam. Replace the “stage“ argument from the example invocation from section 5 of the Getting Started Guide with “test,“ to build all the test targets. Also add the argument “--verbose-test” to see the output generated by the tests when they are run.

On Windows, your bjam invocation might look something like:
C:\boost_1_34_0\…\quickstart> bjam toolset=msvc --verbose-test test

and on Unix variants, perhaps,
~/boost_1_34_0/…/quickstart$ bjam toolset=gcc --verbose-test test

Monday, April 20, 2009

what is the "steroids" of python?

Psyco Home

Make Python run as fast as C with Psyco

Recently, I try to increase the python running speed, found Psyco module, but don't have the "steroids" effect, or should make a C extension to Python?

Extending and Embedding the Python Interpreter

The Python Tutorial

A Better-Looking Drive

Annika Sorenstam shows Bill Pennington how to drive the ball farther.

Sunday, April 19, 2009

Learning about top wine variety

1.Chardonnay
Chardonnay is a green-skinned grape variety used to make white wine. It is believed to have originated in the Burgundy wine region of eastern France but is now grown wherever wine is produced, from England to New Zealand. For new and developing wine regions, growing Chardonnay is seen as a "rite of passage" and an easy segue into the international wine market.

2.Pinot noir
Pinot noir (IPA: [pi.no.'nwaʁ]) is a red wine grape variety of the species Vitis vinifera. The name may also refer to wines produced predominantly from Pinot noir grapes. The name is derived from the French words for "pine" and "black" alluding to the varietals' tightly clustered dark purple pine cone-shaped bunches of fruit.

3.Merlot
Merlot is a red wine grape that is used as both a blending grape and for varietal wines. Merlot-based wines usually have medium body with hints of berry, plum, and currant. Its softness and "fleshiness", combined with its earlier ripening, makes Merlot a popular grape for blending with the sterner, later-ripening Cabernet Sauvignon, which tends to be higher in tannin.

4.Riesling
Riesling is a white grape variety which originates in the Rhine region of Germany. Riesling is an aromatic grape variety displaying flowery, almost perfumed, aromas as well as high acidity. It is used to make dry, semi-sweet, sweet and sparkling white wines.

5.Shiraz
Syrah is a dark-skinned grape grown throughout the world and used primarily to produce powerful red wines. Syrahs enjoy great popularity in the marketplace, relatively often under the name Shiraz.

6.Cabernet Sauvignon
Cabernet Sauvignon is one of the world's most widely recognized red wine grape varieties. It is grown in nearly every major wine producing country among a diverse spectrum of climates from Canada's Okanagan Valley to Lebanon's Beqaa Valley. Cabernet Sauvignon became internationally recognized through its prominence in Bordeaux wines where it is often blended with Merlot and Cabernet franc.

Saturday, April 18, 2009

Netflix contest Numbers

Total Movie Number = MovieIDs range from 1 to 17,770 sequentially ~ 17 thousands
Total Customer Number = CustomerIDs range from 1 to 2649429, with gaps. There are 480,189 users.
Rating Range = Ratings are on a five star (integral) scale from 1 to 5.

Total Rating Number = 17770*480189 = 8,532,958,530 ~ 8.5 billion entries
Total Training set Number = 100,480,507 ~ 100 million
[The movie rating files contain over 100 million ratings from Netlfix README]
we have 100M entries and 8.4B empty cells, 100millsion/8.5billion ~ 0.01, only have 1% entries of this ginormous 8.5 billion entries matrix

how many entries need to predict in "qualifying.txt"??
the total line of "qualifying.txt" is 2,834,601(needs to minus movieID) ~ 2.8 million

Thursday, April 16, 2009

AI Season 8 - Adam Lambert

Adam Lambert, a name you wouldn't forget for American Idol Season 8, honestly, from my personal perspective, he will be champion of this season, he is in different level comparing with other contester, the biggest difference between he and other competitor is he can sing any style of song, and sing perfectly to suit any type, unmatchable with others. He must be the winner this AI season.

Wednesday, April 15, 2009

SVD - singular value decomposition

thanks Simon Funk's blog, he makes SVD looks like so simple, awesome!!

you can combine pyflix and nullity's SVD code together to make a netflix prize prediction of SVD algorithm, cool, honestly, I think this SVD algorithms can beat a lot of recommendation system in the world.
nullity's SVD on pyflix

Tuesday, April 14, 2009

Netflix Prize Blogs

I summary some very interesting blogs about this recommendation contest..

The rank of Netflix Prize (04/14/2009)

Pragmatic Theory - No.1
We are two simple guys with no academic background in either machine learning or mathematics. We both have bachelors degrees in engineering (software and electrical). We work full time as computer engineers in a totally unrelated field (telecommunications).

Just a guy in a garage - No.19
not an academic or a mathematican. However, being an unemployed psychologist I do have a bit of time.

Simon Funk,Try This at Home - No.193
Simon Funk is a pen name, having obtained a BA from UCSD at the age of 18.

timelydevelopment - No.652
Timely Development is an Atlanta-based provider of software and services.

Monday, April 13, 2009

Pyflix - 04/13/2009

Basic algorithms
====================================================================================
By Bold Raved Tithe
The best I achieved on the probe with predictions removed is:
RMSE 0.9897

For the following equation:
R = -2.58797372 + 0.95682133*M + 0.7606549*U

Where:
R = predicted rating
M = movie average
U = user average
====================================================================================
pyflix
class Mix(MovieAverage,UserAverage):
'''Returns the average of L{MovieAverage} and L{UserAverage}.
This algorithm returns an RMSE score of ?? on the scrubbed dataset.
'''

def __call__(self, movie_id, user_id):
return (
-2.58797372 + 0.95682133*MovieAverage.__call__(self,movie_id,user_id) +
0.7606549*UserAverage.__call__(self,movie_id,user_id)
)
====================================================================================
Check this interesting article
This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize

Sunday, April 12, 2009

Pyflix - 04/12/2009

Pyflix
If you put your algorithm in a module under the algorithms package (e.g. pyflix.algorithms.myalg.py), you can test it on the probe set and get its RMSE by running:

python pyflix/algorithms/run.py myalg.MyFancyAlgorithm path/to/database

===========================================================================
Under my environment

E:\pyflix-0.1>python pyflix/algorithms/run.py average.DoubleAverage e:\pyflix-0.
1\database
E:\pyflix-0.1\pyflix\_path.py:37: DeprecationWarning: the md5 module is deprecat
ed; use hashlib instead
import sys, warnings, os, fnmatch, glob, shutil, codecs, md5
[********************************************************************] 100%
Computed RMSE in 41 minutes and 56.5 seconds
average.DoubleAverage: RMSE=1.0158

===========================================================================
how to generate a prediction file by DoubleAverage algorithm
modify run.py
(1) beginning
from pyflix.algorithms import Algorithm
from pyflix.algorithms.average import DoubleAverage

(2) main function
rmse = timeCall('Computed RMSE', probe_set.rmse, model, progressbar=True)
print '%s.%s: RMSE=%.4f' % (algorithm.__module__, algorithm.__name__, rmse)

a = DoubleAverage(RatedDataset(basedir/'training_set'))
a.writePredictions('E:/np/download/qualifying.txt','DoubleAverage_pre.txt')

(3) winxp cmd
E:\pyflix-0.1>python pyflix/algorithms/run.py average.DoubleAverage e:\pyflix-0.
1\database
E:\pyflix-0.1\pyflix\_path.py:37: DeprecationWarning: the md5 module is deprecat
ed; use hashlib instead
import sys, warnings, os, fnmatch, glob, shutil, codecs, md5
[********************************************************************] 100%
Computed RMSE in 3 minutes and 59.2 seconds
average.DoubleAverage: RMSE=1.0158
[********************************************************************] 100%
==================================================================================
Submit your prediction file
(1) gzip or zip your file
(2) check the MD5
(3) submit your result (my firefox can't submit, only IE can, weird??)

Saturday, April 11, 2009

Pyflix - 04/11/2009

(1) from pyflix.datasets import RatedDataset
you need to be the father directory (in my case, I am at /media/80G/np folder)
(2) then goto python command line
(3) the this "from pyflix.datasets import RatedDataset" will works
=> 奇怪的python directory convention

Basic usage

Pyflix comes with full documentation under docs/api/index.html, created with the excellent Epydoc package. Below is a sample interactive session to demonstrate the dataset API:
======================================================================================
Python 2.4.2 (#1, Mar 8 2006, 13:24:00)
[GCC 3.4.4 20050721 (Red Hat 3.4.4-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # the main module you'll need is pyflix.datasets
... from pyflix.datasets import RatedDataset

Pyflix - 04/10/2009

Pyflix
Error Message of pyflix
==================================================
Traceback (most recent call last):
File "pyflix/setup.py", line 13, in
from pyflix import timeCall
ImportError: No module named pyflix
==================================================

the better solution should understand the module loading method of python, then change the module name of pyflix

the module name needs to change, my temp solution is

(1) add a new file init.py (which was copied from __init__.py)
(2) change setup.py code
#Old
from pyflix import timeCall
from pyflix._path import path
from pyflix.records import *
from pyflix.progressbar import ProgressBar, progress

#New
from init import timeCall
from _path import path
from records import *
from progressbar import ProgressBar, progress

===========================================================
michael@michael-laptop:/media/80G/np/pyflix$ python setup.py -i /media/80G/np/download/ -o database
* Reading /media/80G/np/download/probe.txt...
[********************************************************************] 100%
* Building rating files...
[********************************************************************] 100%
4544779 training set entries rated 1
9995998 training set entries rated 2
28458811 training set entries rated 3
33288865 training set entries rated 4
22783659 training set entries rated 5
73211 probe set entries rated 1
136082 probe set entries rated 2
352436 probe set entries rated 3
462093 probe set entries rated 4
384573 probe set entries rated 5
Split files in 19 minutes and 49.5 seconds
* Sorting database/training_set/1.tmp, database/training_set/2.tmp, database/training_set/3.tmp, database/training_set/4.tmp, database/training_set/5.tmp...
Sorted files in 4 minutes and 21.5 seconds
* Merging sorted files from database/training_set/sorted_by_movie...
[********************************************************************] 100%
Merged movies in 16 minutes and 40.4 seconds
* Merging sorted files from database/training_set/sorted_by_user...
[********************************************************************] 100%
Merged users in 20 minutes and 18.1 seconds
* Sorting database/probe_set/1.tmp, database/probe_set/2.tmp, database/probe_set/3.tmp, database/probe_set/4.tmp, database/probe_set/5.tmp...
Sorted files in 0 minutes and 3.3 seconds
* Merging sorted files from database/probe_set/sorted_by_movie...
[********************************************************************] 100%
Merged movies in 0 minutes and 19.7 seconds
* Merging sorted files from database/probe_set/sorted_by_user...
[********************************************************************] 100%
Merged users in 3 minutes and 30.0 seconds
* Completed successfully in 65 minutes and 2.5 seconds

Thursday, April 9, 2009

City Data Website

City Data
Stats about all US cities - real estate, relocation info, house prices, home value estimator, recent sales, cost of living, crime, race, income, photos, education, maps, weather, houses, schools, neighborhoods, and more...

Netflix Prize Forum's Summary(4/25/07)

1. Netflix Recommender Framework
Icefox，免費提供c++ source code(我有試過,可是不能RUN,我ㄉ電腦有些它需要軟體好像沒有),而且最近一版ㄉCODE 裡,有提供SVD algorithm (singular value decomposition (SVD))

http://www.netflixprize.com/community/viewtopic.php?id=352
http://sifter.org/~simon/journal/20061211.html (這個是寫SVD 那個人ㄉ網頁)

2. Pyflix - The pythonic Netflix Prize
Deus ex silicon，免費提供python source code

http://www.netflixprize.com/community/viewtopic.php?id=432

3. Useful public RMSEs
一些基本ㄉ猜測ㄉRMSE(像是全猜1或2)
下面這個怪公式ㄉRMSE 可以低於1
For the following equation:
R = -2.58797372 + 0.95682133*M + 0.7606549*U

Where:
R = predicted rating
M = movie average
U = user average

http://www.netflixprize.com/community/viewtopic.php?id=435

4.Not sure how to calculate RMSE
如何計算RMSE

http://www.netflixprize.com/community/viewtopic.php?id=625

5.A better spreadsheet than Excel ?
這個人用EXCEL
http://www.netflixprize.com/community/viewtopic.php?id=483

6.Anyone help with Collaborative
算兩個userㄉ相關

http://www.netflixprize.com/community/viewtopic.php?pid=4924#p4924

7. Ramblings on the lower bound on the RMSE
http://www.netflixprize.com/community/viewtopic.php?id=78

RMSE-------perfect%----random%
0.2887-----100%--------0%
0.8563-----48%---------52%
0.9514-----34%---------66%
0.98--------30%---------70%
1.0540-----18%---------82%
1.1547-----0%---------100%

8.Why RMSE=1.79 for random ratings?
亂猜也有RMSE=1.79

http://www.netflixprize.com/community/viewtopic.php?id=224

Michael's Blog