Recently, my netflix program has lousy running performance, I spent a whole day to learn how to program by SSE/SSE2/SIMD(it can add/multiple 4 float point registers at one instruction, in other words, a 4 for loop can be reduced to 1 for loop, awesome!!) and find a lot of good stuffs, of course, I wrote my own at my play ground, really cool technique skill to reduce running time.
If you don't know what is SSE.
SSE at wiki
SIMD at wiki
Single instruction multiple data (SIMD)
Some SSE examples
Introduction to SSE Programming By Alex Fr
SSE(w/ nice graph for register)
Using Inline Assembly in C/C++
I try the float point code, it is ok for cygwin(gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)), but can't run at Ubuntu(), why?
Pointer and Array
But always has disadvantage
By wiki SIMD:
Gathering data into SIMD registers and scattering it to the correct destination locations is tricky and can be inefficient.
Yes, it did, it is really tricky to correct/set/arrange the register/pointer/destination :(
but finally, i set everything right..