Monday, May 3, 2010

How to make tesseract (OCR) recognize only "Digits"?

Recently, I have been playing around with OCR (optical character recognition):
http://en.wikipedia.org/wiki/Optical_character_recognition

tesseract is a good open source OCR engine and the first one I tried, but its documentation is sparse, which is a little annoying. Some people have asked how to make tesseract recognize only digits. You can find some hints in the FAQ on tesseract's wiki or in the README, but here is what I found.

Environment: Ubuntu 8.04
1. Add a file named "digits" to /usr/local/share/tessdata/configs/ (the name must match the config name passed on the command line in step 3)
2. The file looks like this:
filename: digits
file content: tessedit_char_whitelist 0123456789
3. Change your tesseract command as below
Ex:
./tesseract ~/image.tif ~/output nobatch digits
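
The steps above can be sketched as a small shell helper. This is only a sketch: write_digits_config is a name I made up, it is not part of tesseract. The tessdata directory is passed in as an argument because it may differ on your install. The file is named digits so that it matches the config name given at the end of the command above.

```shell
#!/bin/sh
# Sketch only: write_digits_config is a hypothetical helper, not a tesseract tool.

# Write a "digits" config file under <tessdata dir>/configs/.
# $1 = tessdata directory (assumed: /usr/local/share/tessdata on this setup)
write_digits_config() {
    tessdata_dir="$1"
    mkdir -p "$tessdata_dir/configs" || return 1
    # One line: the variable name, a space, then the allowed characters.
    printf 'tessedit_char_whitelist 0123456789\n' > "$tessdata_dir/configs/digits"
}

# Example (needs write access to the directory, e.g. run with sudo):
#   write_digits_config /usr/local/share/tessdata
#   ./tesseract ~/image.tif ~/output nobatch digits
```

To allow a few more characters (say a decimal point), you would add them to the whitelist string in the same line.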

Have fun! Honestly, though, the digit recognition was not as good as I expected.