Recently I have been playing around with OCR (optical character recognition):
http://en.wikipedia.org/wiki/Optical_character_recognition
Tesseract is a good open-source OCR engine and the first one I tried, but its documentation is sparse, which is a little annoying. Some people have asked how to make Tesseract recognize only digits. You can find some hints in the FAQ on Tesseract's wiki or in the README, but here is what I found.
Environment : Ubuntu 8.04
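1. Add a config file named digits to /usr/local/share/tessdata/configs/
2. The file should look like this:
filename : digits
file content : tessedit_char_whitelist 0123456789
The file name must match the config name passed on the command line in step 3.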
3. Change your tesseract command as below
Example:
./tesseract ~/image.tif ~/output nobatch digits
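The steps above can also be scripted. Below is a minimal Python sketch, not a definitive implementation: the helper names `write_digits_config` and `build_command` are my own (hypothetical), the config name `digits` is assumed to match the command above, and the argument order simply mirrors that command. The demo writes to a temporary directory because writing under /usr/local/share/tessdata/configs/ usually needs root.

```python
import tempfile
from pathlib import Path

# The single config line from step 2: restrict recognition to digits.
WHITELIST_LINE = "tessedit_char_whitelist 0123456789\n"


def write_digits_config(configs_dir, name="digits"):
    """Create the whitelist config file in configs_dir and return its path."""
    configs_dir = Path(configs_dir)
    configs_dir.mkdir(parents=True, exist_ok=True)
    config_path = configs_dir / name
    config_path.write_text(WHITELIST_LINE)
    return config_path


def build_command(image, output_base, config_name="digits", binary="tesseract"):
    """Build the argument list in the same order as the command in the post."""
    return [binary, str(image), str(output_base), "nobatch", config_name]


if __name__ == "__main__":
    # Demo only: a temporary directory stands in for tessdata/configs.
    demo_dir = tempfile.mkdtemp()
    config = write_digits_config(demo_dir)
    print("wrote", config)
    # Hypothetical image and output paths, as in the example above.
    print(" ".join(build_command(Path.home() / "image.tif", Path.home() / "output")))
```

To actually run the OCR, point `write_digits_config` at your real tessdata/configs directory and pass the list from `build_command` to `subprocess.run`.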
Have fun. Honestly, though, the digit recognition is not as good as I expected.
Monday, May 3, 2010
What is the "best" font for text recognition by tesseract? I can choose how I print these out, but I'm finding that some fonts work better than others.
I like it! Thanks for publishing.
What about training on digits in a specific font to improve accuracy for that font?
It's so helpful.
Thanks.