Monday, May 3, 2010

How to make tesseract (OCR) recognize only "Digits"?

Recently, I have been playing around with OCR (Optical character recognition):
http://en.wikipedia.org/wiki/Optical_character_recognition

tesseract is a good open source OCR engine and the first one I tried, but it lacks documentation, which is a little annoying. Some people have asked how to make tesseract recognize only "Digits". You may find some hints in the FAQ on tesseract's wiki or in the README, but here is what I found.

Environment : Ubuntu 8.04
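1. Add a config file named "digits" to /usr/local/share/tessdata/configs/
   (the file name must match the config name passed on the command line in step 3)
2. The config file looks like this:
filename : digits
file content : tessedit_char_whitelist 0123456789
3. Pass the config name as the last argument of your tesseract command
Ex: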
./tesseract ~/image.tif ~/output nobatch digits
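
The steps above can be sketched as a small shell script. This is only a minimal sketch: write_digits_config is a hypothetical helper name, not part of tesseract, and the file is named digits so that it matches the config name passed in the command above. The demo writes into a scratch directory so it runs without root.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): write a digits-only
# config file into the given tessdata "configs" directory.
# $1: target directory, e.g. /usr/local/share/tessdata/configs
write_digits_config() {
    mkdir -p "$1" && printf 'tessedit_char_whitelist 0123456789\n' > "$1/digits"
}

# Demo: use a scratch directory so the sketch runs without root.
demo_dir="$(mktemp -d)"
write_digits_config "$demo_dir"

# Show what was written.
cat "$demo_dir/digits"
```

On the setup described in this post, the target directory would be /usr/local/share/tessdata/configs (writing there probably needs root), and the tesseract command above then picks the file up by its name, digits.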

Have fun! Honestly, it is not as good at recognizing "DIGITS" as I thought it would be.

5 comments:

  1. Hey, I'm the webmaster from Yantram BPO Pvt Ltd. I like your blog; the information is
    truly good and informative as well. We also provide data entry services. If you want to discuss anything about data entry, you can contact us on this website:

    http://data-entry.outsourcing-services-india.com

  2. What is the "best" font for text recognition by tesseract? I can choose how I print these out, but I'm finding that some fonts work better than others.

  3. I like it! Thanks for publishing.

  4. What about training the digits on a certain font to improve accuracy for that font?
