Gl.ib.ly

(glibly); Just another techie blog.

Doing OCR on Linux/Mac

Posted by Tariq • Monday, July 27. 2009 • Category: Programming, Tidbits
Yesterday somebody gave me a USB key with ~1000 JPEGs on it. Each JPEG was a scanned page (ugh), and the task was to find some useful information about topic X. Each JPEG was about 1-2 MB, and I needed to do something useful with these images quickly. So what follows is a quick walkthrough of how to do Optical Character Recognition (OCR: taking silly image files and ripping out any text identified in them) on Linux or, in my case, a Mac.

You may be interested to read on if:
  1. You are a forensic investigator. It is pretty common for an investigator to use a program like strings to create dictionaries for cracking passwords. Are you also adding text you find in images to those dictionaries?
  2. You receive a ton of faxes each day. I haven't done this yet, but imagine if faxes came in, got printed, and then an image was stored on a NAS somewhere. If you had a cron script which processed those images, pulled out the sender, date, and text, then popped the lot into a database somewhere, that would be awesome!
  3. You hate the idea of sending images of text by email.


Luckily the tools we need are already out there on the web, so we don't need to get our hands dirty with coding. A search for OCR programs turned up tesseract-ocr. The program, tesseract, is in a word awesome.

Installing tesseract-ocr
  1. Download tesseract-ocr.
  2. Extract with tar xzf tesseract-X.XX.tar.gz
  3. Enter the newly created tesseract directory; cd tesseract-X.XX
  4. Run: ./configure
  5. Run: make
  6. Run: sudo make install


The program should now be installed. Next, download and install the language data files. They are named in the format tesseract-VERSION.LANG.tar.gz; extract the contents as we did in step 2 above. This should leave you with a directory called tessdata. Move this directory to /usr/local/share/, so that you now have a directory called /usr/local/share/tessdata/
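A quick way to confirm the language files landed in the right place is sketched below. The helper name have_lang is made up for illustration, and it assumes the data files are named with the language code as a prefix (for example eng.something), which may differ between tesseract versions.

```shell
# have_lang DIR LANG: succeed if DIR holds at least one file named LANG.*
# (hypothetical helper; assumes language files start with the language code)
have_lang() {
    for f in "$1/$2".*; do
        # if the glob matched nothing, $f is the literal pattern and -e fails
        [ -e "$f" ] && return 0
    done
    return 1
}
```

For example, have_lang /usr/local/share/tessdata eng || echo "no English data found" will complain if the move above went wrong.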

You may also want to download and install tiff before all that. On the Mac I think I used sudo port install tiff


Installing unpaper
This is a great little program for making those scanned images nice and neat; e.g. it correctly orientates your text, gets rid of bad artifacts, and lots of other stuff. I downloaded the package from the unpaper website and did the same as steps 2, 3, 5, and 6 in the installing tesseract-ocr section above.

Another indispensable program you will need is ImageMagick. It should be installable with either sudo port install imagemagick on Mac or sudo apt-get install imagemagick on Debian Linux.

When you see the command convert below, that is a bit of ImageMagick!
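Before going further it can be worth checking that all three tools actually ended up on your PATH. This is only a small sketch; missing_tools is a made-up helper name, and it relies on nothing more than the standard command -v lookup.

```shell
# missing_tools NAME...: print each NAME that is not found on the PATH
# (hypothetical helper, not part of any of the packages above)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools convert unpaper tesseract prints nothing when everything is installed, and otherwise lists whatever still needs installing.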


Doing the cool stuff

Here is the workflow:
  1. Get your images into a suitable format using convert.
  2. If you're using unpaper to clean up your files, run it now. COSTLY: all operations took ~40 seconds per page with this step, compared to ~7 seconds without it (on my laptop). If your images are well scanned then you can skip this step.
  3. Feed the output to tesseract
  4. Read/do stuff with the output file.



Doing a bit of unpapering gives better results, but I could live with the odd misspelling or extra character here and there. I couldn't live with how long the process took though. It's probably only worth it if you have terribly scanned images or you're extremely picky.
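To put those per-page timings in perspective for a batch of ~1000 pages, here is the arithmetic as a tiny shell sketch. The function name estimate_minutes is just for illustration, and the figures are the rough ones from my laptop quoted above.

```shell
# estimate_minutes PAGES SECONDS_PER_PAGE: whole minutes for the batch
# (illustrative helper; uses integer shell arithmetic, so it rounds down)
estimate_minutes() {
    echo $(( $1 * $2 / 60 ))
}

estimate_minutes 1000 40   # with unpaper
estimate_minutes 1000 7    # without unpaper
```

That works out to 666 minutes (about 11 hours) with unpaper versus 116 minutes (about 2 hours) without.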


Luckily I wrote a script to do all this.
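The script itself isn't shown here, so what follows is only a minimal sketch of what an ocr.sh along these lines might look like, not the original. The function names ocr_page and ocr_all and the USE_UNPAPER switch are made up for illustration. It assumes convert, tesseract and (optionally) unpaper are on your PATH, that unpaper is happy with PGM input, and that mktemp -d works without a template on your system.

```shell
#!/bin/sh
# Sketch only: an ocr.sh-style helper, not the original script.

# ocr_page IMAGE: OCR one image and print the recognised text on stdout.
ocr_page() {
    img=$1
    work=$(mktemp -d) || return 1
    status=0

    if [ "${USE_UNPAPER:-0}" = 1 ]; then
        # unpaper reads and writes PNM files, so go via a greyscale PGM
        convert "$img" "$work/raw.pgm" &&
            unpaper "$work/raw.pgm" "$work/clean.pgm" &&
            convert "$work/clean.pgm" "$work/page.tif" || status=1
    else
        convert "$img" "$work/page.tif" || status=1
    fi

    if [ "$status" -eq 0 ]; then
        # tesseract adds the .txt suffix to the output base name itself
        tesseract "$work/page.tif" "$work/page" >/dev/null 2>&1 &&
            cat "$work/page.txt" || status=1
    fi

    rm -rf "$work"
    return "$status"
}

# ocr_all OUTPUTFILE IMAGE...: OCR every image, appending to one text file.
ocr_all() {
    if [ "$#" -lt 2 ]; then
        echo "usage: ocr.sh outputfile image..." >&2
        return 2
    fi
    out=$1
    shift
    for f in "$@"; do
        # append, so repeated runs keep adding to the same output file
        printf '==== %s ====\n' "$f" >> "$out"
        ocr_page "$f" >> "$out" || echo "ocr.sh: failed on $f" >&2
    done
}
```

To use this as a standalone ocr.sh, add a final line reading ocr_all "$@" and make the file executable. Setting USE_UNPAPER=1 in the environment switches on the slow clean-up step.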


Run it like so:

./ocr.sh outputfile *.JPG
 


If you're into forensics you could easily do something like find / ... -exec ./ocr.sh /imagestrings.txt {} \; (although you may want to add a cleanup routine). That's all for now.
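One follow-up for the forensics case: once the OCR text is collected in a file, turning it into a cracking dictionary is mostly a matter of splitting it into unique words. The sketch below is one way to do that with standard tools; make_wordlist is a made-up name, and forcing the C locale means only ASCII letters and digits count as word characters.

```shell
# make_wordlist: read text on stdin, print each distinct word once.
# (illustrative helper; case-sensitive, ASCII letters and digits only)
make_wordlist() {
    LC_ALL=C tr -cs '[:alnum:]' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}
```

Something like make_wordlist < /imagestrings.txt > wordlist.txt would then give one candidate word per line.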

