Doing OCR on Linux/Mac
Posted by Tariq • Monday, July 27. 2009 • Category: Programming, Tidbits
Yesterday somebody gave me a USB key with ~1000 JPEGs on it. Each JPEG was a scanned page, ugh, and the task was to find some useful information about topic X. Each JPEG was about 1-2 MB and I needed to do something useful with these images quickly. So what follows is a quick walkthrough of how to do Optical Character Recognition (OCR, which means taking silly image files and ripping out any text identified in them) on Linux or, in my case, a Mac.
You may be interested to read on if:
- You are a forensic investigator. It is pretty common for an investigator to use a program like strings to create dictionaries for cracking passwords. Are you also adding text you find in images to those dictionaries?
- You receive a ton of faxes each day. Haven't done this yet, but imagine if faxes came in, got printed and then an image was stored on a NAS somewhere. If you had some cron script which processed those images, pulling out sender, date, and text, then popped the lot into a database somewhere, that would be awesome!
- You hate the idea of sending images of text by email.
Luckily the tools we need are already out there on the web so we don't need to get our hands dirty with coding. A search for OCR programs revealed tesseract-ocr. The program tesseract (more on the name) is, in a word, awesome.
Installing tesseract-ocr
1. Download tesseract-ocr.
2. Extract with: tar xzf tesseract-X.XX.tar.gz
3. Enter the newly created tesseract directory: cd tesseract-X.XX
4. Run: ./configure
5. Run: make
6. Run: sudo make install
The program should now be installed. Next, download and install the language data files. They're named in the format tesseract-VERSION.LANG.tar.gz; extract the contents as we did in step 2 above. This should leave you with a directory called tessdata. Move this directory to /usr/local/share/, so that you now have a directory called /usr/local/share/tessdata/.
You may also want to download and install tiff before all that. On the Mac I think I used sudo port install tiff.
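For the language data step, here is a minimal sketch of a helper that copies an extracted tessdata directory into place. The function name install_tessdata and its optional second argument are made up for illustration; only the /usr/local/share/tessdata/ location comes from the steps above.

```shell
# Hypothetical helper, not part of tesseract itself.
# Usage: install_tessdata SRC_DIR [DEST_PARENT]
# Copies the contents of SRC_DIR into DEST_PARENT/tessdata.
# DEST_PARENT defaults to /usr/local/share as described above.
install_tessdata() {
    src=$1
    dest_parent=${2:-/usr/local/share}

    # Refuse to continue if the extracted directory is not there.
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi

    # Create the target if needed, then copy the language files in.
    mkdir -p "$dest_parent/tessdata" || return 1
    cp -R "$src"/. "$dest_parent/tessdata"/
}
```

Writing to the default location needs root, so on a real install a one-off sudo mv tessdata /usr/local/share/ does the same job. The function form is mostly handy for trying the layout out in a scratch directory first.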
Installing unpaper
This is a great little program for making those scanned images nice and neat; e.g. it correctly orientates your text, gets rid of bad artifacts, and lots of other stuff. I downloaded the packages at the unpaper website and did the same as steps 2, 3, 5, and 6 in the installing tesseract-ocr section above.
Another indispensable program you will need is ImageMagick. It should be installable with either sudo port install imagemagick on Mac or sudo apt-get install imagemagick on Debian Linux.
When you see the command convert below, that is a bit of ImageMagick!
Doing the cool stuff
Here is the work flow:
- Get your images into a suitable format using convert.
- If you're using unpaper to clean up your files, run it now. COSTLY: all operations took ~40 seconds per page compared to ~7 seconds without this step (on my laptop). If your images are well scanned then you can skip this step.
- Feed the output to tesseract.
- Read/do stuff with the output file.
Doing a bit of unpapering gives better results, but I could live with the odd misspelling or extra character here and there. I couldn't live with how long the process took though. Probably only worth it if you have terribly scanned images or you're extremely picky.
Luckily I wrote a script to do all this.
Run it like so:
./ocr.sh outputfile *.JPG
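For reference, here is a rough sketch of what a wrapper along these lines could look like. It is not the original ocr.sh. The USE_UNPAPER switch, the temporary file names, the detour through PGM for unpaper and the -compress none option are all assumptions on my part, so check them against the versions of the tools you have installed.

```shell
#!/bin/sh
# Hypothetical ocr.sh-style wrapper (a sketch, not the original script).
# Usage: ./ocr.sh outputfile image1.jpg [image2.jpg ...]
# Set USE_UNPAPER=1 in the environment to run the slow cleanup step.
USE_UNPAPER=${USE_UNPAPER:-0}

ocr_images() {
    local out=$1
    shift
    local tmp img
    tmp=$(mktemp -d) || return 1

    for img in "$@"; do
        if [ "$USE_UNPAPER" = 1 ]; then
            # Assumption: unpaper wants PNM input, so go via greyscale PGM.
            convert "$img" "$tmp/raw.pgm" &&
            unpaper "$tmp/raw.pgm" "$tmp/clean.pgm" &&
            convert "$tmp/clean.pgm" -compress none "$tmp/page.tif"
        else
            # Assumption: an uncompressed TIFF is the safest tesseract input.
            convert "$img" -compress none "$tmp/page.tif"
        fi || { echo "skipping $img" >&2; rm -f "$tmp"/*; continue; }

        # tesseract writes its text to <outputbase>.txt; append it to $out.
        tesseract "$tmp/page.tif" "$tmp/page" && cat "$tmp/page.txt" >> "$out"

        # Clear per-page files so the next image starts clean.
        rm -f "$tmp"/*
    done

    rm -rf "$tmp"
}

# Only run when called with an output file and at least one image.
if [ "$#" -ge 2 ]; then
    ocr_images "$@"
fi
```

Saved as an executable file, it is called the same way as above, and USE_UNPAPER=1 ./ocr.sh outputfile *.JPG switches on the cleanup pass. Text is appended to the output file, so running it twice over the same images duplicates their text.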
If you're into forensics you could easily do something like
find / ... -exec ./ocr.sh /imagestrings.txt {} \;
although you may want to add a cleanup routine. That's all for now.
Defined tags for this entry: bash, computer forensics, dictionary, forensics, imagemagick, linux, mac, ocr, password, programming, scanned, script, strings, tesseract, tidbits, unpaper