Thursday, March 13, 2008

Tesseract OCR

Tesseract is a very powerful OCR package that works only from the command line. ImageMagick is a very powerful image conversion toolkit. To OCR a PDF, convert it to a TIFF with ImageMagick, then run Tesseract on the result:

convert inputfile.pdf convertedfile.tiff
tesseract convertedfile.tiff textfile

Tesseract adds the .txt extension itself, so the recognized text ends up in textfile.txt. That's amazingly easy.
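If you do this often, the two steps can be wrapped in a small shell function. This is only a rough sketch: the names pdf_base and ocr_pdf are made up for illustration, and the 300 dpi density is my own assumption (ImageMagick's default rasterization resolution is usually too coarse for good OCR), so adjust to taste.

```shell
#!/bin/sh
# Sketch only: pdf_base and ocr_pdf are hypothetical helper names.

# Strip a trailing .pdf so the same base name can be reused for outputs.
pdf_base() {
    printf '%s\n' "${1%.pdf}"
}

# Convert the PDF to TIFF, then OCR the TIFF.
ocr_pdf() {
    base=$(pdf_base "$1")
    # -density sets the resolution used to rasterize the PDF (300 is an assumed value).
    convert -density 300 "$1" "$base.tiff" &&
    # tesseract appends .txt to the output base name on its own.
    tesseract "$base.tiff" "$base"
}
```

Calling ocr_pdf inputfile.pdf should leave inputfile.tiff and inputfile.txt next to the original, provided both tools are installed and on your PATH.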
