I\'ve been using tesseract to convert documents into text. The quality of the documents ranges wildly, and I\'m looking for tips on what sort of image processing might impr
What was EXTREMLY HELPFUL to me on this way are the source codes for Capture2Text project. http://sourceforge.net/projects/capture2text/files/Capture2Text/.
BTW: Kudos to it's author for sharing such a painstaking algorithm.
Pay special attention to the file Capture2Text\SourceCode\leptonica_util\leptonica_util.c - that's the essence of image preprocession for this utility.
If you will run the binaries, you can check the image transformation before/after the process in Capture2Text\Output\ folder.
P.S. mentioned solution uses Tesseract for OCR and Leptonica for preprocessing.