What would you recommend for recognizing all characters from a screenshot? The screenshot is perfectly clear (only black text on a white background), also I can choose any s
Since this is the first result on Google for "tesseract recognize screenshot", let me do a bit of necromancy and add a much simpler solution.
Tesseract expects images at around 300 dpi or more, and the standard dpi for Windows is 96, which means you need to rescale the image to about 300%. After that, the results improve dramatically.
100%
Result: Whal would you recommend for recognizing all characters from a screensnor 7
200%
Result: What would you recommend for recognizing all chamcters from a screenth ?
300%
Result: What would you recommend for recognizing all characters from a screenshot ?
Anything above 300% works just as well.
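For anyone who wants to script the rescaling, here is a minimal sketch, assuming Python with the Pillow library installed. The function names and parameters are hypothetical, and Lanczos resampling is my own choice rather than something Tesseract requires. The scale factor comes straight from the 96 dpi and 300 dpi figures above.

```python
def scale_factor(source_dpi=96, target_dpi=300):
    """Factor needed to bring a source_dpi image up to target_dpi."""
    return target_dpi / source_dpi


def upscaled_size(size, factor):
    """Return (width, height) scaled by factor, rounded to whole pixels."""
    width, height = size
    return (round(width * factor), round(height * factor))


def upscale_screenshot(src_path, dst_path, source_dpi=96, target_dpi=300):
    """Upscale a screenshot and record the target dpi in the saved file."""
    # Pillow is imported here so the two helpers above work without it.
    from PIL import Image

    factor = scale_factor(source_dpi, target_dpi)
    with Image.open(src_path) as img:
        resized = img.resize(upscaled_size(img.size, factor), Image.LANCZOS)
        # Assumption: the output format (e.g. PNG) can store a dpi value.
        resized.save(dst_path, dpi=(target_dpi, target_dpi))
    return factor
```

With the defaults the factor is 3.125, i.e. roughly the 300% used above. The saved file can then be passed to Tesseract as usual.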
I would be surprised if OCR gave such bad results on such good-quality input. You probably want to choose a font with sharp edges and no anti-aliasing; a bigger font size would also help.
Also, if acceptable, try the OCR font given in this SO question:
This should give you the best possible results - if this doesn't go 100%, then I don't know what will...
Don't know what you tried besides Tesseract, but if you haven't already, it might be worth trying some others. These seem to have been updated recently (Tesseract was updated a year ago):
There are some online versions, too, such as:
that you can use to test a sample document. From this link:
it seems that you might need to go commercial to get what you want.
Hope this helps.
You can use ABBYY FineReader 12.0 to extract text from PDFs and/or screenshot images and save it directly in your desired file format.
See: ABBYY FineReader 15 - Free Trial
Do you have the option to change text anti-aliasing at the OS level? Playing around with those settings (or even turning anti-aliasing off) might give you better results with existing OCR engines too.
I know you already solved your problem, but in case this helps someone else: I found that OCR engines are sensitive to two issues when dealing with screenshots: (1) resolution set incorrectly in the image file headers, and (2) transparency (what looks like a white background is actually marked transparent). For some reason these problems occur often in screenshot images.
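Both issues can be worked around before the image reaches the OCR engine. Below is a rough sketch, assuming Python with the Pillow library installed; the function names are hypothetical, and the dpi default of 96 is only an assumption that you should replace with the real value for your screenshots. The first helper shows what flattening does to a single pixel, and the second applies the same idea to a whole file and rewrites the dpi header.

```python
def flatten_pixel(rgba, background=(255, 255, 255)):
    """Alpha-composite one RGBA pixel onto an opaque background color."""
    r, g, b, a = rgba
    alpha = a / 255
    return tuple(
        round(channel * alpha + bg * (1 - alpha))
        for channel, bg in zip((r, g, b), background)
    )


def flatten_screenshot(src_path, dst_path, dpi=96):
    """Paste a screenshot onto solid white and store an explicit dpi value."""
    # Pillow is imported here so flatten_pixel works without it.
    from PIL import Image

    with Image.open(src_path) as img:
        rgba = img.convert("RGBA")
        white = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
        # Transparent areas take the white background; opaque text is unchanged.
        flat = Image.alpha_composite(white, rgba).convert("RGB")
        # Assumption: dpi is the true resolution of the screenshot.
        flat.save(dst_path, dpi=(dpi, dpi))
```

A fully transparent pixel comes out white and a fully opaque one keeps its color, which is what the OCR engine should have seen in the first place.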
Also, aside from Tesseract, another possibility is to try the API at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml based on the ABBYY OCR engine. (The advantage is that there's nothing to install or configure to check whether it works on your images: just make an HTTP POST.) Disclaimer: WiseTrend is my company's customer.