I'm trying to get Tesseract (via the Tess4J wrapper) to match only a specific pattern: four digits in a row, which I think would be \d\d\d\d. Here is a VERY small subset of the image I'm feeding Tesseract (the floor plans are restricted, so I'm hesitant to post much more of it): http://mike724.com/view/a06771
I'm using the following Java code:
File imageFile = new File("/<redacted>/file.pdf");
Tesseract instance = Tesseract.getInstance();
instance.setTessVariable("load_system_dawg", "F");
instance.setTessVariable("load_freq_dawg", "F");
instance.setTessVariable("user_words_suffix", "");
instance.setTessVariable("user_patterns_suffix", "\\d\\d\\d\\d");
try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}
The problem I'm running into is that Tesseract does not seem to honor these configuration options: I still get text/words in the results. I expect to get only the room numbers (e.g. 2950).
You have not configured this correctly.
user_patterns_suffix is meant to hold the file extension of a text file that contains your patterns, not the pattern itself. For example, the setting
user_patterns_suffix pats
means you need to put the following file in the Tesseract tessdata folder:
tessdata/eng.pats
(assuming eng is the language you are using).
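The pattern file described above can be created by hand or, as a minimal sketch, from Java using only the standard library. The `tessdata` location is a placeholder for your real folder, and the `PatternFileWriter` class and `writePatternFile` method are made-up names for illustration. The `\d` digit token is taken from the question and should be checked against the pattern syntax your Tesseract version accepts.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PatternFileWriter {

    // Writes <tessdataDir>/<language>.<suffix> with one pattern per line.
    // The directory, language, and suffix are assumptions about your setup.
    public static Path writePatternFile(Path tessdataDir, String language,
                                        String suffix, List<String> patterns)
            throws IOException {
        Files.createDirectories(tessdataDir);
        Path file = tessdataDir.resolve(language + "." + suffix);
        Files.write(file, patterns, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical tessdata path; replace with your real one.
        Path tessdata = Paths.get("tessdata");
        // Backslashes are doubled in Java source; the file receives \d\d\d\d.
        Path written = writePatternFile(tessdata, "eng", "pats",
                Arrays.asList("\\d\\d\\d\\d"));
        System.out.println("Wrote " + written);
    }
}
```

The suffix passed here has to match the value given to user_patterns_suffix, otherwise Tesseract has no way to locate the file.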
See more here:
I do recall that a user pattern must contain at least 6 fixed characters before the pattern part, so you may not be able to accomplish this in any case, but try the correct configuration first.
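If the pattern file turns out not to constrain the output, one fallback is to filter the recognized text afterwards. This is a different technique from user patterns and is not something the answer above proposes: it is a minimal sketch that runs a plain `java.util.regex` match over the string returned by `doOCR`. The `RoomNumberFilter` class, the `extractFourDigitRuns` method, and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoomNumberFilter {

    // Exactly four digits, not preceded or followed by another digit.
    private static final Pattern FOUR_DIGITS =
            Pattern.compile("(?<!\\d)\\d{4}(?!\\d)");

    // Returns every standalone four-digit run found in the OCR output.
    public static List<String> extractFourDigitRuns(String ocrText) {
        List<String> matches = new ArrayList<>();
        Matcher m = FOUR_DIGITS.matcher(ocrText);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by instance.doOCR(imageFile).
        String sample = "OFFICE 2950\nSTORAGE 12345\nLAB 2951A";
        System.out.println(extractFourDigitRuns(sample)); // prints [2950, 2951]
    }
}
```

The lookarounds keep longer digit runs such as 12345 from producing a partial match. Filtering like this does not improve recognition itself, it only discards everything that is not a four-digit number.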
They look like init-only parameters; as such, they need to be in a config file (for instance, one named bazaar) placed under the configs folder, to be passed into the setConfigs method.
instance.setConfigs(Arrays.asList("bazaar"));
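A minimal sketch of what such a config file could contain, written from Java with the standard library only. The settings are the ones from the question, with user_patterns_suffix set to the pats extension used in the first answer. The `tessdata` path and the `ConfigFileWriter` class are placeholders, and the layout assumed here is one parameter per line: name, a space, then the value.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfigFileWriter {

    // Writes <tessdataDir>/configs/<configName> with one "name value" per line.
    public static Path writeConfig(Path tessdataDir, String configName,
                                   Map<String, String> params) throws IOException {
        Path configsDir = tessdataDir.resolve("configs");
        Files.createDirectories(configsDir);
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        Path file = configsDir.resolve(configName);
        Files.write(file, lines, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("load_system_dawg", "F");
        params.put("load_freq_dawg", "F");
        params.put("user_patterns_suffix", "pats");
        // Hypothetical tessdata path; replace with your real one.
        Path written = writeConfig(Paths.get("tessdata"), "bazaar", params);
        System.out.println("Wrote " + written);
    }
}
```

Once the file exists, the setConfigs call shown above should pick it up, provided Tess4J is pointed at the same tessdata folder.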
References:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc
https://github.com/tesseract-ocr/tesseract/wiki/ControlParams
http://tess4j.sourceforge.net/docs/docs-1.4/
Source: https://stackoverflow.com/questions/27883090/forcing-tesseract-to-match-pattern-four-digits-in-a-row