问题
I'm using tesseract to recognize a serial number. This works acceptable, common problem like false recognition of zero and "O", 6 and 5, or M and H exists. Beside by this tesseract adds spaces to the recognized words, where no space is in the image. The following image is recognized as "HI 3H".
This image results in " FBKHJ 1R1"
So tesseract added a space, although there isn't really a space in the image. Is there a possibility parametrize the spacing behavior of tesseract?
Edit
I'm sorry, have forgot to add, that I also have serial numbers which include spaces. So I cannot delete all spaces inside the recognized serial number.
For example the following image containing a space in the serial number results after tesseract recognition into: J4 F1583BB. Beside that the recognition of the characters is false, the space is recognized correct with this image.
My actual parameters for tesseract are:
tesseract::TessBaseAPI tess;
tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY);
tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);
tess.SetVariable("tessedit_char_whitelist",
"ABCDEFGHIJKLMNOPQRSTUVWXYZ012345789");
char* out = tess.GetUTF8Text();
string text = string(out);
Edit
It is notices from already existing answers, that the space between the "J" and "I" for example seems to be little more, than between the other characters. The font-type I have chosen is a Monotype Font. Reason for this is that I thought, that this helps tesseract for character recognition. Drawback of such a Monospace font-type, where every character has the same width, is that the kernel (the space between the characters) varies. See example image of following source Source
Which font type do you think, will achieve better recognition results?
回答1:
Adjusting parameter tosp_min_sane_kn_sp
may help. I solved the problem by doing it.
If it doesn't help, you may try other tosp_*
paramters, or working around the space source code "tospace.cpp"
回答2:
I'm not C++ programmer but i think that it's possible to calibrate the width of each letter space. I found this parameter "textord_space_size_is_variable" in this site, and it says "If true, word delimiter spaces are assumed to have variable width, even though characters have fixed pitch."
Good luck! :)
来源:https://stackoverflow.com/questions/31072452/tesseract-false-space-recognition