Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

伪装坚强ぢ 2020-12-17 01:03

In the Tesseract FAQ they say you can:

How can I get the coordinates and confidence of each character?

There are two options. If

2 answers
  • 2020-12-17 01:17

    You've seen it: it isn't there.

    So you can either modify the Tesseract source code so that its hOCR output includes the x_confs property you want, or use the ResultIterator API class to get the confidence at the character (symbol) level. Be sure to call SetVariable("save_blob_choices", "T") after the Init method.
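
As a rough illustration of the ResultIterator route, here is a minimal Python sketch. It assumes the third-party tesserocr bindings (not mentioned in the answer), which wrap the same iterator API. The helper names `collect_symbols` and `ocr_symbols` are hypothetical, and the code has not been checked against every Tesseract version.

```python
from typing import Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def collect_symbols(steps: Iterable, level) -> List[Tuple[str, float, Optional[Box]]]:
    """Gather (text, confidence, box) for every non-empty iterator position."""
    results = []
    for r in steps:
        text = r.GetUTF8Text(level)
        if not text:
            continue  # skip positions that produced no text
        results.append((text, r.Confidence(level), r.BoundingBox(level)))
    return results


def ocr_symbols(image_path: str):
    """Run OCR on one image and return per-symbol results."""
    # Imported lazily so the helper above works without tesserocr installed.
    # Assumption: tesserocr and a matching Tesseract build are available.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:  # the constructor initialises the engine
        api.SetVariable("save_blob_choices", "T")  # set after Init, as advised above
        api.SetImageFile(image_path)
        api.Recognize()
        steps = iterate_level(api.GetIterator(), RIL.SYMBOL)
        return collect_symbols(steps, RIL.SYMBOL)
```

Calling `ocr_symbols("page.png")` (a placeholder file name) should return one tuple per recognised character, with the confidence expressed as a percentage.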

  • 2020-12-17 01:22

    This now seems to be available in Tesseract 4.x.

    See my answer at:

    https://stackoverflow.com/a/57766860/1021819

    Set hocr_char_boxes to 1 in your config file. Or, at the command line, your updated command would be:

    tesseract [Image name] outputbase --oem 1 -l eng --psm 8 -c hocr_char_boxes=1 hocr

    Note the hocr output option and look in that file for ..._wconf, e.g.

    Let me know if this works for you, otherwise I'll just delete the answer.

    Source: https://github.com/tesseract-ocr/tesseract/issues/1465#issuecomment-513139976
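
Once the hOCR file has been written, the per-character entries can be pulled out with a few lines of Python. This is only a sketch using the standard library. It assumes each character is wrapped in a span with class `ocrx_cinfo` whose title has the form `x_bboxes L T R B; x_conf C`, which is what Tesseract 4.1 appears to emit; check your own output file before relying on it. The names `CharBoxParser` and `parse_char_boxes` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed title layout: "x_bboxes left top right bottom; x_conf confidence"
_TITLE_RE = re.compile(
    r"x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);\s*x_conf\s+([\d.]+)"
)


class CharBoxParser(HTMLParser):
    """Collects (character, bounding box, confidence) from ocrx_cinfo spans."""

    def __init__(self):
        super().__init__()
        self.chars = []
        self._pending = None  # (box, conf) of the span currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_cinfo":
            match = _TITLE_RE.search(attrs.get("title") or "")
            if match:
                *box, conf = match.groups()
                self._pending = (tuple(int(v) for v in box), float(conf))

    def handle_data(self, data):
        # Text directly inside an ocrx_cinfo span is the character itself.
        if self._pending is not None and data.strip():
            box, conf = self._pending
            self.chars.append((data.strip(), box, conf))
            self._pending = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._pending = None


def parse_char_boxes(hocr_text):
    """Return a list of (char, (left, top, right, bottom), confidence)."""
    parser = CharBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.chars
```

For the command above, the file to read would be `outputbase.hocr`, for example `parse_char_boxes(open("outputbase.hocr", encoding="utf-8").read())`.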
