Page layout analysis using Tesseract?


Tesseract 3 is able to perform page layout analysis. However, I couldn't find any sample code or documentation on how to use the library for such purposes. I hope someone h


4 Answers

一整个雨季

2020-12-23 11:16

Not sure if this exactly answers your question, but I landed here looking for a way to get bounding-box coordinates (and, optionally, the text recognised inside each box) for an input image. This is now possible with the tesseract command-line tool.

$> tesseract test.tiff test.txt -l eng -psm 1 tsv

The parameters to notice in the snippet above are 'psm' and 'tsv'. 'psm' selects the page segmentation mode, and 'tsv' produces a tab-separated output file with all the information you'd need about the text in your image (page, block and line numbers, bbox coordinates, confidence, predicted text), as shown below:

level  page_num  block_num  par_num  line_num  word_num  left  top  width  height  conf  text
1      1         0          0        0         0         0     0    5500   4250    -1
2      1         1          0        0         0         327   285  2218   53      -1
3      1         1          1        0         0         327   285  2218   53      -1
4      1         1          1        1         0         327   285  2218   53      -1
5      1         1          1        1         1         327   285  246    38      87    INFOPAC
5      1         1          1        1         2         620   287  165    38      87    PAGE
5      1         1          1        1         3         952   290  100    37      95    NAME
5      1         1          1        1         4         1173  292  1082   45      39    ENTRYDATE
5      1         1          1        1         5         2333  302  212    36      48    EMAIL
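If you want to consume that file from code, here is a minimal sketch of a reader, assuming the file is tab-separated with the header line shown above and that level 5 rows are the word rows (as in the sample). `TsvRow`, `split_tabs` and `parse_word_rows` are names I made up for illustration; they are not part of Tesseract.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One row of the TSV output; field names follow the header line.
// Hypothetical type for illustration, not a Tesseract type.
struct TsvRow {
    int level, page_num, block_num, par_num, line_num, word_num;
    int left, top, width, height;
    float conf;
    std::string text;  // empty for page/block/paragraph/line rows
};

// Split a line on tab characters, keeping empty fields.
std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}

// Parse the TSV text and return only the word rows (level == 5).
// Assumes numeric fields are well formed; std::stoi/std::stof throw otherwise.
std::vector<TsvRow> parse_word_rows(const std::string& tsv) {
    std::vector<TsvRow> words;
    std::istringstream in(tsv);
    std::string line;
    bool is_header = true;
    while (std::getline(in, line)) {
        if (is_header) { is_header = false; continue; }  // skip column names
        if (!line.empty() && line.back() == '\r') line.pop_back();
        std::vector<std::string> f = split_tabs(line);
        if (f.size() < 11) continue;  // blank or malformed line
        TsvRow r;
        r.level = std::stoi(f[0]);
        r.page_num = std::stoi(f[1]);
        r.block_num = std::stoi(f[2]);
        r.par_num = std::stoi(f[3]);
        r.line_num = std::stoi(f[4]);
        r.word_num = std::stoi(f[5]);
        r.left = std::stoi(f[6]);
        r.top = std::stoi(f[7]);
        r.width = std::stoi(f[8]);
        r.height = std::stoi(f[9]);
        r.conf = std::stof(f[10]);
        r.text = f.size() > 11 ? f[11] : "";
        if (r.level == 5) words.push_back(r);
    }
    return words;
}
```

Read the whole file into a string, pass it to `parse_word_rows`, and each returned row carries the word text together with its left/top/width/height box.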


青春惊慌失措

2020-12-23 11:23

First, initialize a TessBaseAPI instance. Use Init() if you also want to run text recognition afterwards, or InitForAnalysePage() if you are only interested in the text boxes.

Second, set the image using SetImage().

Finally, call AnalyseLayout() to get a PageIterator, which provides the text boxes.

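#include <tesseract/baseapi.h>

tesseract::TessBaseAPI tessApi;
tessApi.InitForAnalysePage();

// tessApi.SetImage(...);

// AnalyseLayout() returns nullptr on error or for an empty page
tesseract::PageIterator *iter = tessApi.AnalyseLayout();

// Instead of RIL_WORD you can use any other PageIteratorLevel
// (RIL_BLOCK, RIL_PARA, RIL_TEXTLINE, RIL_SYMBOL)
if (iter != nullptr) {
    // do/while: the iterator already points at the first element,
    // so calling Next() before reading would skip it
    do {
        int left, top, right, bottom;

        iter->BoundingBox(
                tesseract::RIL_WORD,
                &left, &top, &right, &bottom
        );
    } while (iter->Next(tesseract::RIL_WORD));

    delete iter;  // the caller owns the returned iterator
}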
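Note that BoundingBox() gives you left/top/right/bottom edges, whereas the TSV output in the other answer reports left/top/width/height. A small sketch of the conversion, in case you want both sources in the same shape; `Box` and `to_box` are hypothetical helpers of mine, not part of the Tesseract API.

```cpp
// Hypothetical helper type: a box in the left/top/width/height form
// used by the TSV output.
struct Box {
    int left;
    int top;
    int width;
    int height;
};

// Convert the four edges filled in by PageIterator::BoundingBox()
// into left/top/width/height.
Box to_box(int left, int top, int right, int bottom) {
    return Box{left, top, right - left, bottom - top};
}
```

Inside the loop you would call `to_box(left, top, right, bottom)` right after BoundingBox() and store the result, for example in a std::vector<Box>.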


佛祖请我去吃肉

2020-12-23 11:34
Tesseract accepts a page segmentation mode parameter (-psm), which can take the following values:
- 0 = Orientation and script detection (OSD) only.
- 1 = Automatic page segmentation with OSD.
- 2 = Automatic page segmentation, but no OSD or OCR.
- 3 = Fully automatic page segmentation, but no OSD. (Default)
- 4 = Assume a single column of text of variable sizes.
- 5 = Assume a single uniform block of vertically aligned text.
- 6 = Assume a single uniform block of text.
- 7 = Treat the image as a single text line.
- 8 = Treat the image as a single word.
- 9 = Treat the image as a single word in a circle.
- 10 = Treat the image as a single character.
Example:
```
tesseract image.tif image.txt -l eng -psm 0
```
However, I am not sure that it is possible to use the layout analysis in standalone mode.
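If you launch the command-line tool from your own program, a sketch like the following can assemble the arguments and reject mode numbers outside the list above. `build_tesseract_args` and its parameters are my own hypothetical names. The `double_dash_psm` switch is an assumption on my part: Tesseract 3 takes `-psm` as shown above, while newer releases, as far as I know, spell the flag `--psm`, so check which one your installed version expects.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Build the argument list for a tesseract command-line call.
// Hypothetical helper for illustration; it only assembles strings
// and does not run anything.
std::vector<std::string> build_tesseract_args(const std::string& image,
                                              const std::string& output_base,
                                              const std::string& lang,
                                              int psm,
                                              bool double_dash_psm) {
    // Only the modes 0..10 listed in this answer are accepted here.
    if (psm < 0 || psm > 10) {
        throw std::out_of_range("psm must be between 0 and 10");
    }
    return {
        "tesseract", image, output_base,
        "-l", lang,
        double_dash_psm ? "--psm" : "-psm",  // assumption: flag spelling depends on version
        std::to_string(psm),
    };
}
```

For example, `build_tesseract_args("image.tif", "image", "eng", 1, false)` yields the pieces of `tesseract image.tif image -l eng -psm 1`, which you can then hand to whatever process-spawning call your platform provides.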
隐瞒了意图╮

2020-12-23 11:38
There is an option since 3.04:
```
tesseract -c preserve_interword_spaces=1 test.tif test
```
Here is a reference to what looks like the related development thread.