Page layout analysis using Tesseract?

一个人的身影 2020-12-23 10:40

Tesseract 3 is able to perform page layout analysis. However, I couldn't find any sample code or documentation on how to use the library for such purposes. I hope someone h

4 Answers
  • 2020-12-23 11:16

    Not sure if this exactly answers your question, but I landed here looking for a way to get the bounding-box coordinates (and, optionally, the text recognised inside each box) for an input image. Tesseract can now do this:

    $> tesseract test.tiff test.txt -l eng -psm 1 tsv
    

    The parameters to note in the command above are 'psm' and 'tsv'. 'psm' selects the page segmentation mode, and 'tsv' writes a tab-separated output file with all the information you need about the text in the image (page, block and line numbers, bounding-box coordinates, confidence, recognised text), as shown below. In Tesseract 4 and later the flag is spelled '--psm'.

    level  page_num  block_num  par_num  line_num  word_num  left  top  width  height  conf  text
    1      1         0          0        0         0         0     0    5500   4250    -1
    2      1         1          0        0         0         327   285  2218   53      -1
    3      1         1          1        0         0         327   285  2218   53      -1
    4      1         1          1        1         0         327   285  2218   53      -1
    5      1         1          1        1         1         327   285  246    38      87    INFOPAC
    5      1         1          1        1         2         620   287  165    38      87    PAGE
    5      1         1          1        1         3         952   290  100    37      95    NAME
    5      1         1          1        1         4         1173  292  1082   45      39    ENTRYDATE
    5      1         1          1        1         5         2333  302  212    36      48    EMAIL
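
    To use this output from a program, the word-level rows (level 5) can be read back into simple structs. The following is only a sketch: it assumes the 12-column, tab-separated layout shown in the sample, and the names WordBox, splitTabs and parseWordBoxes are made up for illustration and are not part of Tesseract.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One word-level row (level == 5) of the TSV output.
struct WordBox {
    int left, top, width, height;
    double conf;       // read as a double in case a version prints fractions
    std::string text;
};

// Split a line on tab characters. Empty fields between tabs are kept;
// a single empty field at the very end of the line is dropped.
static std::vector<std::string> splitTabs(const std::string &line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) {
        fields.push_back(field);
    }
    return fields;
}

// Read TSV text and return the word rows. The header row and the
// page/block/paragraph/line rows are skipped because their first
// column is not "5".
std::vector<WordBox> parseWordBoxes(std::istream &in) {
    std::vector<WordBox> words;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();  // tolerate Windows line endings
        }
        std::vector<std::string> f = splitTabs(line);
        if (f.size() < 12 || f[0] != "5") {
            continue;
        }
        WordBox w;
        w.left = std::stoi(f[6]);
        w.top = std::stoi(f[7]);
        w.width = std::stoi(f[8]);
        w.height = std::stoi(f[9]);
        w.conf = std::stod(f[10]);
        w.text = f[11];
        words.push_back(w);
    }
    return words;
}
```

    Passing a std::ifstream opened on the generated .tsv file to parseWordBoxes() would then yield one WordBox per recognised word, which can be filtered by conf if low-confidence words should be ignored.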
    
  • 2020-12-23 11:23

    First, initialize a TessBaseAPI instance. You can use either Init() (if you also want to perform text recognition) or InitForAnalysePage() (if you are only interested in the text boxes).

    Second, set the image using SetImage().

    Finally, call AnalyseLayout() to get a PageIterator, which gives you the text boxes.

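    #include <tesseract/baseapi.h>

    tesseract::TessBaseAPI tessApi;
    tessApi.InitForAnalysePage();

    // tessApi.SetImage(...);

    // Returns NULL on error or for an empty page; the caller owns the iterator.
    tesseract::PageIterator *iter = tessApi.AnalyseLayout();

    if (iter != NULL) {
        // Instead of RIL_WORD you can use any other PageIteratorLevel
        // (RIL_BLOCK, RIL_PARA, RIL_TEXTLINE, RIL_SYMBOL).
        // do/while, because the iterator already points at the first element.
        do {
            int left, top, right, bottom;

            iter->BoundingBox(
                    tesseract::RIL_WORD,
                    &left, &top, &right, &bottom
            );
        } while (iter->Next(tesseract::RIL_WORD));

        delete iter;
    }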
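
    A PageIterator is already positioned on the first element when it is returned, so a do/while loop is needed to visit every box. The helper below is a sketch of that pattern. It is written as a template so that it compiles without the Tesseract headers; Box and collectBoxes are hypothetical names, and the only things assumed about the iterator are the BoundingBox(level, &left, &top, &right, &bottom) and Next(level) members used in the snippet above.

```cpp
#include <vector>

// Bounding box in image pixel coordinates, as filled in by BoundingBox().
struct Box {
    int left, top, right, bottom;
};

// Walk an iterator at the given level and collect every bounding box.
// With Tesseract, Iterator would be tesseract::PageIterator and Level
// would be tesseract::PageIteratorLevel.
template <typename Iterator, typename Level>
std::vector<Box> collectBoxes(Iterator &it, Level level) {
    std::vector<Box> boxes;
    // The iterator starts on the first element, so Next() is tested
    // after the loop body rather than before it.
    do {
        Box b;
        // BoundingBox() reports failure when there is no element at the
        // current position; such positions are skipped.
        if (it.BoundingBox(level, &b.left, &b.top, &b.right, &b.bottom)) {
            boxes.push_back(b);
        }
    } while (it.Next(level));
    return boxes;
}
```

    With Tesseract this would be called as collectBoxes(*iter, tesseract::RIL_WORD), after checking that AnalyseLayout() did not return NULL, and the iterator would still have to be deleted afterwards.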
    
  • 2020-12-23 11:34

    Tesseract accepts a page segmentation mode parameter (-psm), which can take the following values:

    • 0 = Orientation and script detection (OSD) only.
    • 1 = Automatic page segmentation with OSD.
    • 2 = Automatic page segmentation, but no OSD, or OCR
    • 3 = Fully automatic page segmentation, but no OSD. (Default)
    • 4 = Assume a single column of text of variable sizes.
    • 5 = Assume a single uniform block of vertically aligned text.
    • 6 = Assume a single uniform block of text.
    • 7 = Treat the image as a single text line.
    • 8 = Treat the image as a single word.
    • 9 = Treat the image as a single word in a circle.
    • 10 = Treat the image as a single character.

    Example:

    tesseract image.tif image.txt -l eng -psm 0
    

    However, I am not sure that it is possible to use the layout analysis in standalone mode.

  • 2020-12-23 11:38

    Since 3.04 there is an option that preserves the spacing between words in the text output:

    tesseract -c preserve_interword_spaces=1 test.tif test
    

    Here is a reference to what looks like the related development thread.
