How do I segment a document using Tesseract then output the resulting bounding boxes and labels

后端 未结 6 2036
忘了有多久
忘了有多久 2020-12-07 10:25

I\'m trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this \'out of the

相关标签:
6条回答
  • 2020-12-07 11:00

    With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr will create a basefilename.hocr file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without the hocr config creates a text file with newlines between block-level text, but the hocr format is more explicit.

    More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs

    0 讨论(0)
  • 2020-12-07 11:02

    If you are python familiar, you can directly use tesserocr library which is a nice python wrapper around the C++ API. Here is a code snippet to draw polygons at block level using PIL:

    from PIL import Image, ImageDraw
    from tesserocr import PyTessBaseAPI, RIL, iterate_level, PSM
    
    img = Image.open(filename)
    
    results = []
    with PyTessBaseAPI() as api:
        api.SetImage(img)
        api.SetPageSegMode(PSM.AUTO_ONLY)
        iterator = api.AnalyseLayout()
        for w in iterate_level(iterator, RIL.BLOCK):
            if w is not None:
                results.append((w.BlockType(), w.BlockPolygon()))
    print('Found {} block elements.'.format(len(results)))
    
    draw = ImageDraw.Draw(img)
    for block_type, poly in results:
        # you can define a color per block type (see tesserocr.PT for block types list)
        draw.line(poly + [poly[0]], fill=(0, 255, 0), width=2)
    
    0 讨论(0)
  • 2020-12-07 11:07

    The HOCR individual character step is now available in Tesseract since 4.1. Once the installation check, use :

    tesseract {image file} {output name} -c tessedit_create_hocr=1 -c hocr_char_boxes=1

    0 讨论(0)
  • 2020-12-07 11:15

    You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.

    0 讨论(0)
  • 2020-12-07 11:18

    Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.

    Below I give the full solution for a Mac running 10.10 and using the homebrew package manager. I use wine to run windows executables.

    Overview

    1. Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
    2. Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
    3. Use the PVT to view the original image with the PAGE xml information overlaid

    Code

    brew install wine  # takes a little while >10m
    brew install gs    # only for generating a tif example. Not required, you can use Preview
    brew install wget  # only for downloading example paper. Not required, you can do so manually!
    cd ~/Downloads
    wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
    # This command can be ommitted and you can do the conversion to tiff with Preview
    gs                          \
      -o paper-%d.tif           \
      -sDEVICE=tiff24nc         \
      -r300x300                 \
       paper.pdf 
    
    cd ~/Downloads
    # ttptool is the location you downloaded the Tesseract to PAGE tool to
    ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
    # sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
    touch "$ttptool/log.txt"
    wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"   \
      -inp-img "$dl/Downloads/paper-3.tif"           \
      -out-xml "$dl/Downloads/paper-3-tool.xml"      \
      -rec-mode layout>>log.txt
    
    # pvtool is the location you downloaded the PAGE Viewer tool to
    pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
    cd "$pvtool"
    dl=~
    java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"
    

    Results

    Document with overlays (rollover to see text and type) Doc with overlays Overlays alone (use GUI buttons to toggle) Overlays alone

    Appendix

    You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!

    # Note that the pvtool does take as input HOCR xml but it ignores the region type
    brew install tesseract --devel  # installs v 3.03 at time of writing
    tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
    mv paper-3.hocr paper-3.xml  # The page viewer will only open XML files
    java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"
    

    At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:

    pctool="/Users/Me/Project/tools/JPageConverter 1.0"
    java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST
    

    Unfortunately, I kept getting null pointers.

    Could not convert to target XML schema format.
    java.lang.NullPointerException
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
    Could not save target PAGE XML file: paper-3-hocrconvert.xml
    java.lang.NullPointerException
        at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
    
    0 讨论(0)
  • 2020-12-07 11:18

    Shortcut

    It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.

    0 讨论(0)
提交回复
热议问题