Tesseract receipt scanning advice needed

暗喜 2020-12-23 10:46

I have struggled off and on again with Tesseract for various OCR projects and I found a use case today which I thought would be a slam dunk for it but after many hours I am

2 Answers
  • 2020-12-23 11:07

    Text recognition on receipts is one of the hardest problems for OCR to handle.

    The reasons are numerous:

    • receipts are printed on cheap paper with cheap printers; the goal is low cost, not readability
    • they contain a very large amount of dense text (especially Wal-Mart receipts)
    • existing OCR engines are almost exclusively trained on non-receipt data (books, documents, etc.)
    • receipt structure, which is something between tabular and freeform, is hard for any layout engine to handle.

    Your best bet is to perform the following:

    • Analyse the input images. If they are hard to read by eye, they are hard for Tesseract to read as well.
    • Perform additional image preprocessing. Image scaling (0.5x, 1.5x, 2x) sometimes helps a lot. Removing noise also helps.
    • Train Tesseract. It's not that hard to do :)
    • Postprocess the OCR result to recover the layout.
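
    One way to try the scaling step from the list above is to generate the resize and OCR commands for each factor and compare the outputs by hand. This is only a rough Python sketch: the output file naming, the `build_commands` and `run_all` helpers, and page segmentation mode 6 are assumptions for illustration, and it presumes ImageMagick's `convert` and a Tesseract 4+ binary are on the PATH.

```python
import subprocess
from pathlib import Path

SCALES = (50, 150, 200)  # percent, i.e. the 0.5x, 1.5x and 2x factors


def build_commands(image, percent, psm=6):
    """Return the (resize, ocr) command lines for one scale factor."""
    src = Path(image)
    # Hypothetical naming scheme: receipt.png -> receipt_150.png, receipt_150_ocr
    scaled = src.with_name(f"{src.stem}_{percent}.png")
    ocr_base = src.with_name(f"{src.stem}_{percent}_ocr")
    resize = ["convert", str(src), "-resize", f"{percent}%", str(scaled)]
    # Tesseract 4+ spells the option --psm; 3.x releases used -psm.
    ocr = ["tesseract", str(scaled), str(ocr_base), "--psm", str(psm)]
    return resize, ocr


def run_all(image):
    """Run every scale factor; needs ImageMagick and Tesseract installed."""
    for percent in SCALES:
        for cmd in build_commands(image, percent):
            subprocess.run(cmd, check=True)
```

    Calling `run_all("receipt.png")` would leave one OCR text file per scale factor, which makes it easy to see which factor reads best for a given printer and camera.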

    Layout is best recovered by analysing the geometry of the results, not by regexes. Regexes break down when the OCR output contains errors. Using geometry, you can, for example, find a good candidate for a UPC number, draw a line through the centers of its characters, and then you know exactly which price belongs to that UPC.
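
    As a rough illustration of the geometric idea, here is a minimal Python sketch. It assumes the OCR engine returns one bounding box per word and that the image is already deskewed, so the "line through the centers" is simply a horizontal line at the UPC's vertical center. The `Word` class, the half-height tolerance and the sample data are all made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def cy(self):
        """Vertical center of the bounding box."""
        return self.y + self.h / 2


UPC = re.compile(r"^\d{12}$")
PRICE = re.compile(r"^\d+\.\d{2}$")


def match_prices(words):
    """Pair each UPC-like word with the price lying on the same row."""
    prices = [w for w in words if PRICE.match(w.text)]
    pairs = []
    for upc in (w for w in words if UPC.match(w.text)):
        # Candidates: prices to the right whose center is within
        # half the UPC's height of the UPC's own center line.
        same_row = [p for p in prices
                    if p.x > upc.x and abs(p.cy - upc.cy) <= upc.h / 2]
        if same_row:
            best = min(same_row, key=lambda p: abs(p.cy - upc.cy))
            pairs.append((upc.text, best.text))
    return pairs


words = [
    Word("OATMEAL", 10, 100, 80, 20),
    Word("007874243408", 100, 101, 120, 20),
    Word("1.98", 300, 99, 40, 20),
    Word("MILK", 10, 130, 50, 20),
    Word("007874235186", 100, 131, 120, 20),
    Word("2.78", 300, 132, 40, 20),
]
print(match_prices(words))
```

    On a skewed photo a horizontal line is not enough; one would fit a line through the character centers of the UPC instead, as described above.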

    Also, some commercial solutions have customisations for receipt scanning, and can even run very fast on mobile devices.

    The company I work with, MicroBlink, has an OCR module for mobile devices. If you're on iOS, you can easily try it using CocoaPods:

    pod try PPBlinkOCR
    
  • 2020-12-23 11:24

    I ended up fully fleshing this out and am pretty happy with the results, so I thought I would post it in case anyone else ever finds it useful.

    I did not have to do any image splitting and instead used a regex, since the Wal-Mart receipts are so predictable.

    I am on Windows, so I created a PowerShell script to run the conversion commands and the regex find & replace:

    # -----------------------------------------------------------------
    # Script: ParseReceipt.ps1
    # Author: Jim Sanders
    # Date: 7/27/2015
    # Keywords: tesseract OCR ImageMagick CSV
    # Comments:
    #   Used to convert a Wal-mart receipt image to a CSV file
    # -----------------------------------------------------------------
    param(
        [Parameter(Mandatory=$true)] [string]$image
    ) # end param
    
    # create output and temporary files based on input name
    $base = (Get-ChildItem -Filter $image -File).BaseName
    $csvOutfile = $base + ".txt"
    $upscaleImage = $base + "_150.png"
    $ocrFile = $base + "_ocr"
    
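    # upscale by 150% to ensure OCR works consistently
    # (ImageMagick's convert must come before Windows' own convert.exe on the PATH)
    convert $image -resize 150% $upscaleImage
    
    # perform the OCR to a temporary file
    # (-psm is the Tesseract 3.x spelling; Tesseract 4 and later expect --psm)
    tesseract $upscaleImage -psm 6 $ocrFile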
    
    # column headers for the CSV (Out-File appends the line break itself)
    $newline = "Description,UPC,Type,Cost,TaxType"
    $newline | Out-File $csvOutfile
    
    # read in the OCR file and write back out the CSV (Tesseract automatically adds .txt to the file name)
    $lines = Get-Content "$ocrFile.txt"
    
    Foreach ($line in $lines) {
        # This wraps the 12 digit UPC code and the price with commas, giving us our 5 columns for CSV
        $newline = $line -replace '\s\d{12}\s',',$&,' -replace '.\d+\.\d{2}.',',$&,' -replace ',\s',',' -replace '\s,',','
        $newline | Out-File -Append $csvOutfile
    }
    
    # clean up temporary files
    del $upscaleImage
    del "$ocrFile.txt"
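
    To see what the -replace chain in the loop does to a single line, here is the same sequence of four substitutions as a small Python sketch (Python only so that it can be run in isolation; `\g<0>` plays the role of `$&`). The sample receipt line is made up and `to_csv_line` is a hypothetical helper name.

```python
import re


def to_csv_line(line):
    """Apply the same four substitutions as the -replace chain in the script."""
    # Wrap the 12-digit UPC (with its surrounding whitespace) in commas.
    line = re.sub(r"\s\d{12}\s", r",\g<0>,", line)
    # Wrap the price, plus one character on either side, in commas.
    line = re.sub(r".\d+\.\d{2}.", r",\g<0>,", line)
    # Drop the whitespace left after and before the inserted commas.
    line = re.sub(r",\s", ",", line)
    line = re.sub(r"\s,", ",", line)
    return line


print(to_csv_line("GV OATMEAL 007874243408 F 1.98 X"))
# prints: GV OATMEAL,007874243408,F,1.98,X
```

    Lines without both a 12-digit number and a price pass through with fewer commas, which is where the non-item rows in the output come from.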
    

    The resulting file needs to be opened in Excel and split with the Text to Columns feature, with the UPC column imported as text, so that Excel won't ruin the UPC codes by auto-converting them to numbers. This is a well-known problem that I won't dive into; there are a multitude of ways to handle it, and I settled on this slightly more manual one.

    I would have been happiest with a simple .csv I could double-click, but I couldn't find a good way to do that without mangling the UPC codes even more, for example by wrapping them in this format:

     "=""12345"""
    

    That does work but I wanted the UPC code to be just the digits alone as text in Excel in case I am able to later do a lookup against the Wal-mart API.
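
    For anyone who does want the double-clickable variant, the wrapped format above is just what a CSV writer produces when the field itself is the formula `="12345"`, because the inner quotes get doubled. A minimal Python sketch, with a made-up row and a hypothetical `excel_text_cell` helper:

```python
import csv
import io


def excel_text_cell(value):
    """Wrap a value as an Excel formula that evaluates to a text string."""
    return f'="{value}"'


buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
# The csv module quotes the field and doubles the embedded quotes itself.
writer.writerow(["GV OATMEAL", excel_text_cell("007874243408"), "F", "1.98", "X"])
print(buf.getvalue(), end="")
# prints: GV OATMEAL,"=""007874243408""",F,1.98,X
```

    The trade-off is the one described above: the cell then holds a formula rather than the bare digits.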

    Anyway, here is how they look after importing and some quick formatting:

    https://s3.postimg.cc/b6cjsb4bn/Receipt_Excel.png

    I still need to do some garbage cleaning on the rows that aren't line items, but that only takes a few seconds, so it doesn't bother me too much.
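
    If the manual cleanup ever gets tedious, one possible way to automate it is to keep only the rows that have all five columns and a 12-digit UPC in the second one. This is a sketch in Python, not part of the script above; the sample rows and the `line_items` helper are invented for illustration.

```python
import csv
import io
import re

UPC = re.compile(r"\d{12}")


def line_items(rows):
    """Keep only rows with five columns whose second column is a 12-digit UPC."""
    return [r for r in rows if len(r) == 5 and UPC.fullmatch(r[1].strip())]


sample = io.StringIO(
    "Description,UPC,Type,Cost,TaxType\n"
    "GV OATMEAL,007874243408,F,1.98,X\n"
    "SUBTOTAL,12.34\n"
    "THANK YOU FOR SHOPPING\n"
)
rows = list(csv.reader(sample))
print(line_items(rows))
```

    Note that this also drops the header row and any real item whose UPC was misread by the OCR, so it is worth checking the discarded rows before trusting it.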

    Thanks for the nudge in the right direction, @RevJohn. I would not have thought to try simply scaling the image, but that made all the difference in the world with Tesseract!
