How to extract relevant information from a receipt

天命终不由人 2021-02-06 11:23

I am trying to extract information from a range of different receipts using a combination of OpenCV, Tesseract and Keras. The end result of the project is that I should be able

2 answers
  • 2021-02-06 11:49

    It's a good idea to use the image, as you will lose the structure of the document if you just use plain OCR. I think you are on the right track. I would segment the bill into headers, total amount, and line items, and train an image classifier on those segments. Then you could use it to clean and extract the relevant information you need from the text.

  • 2021-02-06 11:53

    My answer isn't as fancy as what's in fashion right now, but I think it works in your case, especially if this is for a product (not for research & publication purposes).

    I would implement the paper "Text/Graphics Separation Revisited". I have already implemented it in both MATLAB & C++, and I guarantee from your description that it won't take you long. In summary:

    1. Get all connected components with stats. You're especially interested in the bounding box for each character.

    2. The paper obtains thresholds from histograms of the properties of your connected components, which makes it fairly robust. Using these thresholds (which work surprisingly well) on the geometrical properties of your connected components, discard anything that's not a character.

    3. For your characters, get the centroids of all their bounding boxes and group the close centroids by your own criteria (height, vertical position, Euclidean distance, etc.). Use the resulting centroid clusters to create rectangular text regions.

    4. Associate text regions of the same height and vertical position.

    5. Run OCR on your text regions and look for keywords like "Cash". I honestly think you can get away with keeping keyword dictionaries in text files, and from having done computer vision for mobile I know your resources are limited (by privacy constraints too).
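
    Steps 1 to 4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's method: the histogram-derived thresholds are replaced by a simple median-based rule, the function names (`connected_components`, `keep_character_like`, `group_into_lines`) are hypothetical, and the input is assumed to be an already binarized image given as a list of rows of 0/1 values. In a real pipeline the boxes would more likely come from OpenCV's `cv2.connectedComponentsWithStats`.

```python
from collections import deque
from statistics import median


def connected_components(grid):
    """Return bounding boxes (x, y, w, h) of 8-connected foreground blobs."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Flood-fill one blob and track its extent.
            seen[r][c] = True
            queue = deque([(r, c)])
            x0 = x1 = c
            y0 = y1 = r
            while queue:
                cr, cc = queue.popleft()
                x0, x1 = min(x0, cc), max(x1, cc)
                y0, y1 = min(y0, cr), max(y1, cr)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and grid[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return boxes


def keep_character_like(boxes, max_ratio=3.0):
    """Drop boxes far larger than the median box.

    Assumption: a stand-in for the paper's histogram-based thresholds.
    """
    if not boxes:
        return []
    med_w = median(b[2] for b in boxes)
    med_h = median(b[3] for b in boxes)
    return [b for b in boxes
            if b[2] <= max_ratio * med_w and b[3] <= max_ratio * med_h]


def group_into_lines(boxes, y_tol=0.5):
    """Group boxes with similar vertical centres; return one rectangle per line."""
    lines = []  # each line is a list of boxes
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        cy = box[1] + box[3] / 2
        if lines:
            last = lines[-1]
            last_cy = sum(b[1] + b[3] / 2 for b in last) / len(last)
            if abs(cy - last_cy) <= y_tol * box[3]:
                last.append(box)
                continue
        lines.append([box])
    regions = []
    for line in lines:
        x0 = min(b[0] for b in line)
        y0 = min(b[1] for b in line)
        x1 = max(b[0] + b[2] for b in line)
        y1 = max(b[1] + b[3] for b in line)
        regions.append((x0, y0, x1 - x0, y1 - y0))
    return regions
```

    Each returned rectangle could then be cropped and passed to the OCR engine for step 5. The tolerances `max_ratio` and `y_tol` are arbitrary starting values and would need tuning on real receipts.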

    I honestly don't think a neural net will be much better than some kind of keyword matching (e.g. using Levenshtein distance or something similar to add a bit of robustness) because you will need to manually create and label these words anyway to create your training dataset, so... Why not just write them down instead?
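
    A minimal sketch of that kind of fuzzy keyword matching, assuming the OCR output has already been split into tokens. The keyword list and the `match_keyword` helper are hypothetical and only for illustration; the distance function is the standard dynamic-programming Levenshtein distance.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


# Illustrative keyword dictionary; a real one might be loaded from a text file.
KEYWORDS = ("cash", "total", "change")


def match_keyword(token, keywords=KEYWORDS, max_dist=1):
    """Return the closest keyword within max_dist edits, or None."""
    token = token.lower()
    best = min(keywords, key=lambda k: levenshtein(token, k))
    return best if levenshtein(token, best) <= max_dist else None
```

    With `max_dist=1`, an OCR misread such as "Cesh" still maps to "cash", while an unrelated token such as "Milk" is rejected. Longer keywords may tolerate a larger `max_dist`.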

    That's basically it. You end up with something very fast (especially if you want to use a phone and can't send images to a server) and it just works. No machine learning needed, so no dataset needed either.

    But if this is for school... Sorry I was so rude. Please use TensorFlow with 10,000 manually labeled receipt images and natural language processing methods, your professor will be happy.
