Read specific value based on label name from PDF in C#

梦谈多话 2021-01-19 11:07

I have an ASP.NET Core 2.0 C# application that reads and parses a PDF file and extracts its text. From that text, I want to read a specific value that is identified by a specific label name.

3 answers
  •  夕颜 (original poster)
    2021-01-19 11:30

    It isn't immediately clear from the answer whether pdfText will contain the Invoice number along with the rest of the text, but I'll assume it does. If it doesn't, then you will need OCR, which is a different beast entirely.

    My first instinct would be to build a regex (^\d{6}$ in this case) and try to apply it to all the text on the page. If there is only one match (the invoice #), then great! Otherwise, if it matches more things, you could find all occurrences and look for a pattern. For example, if customers had an ID that also matched that regex, you could extract all lines that contain a matching number and discard the lines that also contain some other info (maybe all lines with a customer # would also have a date in a specific format, for instance). Basically, find every place where the regex could match, then work out rules to exclude the occurrences you don't care about.
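
The match-then-exclude idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `InvoiceNumberFinder`, `FindCandidates`, and the dd/MM/yyyy date rule are hypothetical names and rules, not anything from the question, and `pdfText` is assumed to already hold the extracted page text. It uses `\b\d{6}\b` rather than `^\d{6}$`, since the anchored form only matches when the six digits make up the whole string (or a whole line, with `RegexOptions.Multiline`).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class InvoiceNumberFinder
{
    // Six digits delimited by word boundaries, so the number may sit anywhere in a line.
    private static readonly Regex SixDigits = new Regex(@"\b\d{6}\b");

    // Hypothetical exclusion rule: a line that also carries a dd/MM/yyyy date
    // is assumed to be a customer line, not the invoice line.
    private static readonly Regex DateLike = new Regex(@"\b\d{2}/\d{2}/\d{4}\b");

    public static List<string> FindCandidates(string pdfText)
    {
        return pdfText
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Where(line => !DateLike.IsMatch(line))                      // drop excluded lines
            .SelectMany(line => SixDigits.Matches(line).Cast<Match>())   // collect remaining matches
            .Select(match => match.Value)
            .ToList();
    }
}
```

If `FindCandidates` returns exactly one value, that is presumably the invoice number; if it returns several, more exclusion rules are needed for the document layout at hand.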
