I have an asp.net Core 2.0 C#
application which read/parse the PDF file and get the text. In this I want to read specific value which have specific label name.
It isn't immediately clear from the answer whether pdfText
will contain the Invoice number along with the rest of the text, but I'll assume it does. If it doesn't, then you will need OCR, which is a different beast entirely.
My first instinct would be to build a regex (^\d{6}$
) in this case and try to apply it on all text on the page. If there is only one match (the invoice #), then great! Otherwise if it matches more things, you could find all occurences and look for a pattern. For example, if customers had an ID that also matched that regex, you could extract all lines which contain a matching number, and discard all lines that contain some other info (maybe all lines with a customer # would also have a date in a specific format for instance). Basically find all occurences where the regex could match, and try to find rules to exclude all the occurences you don't care about.
Assuming that the invoice label and invoice number is embedded as text in PDF and not as Bitmap.
One way that I can think of doing this is by using Spire.PDF and extract location of the label, and then find the number written right below that location. This will be relatively simple if you have same template of all the PDFs you want to process.
I have helped a friend extracting similar value from pdf invoice generated by Excel arc. I have for this answer created an Excel invoice and print it as PDF file and zipped for download for testing purpose.
The next thing I do, I am using an Open Source and Free Library called PDFClown. Here is the nuget package for it.
So far so good, what I did is I scan all pdf document (for example invoice can be one page or multiple pages) add each content to a list of string.
The next step I find the index (the invoice number index could be in 10th element in list, in our case it is index 1) that refer to invoice value which I will call Tag or Label.
Hence I do not have your pdf file, I improvised and added a unique Tag called (or any other name) "INVOICE". The invoice number in this case comes after invoice tag tag. So I find the index of "INVOICE" tag and add 1 to index this is because the invoice number follow the invoice tag. This way I will pick the invoice text 0005 in this case and return it as value 5. This way you can fetch what every text/value followed by any tag scanned in our list and return it the way that you need.
So you need to play with it a bit to fit it 100% to your pdf file.
So here is my test files Excel and Pdf zipped down. Download it for your test.
Here is the code:
public class InvoiceTextExtraction
{
private List<string> _contentList;
public void GetValueFromPdf()
{
_contentList = new List<string>();
CreatePdfContent(@"C:\temp\Invoice1.pdf");
var index = _contentList.FindIndex(e => e == "INVOICE") + 1;
int.TryParse(_contentList[index], out var value);
Console.WriteLine(value);
}
public void CreatePdfContent(string filePath)
{
using (var file = new File(filePath))
{
var document = file.Document;
foreach (var page in document.Pages)
{
Extract(new ContentScanner(page));
}
}
}
private void Extract(ContentScanner level)
{
if (level == null)
return;
while (level.MoveNext())
{
var content = level.Current;
switch (content)
{
case ShowText text:
{
var font = level.State.Font;
_contentList.Add(font.Decode(text.Text));
break;
}
case Text _:
case ContainerObject _:
Extract(level.ChildLevel);
break;
}
}
}
}
Input extracted from pdf file. The code scan return following elements:
INVOICE
0005
PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019
and here is the result
5
The code is inspired from this link.