I am having a problem reading some data from a PDF file.
My file is structured and contains tables and plain text. The standard parser reads data from separate columns a
The OP's sample file contains multiple sections like this one:
And the OP mentioned in a comment:
another one tool parse my PDF exactly like I want. [...]
PS: this tool is pdfbox
Using PDFBox (v1.8.10, the current release version) in this method:
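import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package as of PDFBox 1.8.x

String extract(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(document);
}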
returns for the section shown above
Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS MEDICAL:
CATY
MEDICAL
Trip #: 314-A
Comments: ----LIVERY---
Destination:Pick-up:
Call Type: Livery
REGO PARK, (631)
000-0000
(718) 896-5953
74- AVE 204E HEIGHTS, NY
11372 (718) 639-4154
11:00:00 PAT, MIKHAIL
Trip #: 314-B
Comments: ----LIVERY---
Destination:Pick-up:
Call Type: Livery
74- AVE 204E HEIGHTS, NY
11372 (718) 639-4154
63-6 REGO PARK, NY
11374 (631) 000-0000
11:01:00 PAT, MIKHAIL
This is not really a neat column-wise extraction but certain blocks of information (like address blocks) remain together.
Getting the same output with iText(Sharp) actually is very easy: One merely has to explicitly use the SimpleTextExtractionStrategy instead of the LocationTextExtractionStrategy which is used by default, i.e. one has to replace this line
page = PdfTextExtractor.GetTextFromPage(reader, i);
by
page = PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy());
With the exception of one space character per dataset (iText(Sharp) extracts "Destination: Pick-up:" instead of "Destination:Pick-up:") the results are identical.
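If one wants to check such a claim programmatically, a whitespace-insensitive comparison is enough to ignore the single extra space. The following is only a sketch in plain Java; the class and method names are made up for illustration and are not part of iText or PDFBox.

```java
// Hypothetical helper: compares two extraction results while ignoring
// all whitespace, so "Destination: Pick-up:" equals "Destination:Pick-up:".
class ExtractionCompare {
    static boolean sameIgnoringWhitespace(String a, String b) {
        // Remove every run of whitespace (spaces, tabs, line breaks) first.
        String left = a.replaceAll("\\s+", "");
        String right = b.replaceAll("\\s+", "");
        return left.equals(right);
    }

    public static void main(String[] args) {
        System.out.println(sameIgnoringWhitespace(
                "Destination: Pick-up:", "Destination:Pick-up:"));
    }
}
```

Note that this also hides differences in line breaks, which may or may not be desired when comparing extractors.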
Concerning your conclusion from PDFBox extracting the text as it does:
So I think that PDF is really table structured.
Actually this order of extraction means merely that the operations for drawing the string segments in the PDF page content stream occur in this very order. As the order of those operations is arbitrary according to the PDF specification, any update of the software generating those PDFs may result in files from which the PDFBox PDFTextStripper and the iText SimpleTextExtractionStrategy extract merely an unintelligible soup of characters.
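To make this fragility concrete, here is a minimal sketch in plain Java without any PDF library. The Chunk class, the coordinates, and the inStreamOrder method are hypothetical stand-ins that merely mimic an extractor which concatenates strings in the order they are drawn; they are not iText or PDFBox API.

```java
import java.util.Arrays;
import java.util.List;

class StreamOrderDemo {
    // Hypothetical model of one string-drawing operation on a page.
    static final class Chunk {
        final double x, y;
        final String text;
        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Mimics stream-order extraction: positions are ignored entirely,
    // the chunks are simply joined in the order they were drawn.
    static String inStreamOrder(List<Chunk> drawn) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : drawn) {
            if (sb.length() > 0) sb.append('\n');
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates, y grows towards the top of the page.
        Chunk label = new Chunk(50, 700, "Trip #: 314-A");
        Chunk time = new Chunk(50, 680, "11:00:00");
        Chunk name = new Chunk(120, 680, "PAT, MIKHAIL");

        // Both lists describe the same visible page, drawn in different orders.
        System.out.println(inStreamOrder(Arrays.asList(label, time, name)));
        System.out.println("---");
        System.out.println(inStreamOrder(Arrays.asList(name, label, time)));
    }
}
```

The two calls produce different text for an identical page appearance, which is exactly what can happen after an update of the generating software.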
PS: If one sets the PDFBox PDFTextStripper property SortByPosition to true like this
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
return stripper.getText(document);
then PDFBox extracts the text just like iText(Sharp) with the (default) LocationTextExtractionStrategy does.
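The idea behind position-based extraction can be sketched in plain Java as follows. This is a simplified illustration only: the Chunk class and the sortedByPosition method are hypothetical, the coordinates are made up, and the real implementations in PDFBox and iText are considerably more involved (for instance, they have to tolerate slightly differing baselines).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PositionSortDemo {
    // Hypothetical model of one string-drawing operation on a page.
    static final class Chunk {
        final double x, y;
        final String text;
        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Orders chunks top to bottom (larger y first), then left to right,
    // and joins chunks on the same baseline with a space.
    static String sortedByPosition(List<Chunk> drawn) {
        List<Chunk> sorted = new ArrayList<>(drawn);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            // Exact comparison of y is a simplification for this sketch.
            if (prev != null) sb.append(prev.y == c.y ? ' ' : '\n');
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Chunk label = new Chunk(50, 700, "Trip #: 314-A");
        Chunk time = new Chunk(50, 680, "11:00:00");
        Chunk name = new Chunk(120, 680, "PAT, MIKHAIL");
        // The drawing order no longer influences the result.
        System.out.println(sortedByPosition(Arrays.asList(name, label, time)));
    }
}
```

The price for this stability is that blocks which were drawn together, such as a multi-line address, may be torn apart when other text shares their lines.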
The OP indicated interest in a block structure inherent in the content stream. The most obvious structure like that in a generic PDF would be the text objects (in which multiple strings may be drawn).
In the case at hand the SimpleTextExtractionStrategy is used. It can easily be extended to also include markers corresponding to the start and end of text objects in its output. In Java this can be done by using an anonymous class like this:
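return PdfTextExtractor.getTextFromPage(reader, pageNo, new SimpleTextExtractionStrategy()
{
    // Stays true until the first text has been rendered.
    boolean empty = true;

    @Override
    public void beginTextBlock()
    {
        // Start marker; the literal "<BLOCK>" is an assumption, any distinctive string works.
        if (!empty)
            appendTextChunk("<BLOCK>");
        super.beginTextBlock();
    }

    @Override
    public void endTextBlock()
    {
        // End marker (assumed literal) followed by a line break.
        if (!empty)
            appendTextChunk("</BLOCK>\n");
        super.endTextBlock();
    }

    @Override
    public String getResultantText()
    {
        // The first block gets no start marker in beginTextBlock, so prepend it here.
        if (empty)
            return super.getResultantText();
        else
            return "<BLOCK>" + super.getResultantText();
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        empty = false;
        super.renderText(renderInfo);
    }
});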
(TextExtraction.java method extractSimple)
(This Java code should easily be translatable into C#. The playing around with an empty boolean may look funny; it is necessary, though, because the base class assumes certain additional properties to be set as soon as some chunk has been appended to the extracted content.)
Using this extended strategy one gets for the section shown above:
Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS MEDICAL:
CATY
MEDICAL
Trip #: 314-A
Comments: ----LIVERY---
Destination: Pick-up:
Call Type: Livery
REGO PARK, (631)
000-0000
(718) 896-5953
74- AVE 204E HEIGHTS, NY
11372 (718) 639-4154
11:00:00
PAT, MIKHAIL
Trip #: 314-B
Comments: ----LIVERY---
Destination: Pick-up:
Call Type: Livery
74- AVE 204E HEIGHTS, NY
11372 (718) 639-4154
63-6 REGO PARK, NY
11374 (631) 000-0000
11:01:00
PAT, MIKHAIL
As this keeps addresses in the same block, this might help during extraction.
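Once the extracted text carries start and end markers, splitting it into blocks is straightforward. The following plain Java sketch assumes the markers are passed in as strings; the class name, the method name, the marker literals, and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockSplitter {
    // Returns the trimmed content of every start/end marker pair, in order.
    static List<String> splitBlocks(String text, String start, String end) {
        // Quote the markers so they are matched literally; DOTALL lets a
        // block span several lines, and the reluctant ".*?" stops at the
        // first end marker.
        Pattern p = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        List<String> blocks = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Made-up sample with assumed marker strings.
        String extracted = "<BLOCK>Trip #: 314-A</BLOCK>\n"
                + "<BLOCK>REGO PARK, (631)\n000-0000</BLOCK>\n";
        for (String block : splitBlocks(extracted, "<BLOCK>", "</BLOCK>")) {
            System.out.println("[" + block + "]");
        }
    }
}
```

A multi-line address then arrives as a single list entry, which can be parsed without having to guess where it ends.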