How to read data from table-structured PDF using itextsharp?

逝去的感伤 2021-02-10 01:02

I am having a problem reading some data from a PDF file.
My file is structured and contains tables and plain text. Standard parser reads data from separate columns a

1 answer

感情败类 (thread starter)

2021-02-10 01:48
The OP's sample file contains multiple sections like this one:

And the OP mentioned in a comment:

> another one tool parse my PDF exactly like I want. [...]
>
> PS: this tool is pdfbox

Using PDFBox (v1.8.10, the current release version at the time of writing) in this method:
```
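import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package as of PDFBox 1.8.x

// Extracts the text in the order in which it is drawn in the content stream.
String extract(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(document);
}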
```
returns for the section shown above
```
Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY
 MEDICAL
Trip #: 314-A
Comments: ----LIVERY---
Destination:Pick-up:
Call Type: Livery

REGO PARK,  (631) 
000-0000
(718) 896-5953
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154
11:00:00 PAT, MIKHAIL
Trip #: 314-B
Comments:  ----LIVERY---
Destination:Pick-up:
Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154

63-6 REGO PARK, NY 
11374 (631) 000-0000
11:01:00 PAT, MIKHAIL
```
This is not really a neat column-wise extraction but certain blocks of information (like address blocks) remain together.

Getting the same output with iText(Sharp) is actually very easy: one merely has to explicitly use the SimpleTextExtractionStrategy instead of the LocationTextExtractionStrategy, which is used by default. That is, one has to replace this line
```
page = PdfTextExtractor.GetTextFromPage(reader, i);
```
with
```
page = PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy());
```
With the exception of one space character per dataset (iText(Sharp) extracts `Destination: Pick-up:` instead of `Destination:Pick-up:`), the results are identical.

Concerning your conclusion from PDFBox extracting the text as it does:

> So I think that PDF is really table structured.

Actually, this order of extraction merely means that the operations for drawing the string segments in the PDF page content stream occur in this very order. As the PDF specification leaves the order of those operations arbitrary, any update of the software generating those PDFs may result in files from which the PDFBox PDFTextStripper and the iText SimpleTextExtractionStrategy extract merely an unintelligible soup of characters.

PS: If one sets the PDFBox PDFTextStripper property SortByPosition to true like this
```
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);
```
then PDFBox extracts the text just like iText(Sharp) with the (default) LocationTextExtractionStrategy does.
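
To make the difference between the two extraction orders concrete, here is a minimal sketch in plain Java. It does not use PDFBox or iText; the `Chunk` class, the coordinates, and the two-column layout are assumptions made up for illustration, and the real strategies deal with tolerances, fonts, and spacing that this toy ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: nothing here is part of PDFBox or iText.
class ExtractionOrderDemo {

    // One drawn string segment and the (assumed) page position where it starts.
    static final class Chunk {
        final String text;
        final double x;
        final double y; // PDF convention: y grows upward

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Two address columns, drawn column by column (hypothetical coordinates).
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk("74- AVE 204E", 50, 700),
                new Chunk("HEIGHTS, NY 11372", 50, 688),
                new Chunk("63-6", 300, 700),
                new Chunk("REGO PARK, NY 11374", 300, 688));
    }

    // Content-stream order: the order in which the drawing operations occur.
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text).append('\n');
        }
        return sb.toString();
    }

    // Position order: top to bottom, then left to right within a line.
    static String inPositionOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            if (previous != null) {
                // Same baseline: join with a space; otherwise start a new line.
                sb.append(previous.y == c.y ? ' ' : '\n');
            }
            sb.append(c.text);
            previous = c;
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(inStreamOrder(sample()));
        System.out.println("---");
        System.out.print(inPositionOrder(sample()));
    }
}
```

In stream order each address stays together because its two lines were drawn one after the other. In position order the first lines of both columns end up on one output line, which is the kind of cross-column mixing that position-based extraction produces for such layouts.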

The OP indicated interest in a block structure inherent in the content stream. The most obvious structure like that in a generic PDF would be the text objects (in which multiple strings may be drawn).
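
For illustration: in a page content stream a text object is the span between the operators `BT` and `ET`, and each `Tj` inside it draws one string. The following sketch groups the drawn strings by text object for a simplified, hand-written stream. The stream content, the class name `TextObjectScanner`, and the restriction to `Tj` with strings free of nested or escaped parentheses are all assumptions; it is a toy, not a substitute for a real content-stream parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy scanner for a simplified content stream; not a real PDF parser.
class TextObjectScanner {

    // Matches the operators BT and ET, or a literal string followed by Tj.
    // Assumption: strings contain no nested or escaped parentheses.
    private static final Pattern TOKEN =
            Pattern.compile("\\b(BT|ET)\\b|\\(([^()]*)\\)\\s*Tj");

    // Hand-written sample: two text objects, each drawing two strings.
    static String sample() {
        return "BT /F1 10 Tf 50 700 Td (74- AVE 204E) Tj 0 -12 Td (HEIGHTS, NY 11372) Tj ET\n"
                + "BT /F1 10 Tf 300 700 Td (63-6) Tj 0 -12 Td (REGO PARK, NY 11374) Tj ET\n";
    }

    // Returns one list of drawn strings per BT ... ET text object.
    static List<List<String>> textObjects(String contentStream) {
        List<List<String>> objects = new ArrayList<>();
        List<String> current = null;
        Matcher m = TOKEN.matcher(contentStream);
        while (m.find()) {
            if ("BT".equals(m.group(1))) {
                current = new ArrayList<>();      // a text object begins
            } else if ("ET".equals(m.group(1))) {
                if (current != null) {
                    objects.add(current);         // a text object ends
                }
                current = null;
            } else if (current != null) {
                current.add(m.group(2));          // a string drawn inside it
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        for (List<String> object : textObjects(sample())) {
            System.out.println(object);
        }
    }
}
```

Whether a PDF producer puts one address, one line, or a whole page into a single text object is up to the producer, so this grouping is only as meaningful as the file at hand makes it.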

In the case at hand the SimpleTextExtractionStrategy is used. It can easily be extended to also include markers corresponding to the start and end of text objects in its output. In Java this can be done by using an anonymous class like this:
```
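return PdfTextExtractor.getTextFromPage(reader, pageNo, new SimpleTextExtractionStrategy()
{
    // True until the first text chunk has been rendered.
    boolean empty = true;

    @Override
    public void beginTextBlock()
    {
        // Start-of-block marker (empty in this listing; any visible marker string could be used).
        if (!empty)
            appendTextChunk("");
        super.beginTextBlock();
    }

    @Override
    public void endTextBlock()
    {
        // End-of-block marker: a line break that separates the blocks in the output.
        if (!empty)
            appendTextChunk("\n");
        super.endTextBlock();
    }

    @Override
    public String getResultantText()
    {
        // The first beginTextBlock call is skipped while empty is still true,
        // so the start marker of the first block is prepended here.
        if (empty)
            return super.getResultantText();
        else
            return "" + super.getResultantText();
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        empty = false;
        super.renderText(renderInfo);
    }
});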
```
(TextExtraction.java method extractSimple)

(This Java code should easily be translatable into C#. The playing around with an empty boolean may look funny; it is necessary, though, because the base class assumes certain additional properties to be set as soon as some chunk has been appended to the extracted content.)

Using this extended strategy one gets for the section shown above:
```
Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY

 MEDICAL

Trip #: 314-A

Comments: ----LIVERY---

Destination: Pick-up:

Call Type: Livery

REGO PARK,  (631) 
000-0000
(718) 896-5953

74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154

11:00:00
 PAT, MIKHAIL

Trip #: 314-B

Comments:  ----LIVERY---

Destination: Pick-up:

Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154


63-6 REGO PARK, NY 
11374 (631) 000-0000

11:01:00
 PAT, MIKHAIL
```
As this keeps addresses in the same block, this might help during extraction.
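
If the blocks are separated by blank lines as in the output above, a post-processing step could split the extracted text into blocks before looking for addresses. A minimal sketch in plain Java, assuming that blank-line separation holds for the file at hand (the class and method names are hypothetical, and the sample is a shortened excerpt of the output above):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing helper; not part of iText or PDFBox.
class BlockSplitter {

    // Shortened excerpt of the extracted text shown above.
    static String sample() {
        return "Trip #: 314-A\n"
                + "\n"
                + "Comments: ----LIVERY---\n"
                + "\n"
                + "REGO PARK,  (631) \n"
                + "000-0000\n"
                + "(718) 896-5953\n"
                + "\n"
                + "74- AVE 204E  HEIGHTS, NY \n"
                + "11372 (718) 639-4154\n";
    }

    // Splits the text at runs of one or more blank lines and trims each block.
    static List<String> splitBlocks(String extracted) {
        List<String> blocks = new ArrayList<>();
        for (String block : extracted.split("\\R(?:[ \\t]*\\R)+")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                blocks.add(trimmed);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> blocks = splitBlocks(sample());
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println("Block " + i + ":");
            System.out.println(blocks.get(i));
        }
    }
}
```

Note that in the output above the `Call Type: Livery` line of trip 314-B is directly followed by an address without a blank line, so a split like this is only a starting point and the resulting blocks still need to be checked against the expected content.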