How to read data from a table-structured PDF using iTextSharp?

逝去的感伤 2021-02-10 01:02

I am having a problem reading some data from a PDF file.
My file is structured and contains tables and plain text. Standard parser reads data from separate columns a

1 answer
  • 2021-02-10 01:48

    The OP's sample file contains multiple sections like this one:

    And the OP mentioned in a comment:

    another one tool parse my PDF exactly like I want. [...]

    PS: this tool is pdfbox

    Using PDFBox (v1.8.10, the current release version) in this method:

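    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper; // package location in PDFBox 1.8.x

    // With default settings the stripper does not sort the text by position
    String extract(PDDocument document) throws IOException
    {
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(document);
    }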
    

    returns for the section shown above

    Driver Book for 8/5/2015
    Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
    CATY
     MEDICAL
    Trip #: 314-A
    Comments: ----LIVERY---
    Destination:Pick-up:
    Call Type: Livery
    <Doctor Office>
    REGO PARK,  (631) 
    000-0000
    (718) 896-5953
    74- AVE 204E  HEIGHTS, NY 
    11372 (718) 639-4154
    11:00:00 PAT, MIKHAIL
    Trip #: 314-B
    Comments:  ----LIVERY---
    Destination:Pick-up:
    Call Type: Livery
    74- AVE 204E  HEIGHTS, NY 
    11372 (718) 639-4154
    <Doctor Office>
    63-6 REGO PARK, NY 
    11374 (631) 000-0000
    11:01:00 PAT, MIKHAIL
    

    This is not really a neat column-wise extraction but certain blocks of information (like address blocks) remain together.

    Getting the same output with iText(Sharp) is easy: explicitly use the SimpleTextExtractionStrategy instead of the LocationTextExtractionStrategy, which is used by default. That is, one has to replace this line

    page = PdfTextExtractor.GetTextFromPage(reader, i);
    

    by

    page = PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy());
    

    With the exception of one space character per dataset (iText(Sharp) extracts Destination: Pick-up: instead of Destination:Pick-up:) the results are identical.


    Concerning your conclusion from PDFBox extracting the text as it does:

    So I think that PDF is really table structured.

    Actually this order of extraction means merely that the operations for drawing the string segments in the PDF page content stream occur in this very order. As the order of those operations is arbitrary according to the PDF specification, any update of the software generating those PDFs may result in files from which the PDFBox PDFTextStripper and the iText SimpleTextExtractionStrategy extract merely an unintelligible soup of characters.


    PS: If one sets the PDFBox PDFTextStripper property SortByPosition to true like this

        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    

    then PDFBox extracts the text just like iText(Sharp) does with the (default) LocationTextExtractionStrategy.
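
    To make the difference between the two extraction orders concrete, here is a minimal toy model in plain Java. It is only a sketch: the class ExtractionOrderDemo, the Chunk type, and the sample coordinates are invented for illustration and are not part of iText or PDFBox, whose real strategies handle many more cases (rotated text, word spacing, and so on). It only shows why drawing order can keep a column together while position sorting interleaves columns line by line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: hypothetical types, not iText or PDFBox classes.
class ExtractionOrderDemo
{
    // One drawn string segment and the position of its start point.
    // PDF coordinates grow upwards, so a larger y is higher on the page.
    static final class Chunk
    {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text)
        {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Content-stream order: emit the chunks exactly as they were drawn.
    static String inDrawingOrder(List<Chunk> chunks)
    {
        List<String> texts = new ArrayList<>();
        for (Chunk chunk : chunks)
            texts.add(chunk.text);
        return String.join("\n", texts);
    }

    // Position order: top to bottom, then left to right within a line.
    static String sortedByPosition(List<Chunk> chunks)
    {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed()
                .thenComparingDouble(c -> c.x));

        StringBuilder result = new StringBuilder();
        Chunk previous = null;
        for (Chunk chunk : sorted)
        {
            // Same baseline: same line, separate by a space; otherwise start a new line.
            if (previous != null)
                result.append(previous.y == chunk.y ? ' ' : '\n');
            result.append(chunk.text);
            previous = chunk;
        }
        return result.toString();
    }
}
```

    For a two-column layout in which the generating software draws the whole left column first, inDrawingOrder keeps each column together, whereas sortedByPosition merges the columns line by line. If the generating software changes its drawing order, only the first result changes.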


    The OP indicated interest in a block structure inherent in the content stream. The most obvious structure like that in a generic PDF would be the text objects (in which multiple strings may be drawn).

    In the case at hand the SimpleTextExtractionStrategy is used. It can easily be extended to also include markers corresponding to the start and end of text objects in its output. In Java this can be done by using an anonymous class like this:

    return PdfTextExtractor.getTextFromPage(reader, pageNo, new SimpleTextExtractionStrategy()
    {
        boolean empty = true;
    
        @Override
        public void beginTextBlock()
        {
            if (!empty)
                appendTextChunk("<BLOCK>");
            super.beginTextBlock();
        }
    
        @Override
        public void endTextBlock()
        {
            if (!empty)
                appendTextChunk("</BLOCK>\n");
            super.endTextBlock();
        }
    
        @Override
        public String getResultantText()
        {
            if (empty)
                return super.getResultantText();
            else
                return "<BLOCK>" + super.getResultantText();
        }
    
        @Override
        public void renderText(TextRenderInfo renderInfo)
        {
            empty = false;
            super.renderText(renderInfo);
        }
    });
    

    (TextExtraction.java method extractSimple)

    (This Java code should be easy to translate into C#. The handling of the boolean empty flag may look odd; it is necessary, though, because the base class assumes certain additional properties to be set as soon as some chunk has been appended to the extracted content.)

    Using this extended strategy one gets for the section shown above:

    <BLOCK>Driver Book for 8/5/2015
    Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
    CATY</BLOCK>
    <BLOCK>
     MEDICAL</BLOCK>
    <BLOCK>
    Trip #: 314-A</BLOCK>
    <BLOCK>
    Comments: ----LIVERY---</BLOCK>
    <BLOCK>
    Destination: Pick-up:</BLOCK>
    <BLOCK>
    Call Type: Livery
    <Doctor Office>
    REGO PARK,  (631) 
    000-0000
    (718) 896-5953</BLOCK>
    <BLOCK>
    74- AVE 204E  HEIGHTS, NY 
    11372 (718) 639-4154</BLOCK>
    <BLOCK>
    11:00:00</BLOCK>
    <BLOCK> PAT, MIKHAIL</BLOCK>
    <BLOCK>
    Trip #: 314-B</BLOCK>
    <BLOCK>
    Comments:  ----LIVERY---</BLOCK>
    <BLOCK>
    Destination: Pick-up:</BLOCK>
    <BLOCK>
    Call Type: Livery
    74- AVE 204E  HEIGHTS, NY 
    11372 (718) 639-4154</BLOCK>
    <BLOCK>
    <Doctor Office>
    63-6 REGO PARK, NY 
    11374 (631) 000-0000</BLOCK>
    <BLOCK>
    11:01:00</BLOCK>
    <BLOCK> PAT, MIKHAIL</BLOCK>
    

    As this keeps addresses in the same block, this might help during extraction.
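
    Once the text carries such markers, splitting it back into blocks takes only a few lines. The following is a minimal sketch in plain Java; the class BlockSplitter and its method splitBlocks are hypothetical helpers, not part of iText(Sharp). It assumes that the page text itself never contains the literal strings <BLOCK> or </BLOCK>; if that cannot be guaranteed, less ambiguous markers should be chosen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText(Sharp).
class BlockSplitter
{
    // DOTALL lets '.' match the line breaks inside a block; the reluctant
    // quantifier stops at the first closing marker.
    private static final Pattern BLOCK =
            Pattern.compile("<BLOCK>(.*?)</BLOCK>", Pattern.DOTALL);

    // Returns the content of each block, trimmed of surrounding whitespace.
    static List<String> splitBlocks(String extracted)
    {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = BLOCK.matcher(extracted);
        while (matcher.find())
            blocks.add(matcher.group(1).trim());
        return blocks;
    }
}
```

    Each returned entry then corresponds to one text object, so an address block like the ones above arrives as a single multi-line string. Keep in mind that this grouping depends on how the generating software organizes its text objects and may change with it.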
