PdfReaderContentParser.ProcessContent returns whitespace for clear text

Submitted by 旧城冷巷雨未停 on 2019-12-12 04:09:47

Question


I'd like to extract the text from a PDF that contains both binary and clear-text data. When I try to do it with PdfReaderContentParser, the GetResultantText method returns the right text for the binary content but only whitespace for the clear-text content. Here is the code I use:

        // Load the whole PDF into memory and open it
        byte[] binaryPdf = File.ReadAllBytes(this.fileName);
        reader = new PdfReader(binaryPdf);

        PdfReaderContentParser parser = new PdfReaderContentParser(reader);

        // Page numbers are 1-based
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Extract the text of the page content stream of page i
            SimpleTextExtractionStrategy simpleStrategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
            string contentText = simpleStrategy.GetResultantText();

            // Do something with the contentText
            // ...
        }

Any idea how to get all content?


Answer 1:


Overview

In a comment, the OP clarified which text was missing from the extraction result:

Basically for all descriptions on the left-hand side (e.g. Lifting moment) I get whitespaces instead of the actual text.

The reason for this is fairly simple: In the page content there are only spaces (if anything at all) on most of the left side. The labels you see actually are read-only form fields.

For example the "Lifting moment" is the value of the form field 13B141032.

If you want text extraction to include these fields, too, you should consider flattening the document in a first step (moving the field appearances into the regular page content stream) and extracting text from this flattened document.
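
To check which labels are form field values rather than page content, one could list the field names and their values directly. The following is only a minimal sketch, assuming iTextSharp 5.x; the class name FormFieldDump and the command-line path argument are hypothetical and not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class FormFieldDump
{
    // Prints every AcroForm field name together with its current value.
    public static void Main(string[] args)
    {
        // Hypothetical input: path to the PDF passed as first argument
        string path = args[0];

        using (PdfReader reader = new PdfReader(path))
        {
            AcroFields form = reader.AcroFields;

            // Fields maps each fully qualified field name to its field item
            foreach (KeyValuePair<string, AcroFields.Item> entry in form.Fields)
            {
                Console.WriteLine("{0} = {1}", entry.Key, form.GetField(entry.Key));
            }
        }
    }
}
```

If the labels really are form fields, a field such as 13B141032 should show up in this listing with the label text as its value. This only reads the field values; it does not say where on the page each one appears.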

Document analysis

It looks like the major part of the internationalization of the specification labels has been done using form fields.

For an overview, I separated the original document into its regular page content and its form fields.

There indeed are several strings of spaces in the page content under the form fields.

I would assume that there once was an earlier version of that document (or a template for it) which contained those labels (maybe in only one language or probably two) as page content.

Then came a requirement for more dynamic internationalization, so someone replaced the existing labels in the page content with spaces and added new internationalized labels as read-only form fields, probably because form fields are easier to manipulate.

Since the original labels seem to have been replaced by an equal number of spaces, though, one might speculate that yet another program manipulates the page stream of this and similar documents at hard-coded offsets. To avoid breaking that program during internationalization, the actual labels had to be created outside the page content. Stranger things have happened...

Flatten and extract

As mentioned above, if you want text extraction to include these fields, too, you should consider flattening the document in a first step (moving the field appearances into the regular page content stream) and extracting text from this flattened document. This can be done like this:

[Test]
public void ExtractFlattenedTextTestSeeb()
{
    FileInfo file = new FileInfo(@"PATH_TO_FILE\41851208.pdf");
    Console.Out.Write("41851208.pdf, flattened before extraction\n\n");

    using (MemoryStream memStream = new MemoryStream())
    {
        // Step 1: write a flattened copy of the document into memory
        using (PdfReader readerOrig = new PdfReader(file.FullName))
        using (PdfStamper stamper = new PdfStamper(readerOrig, memStream))
        {
            // Keep the memory stream open after the stamper is closed
            stamper.Writer.CloseStream = false;
            // Move the field appearances into the page content
            stamper.FormFlattening = true;
        }

        // Step 2: rewind and extract text from the flattened copy
        memStream.Position = 0;
        using (PdfReader readerFlat = new PdfReader(memStream))
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(readerFlat);

            for (int i = 1; i <= readerFlat.NumberOfPages; i++)
            {
                SimpleTextExtractionStrategy simpleStrategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
                string contentText = simpleStrategy.GetResultantText();

                Console.Write("Page {0}:\n\n{1}\n\n", i, contentText);
            }
        }
    }
}

The resulting standard output:

41851208.pdf, flattened before extraction

Page 1:

90–120 l/min 
(23.8–31.7 US gal./min) 
60 kg 
(132 lbs) 
115 kg 
(254 lbs) 
350 l 
(92.5 US gal.) 
100 kg 105 kg 
(220 lbs) (231 kg) 
100 kg 
(220 lbs) 
250 l 300 l 
(66.0 US gal.) (79.3 US gal.) 
90 kg 
(198 lbs) 
180 l 
(47.6 US gal.) 
5305kg 
(11695 lbs) 
5265kg 
(11607 lbs) 
5395kg 
(11894 lbs) 
5205kg 
(11475 lbs) 
5010kg 
(11045 lbs) 
4780kg 
(10538 lbs) 
4470kg 
(9854 lbs) 
4190kg 
(9237 lbs) 
3930kg 
(8664 lbs) 
5215kg 
(11497 lbs) 
5045kg 
(11122 lbs) 
4860kg 
(10714 lbs) 
4650kg 
(10251 lbs) 
4350kg 
(9590 lbs) 
4100kg 
(9039 lbs) 
3850kg 
(8488 lbs) 
25.2 m 
(82’ 8") 
23.2 m 
(76’ 1") 
21.0 m 
(68’ 11") 
18.7 m 
(61’ 4") 
16.4 m 
(53’ 10") 
14.1 m 
(46’ 3") 
11.8 m 
(38’ 9") 
9.7 m 
(31’ 10") 
7.7 m 
(25’ 3") 
36.5 MPa (365 bar) 
(5293 psi) 
endlos 
endless 
sans finite 
25.2 m 
31.2 m 
(82’ 8") 
(102’ 4") 
21.0 m 
(68’ 11") 
14900kg 
(32848 lbs) 
403.2 kNm (41.1 mt) 
(297270 ft.lbs) 
49.1 kNm (5.0 mt) 
PK 42002–SH A–G 
(36210 ft.lbs) 
37.3 kNm (3.8 mt) 
PK 42002–SH A–C 
(27510 ft.lbs) 

1GETR 2GETR
PK 42002–SH A – C
KT250 KT300 KT350 KT180



2GETR STZY



+V1
+V2
+2/4
7(F) 8(G) 6(E) 5(D) 4(C) 3(B) 2(A)



+V1
+V2







































(S410–SK–D)
DTS410SHC/03
0100
11/2010



PK 42002–SH
Type Model Modell
Page Page Seite
Chapitre Chapter Kapitel
Edition Edition Ausgabe



Öltank
Mehrgewicht: 
Alle Gewichtsangaben ohne Aufbauzubehör,Zusatzgeräte und Öl. 
Hydr. Ausschübe:
Max. Reichweite + Fly-Jib:
Max. Reichweite: 
Fördermenge der Pumpe: 
Betriebsdruck: 
Schwenkmoment: 
Schwenkbereich: 
Max. Reichweite: 
Max. hydraulische Reichweite: 
Max. Hubkraft: 
Max. Hubmoment:
Gewicht +V ohne 2/4
Krangewicht (R3X,STZS): 
Technische Daten 
Konstruktionsänderungen vorbehalten, fertigungstechn. Toleranzen müssen berücksichtigt werden. 
Oil tank
Excess weight: 
All weights given without assembly accessory,additional devices and oil. 
Hydr. boom extensions:
Max. outreach + Fly-Jib: 
Max. outreach: 
Pump capacity: 
Operating pressure:
Slewing torque: 
Slewing angle: 
Max. outreach: 
Max. hydraulic outreach: 
Max. lifting capacity: 
Lifting moment:
Weight +V without 2/4
Crane weight (R3X,STZS): 
Specifications 
Subject to change, production tolerances have to be taken into account. 
Réservoir
Excessif poids: 
Tous les poids sans huile ni accessoire de montage ni appareils accessoires 
Extensions hydrauliques:
Portee maximale + Fly-Jib: 
Max. portee: 
Debit de pompe: 
Pression d' utilisation:
Couple de rotation: 
Angle de rotation: 
Max. portee: 
Portee hydraulique maximale: 
Capacite maxi de levage:
Couple de levage:
Poids +V sans 2/4
Poids grue (R3X,STZS): 
Données Techniques 
Sous reserve de modifications de conception. Les tolerances relatives a la technique de production doivent etre prises en consideration.

As you see, "Lifting moment" and all the other missing labels are there now.



Source: https://stackoverflow.com/questions/33994717/pdfreadercontentparser-processcontent-returns-whitespace-for-clear-text
