How to get a particular paragraph from a PDF file using iTextSharp in C#?

孤独总比滥情好 2020-12-29 17:03

I am using iTextSharp in my C# WinForms application. I want to get a particular paragraph from a PDF file. Is this possible with iTextSharp?

1 answer
  • 2020-12-29 17:36

    Yes and no.

    First the no. The PDF format doesn't have a concept of text structures such as paragraphs, sentences or even words; it just has runs of text. The fact that two runs of text sit near each other, so that we read them as structured, is a human thing. When you see something that looks like a three-line paragraph in a PDF, the program that generated the PDF actually chopped the text into three unrelated lines and then drew each line at specific x,y coordinates. Even worse, depending on what the designer wants, each line of text could be composed of smaller runs that could be words or even single characters. So it might be draw "the cat in the hat" at 10,10, or it might be draw "t" at 10,10, then draw "h" at 14,10, then draw "e" at 18,10 and so on. This is actually pretty common with PDFs from design-heavy programs like Adobe InDesign.

    Now the yes. Actually, it's a maybe. If you are willing to put in a little work, you might be able to get iTextSharp to do what you are looking for. There is a class called PdfTextExtractor that has a method called GetTextFromPage that will get all of the raw text from a page. The last parameter to this method is an object that implements the ITextExtractionStrategy interface. If you create your own class that implements this interface, you can process each run of text and apply your own logic.

    In this interface there's a method called RenderText which gets called for every run of text. You'll be given an iTextSharp.text.pdf.parser.TextRenderInfo object from which you can get the raw text of the run as well as other things such as the coordinates it starts at, the current font, etc. Since a visual line of text can be composed of multiple runs, you can use this method to compare the run's baseline (the y coordinate of its starting point) with the previous run's to determine whether it is part of the same visual line.

    Below is an example of an implementation of that interface:

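        using System;
        using System.Collections.Generic;
        using System.Text;
        using iTextSharp.text.pdf.parser; //Vector, TextRenderInfo, ImageRenderInfo
    
        public class TextAsParagraphsExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy {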
            //Text buffer
            private StringBuilder result = new StringBuilder();
    
            //Store last used properties
            private Vector lastBaseLine;
    
            //Buffer of lines of text and their Y coordinates. NOTE, these should be exposed as properties instead of fields but are left as is for simplicity's sake
            public List<string> strings = new List<String>();
            public List<float> baselines = new List<float>();
    
            //This is called whenever a run of text is encountered
            public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
                //This code assumes that if the baseline changes then we're on a newline
                Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
    
                //See if the baseline has changed
                if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2])) {
                    //See if we have text and not just whitespace
                    if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                        //Mark the previous line as done by adding it to our buffers
                        this.baselines.Add(this.lastBaseLine[Vector.I2]);
                        this.strings.Add(this.result.ToString());
                    }
                    //Reset our "line" buffer
                    this.result.Clear();
                }
    
                //Append the current text to our line buffer
                this.result.Append(renderInfo.GetText());
    
                //Reset the last used line
                this.lastBaseLine = curBaseline;
            }
    
            public string GetResultantText() {
                //One last time, see if there's anything left in the buffer
                if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                    this.baselines.Add(this.lastBaseLine[Vector.I2]);
                    this.strings.Add(this.result.ToString());
                }
                //We're not going to use this method to return a string; instead, callers should inspect this class's strings and baselines fields afterwards.
                return null;
            }
    
            //Not needed, part of interface contract
            public void BeginTextBlock() { }
            public void EndTextBlock() { }
            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
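
    One caveat about the strict != comparison in RenderText above: it treats any difference in Y, however small, as a new line, so runs whose baselines differ by a fraction of a point may end up split. A minimal sketch of a tolerance-based check follows. The BaselineComparer class, the IsSameLine method and the 0.5 point default are all hypothetical and not part of iTextSharp; the right tolerance depends on your documents.

    ```csharp
    using System;

    public static class BaselineComparer {
        //Hypothetical helper: treats two baseline Y values as the same visual line
        //when they differ by no more than the given tolerance (in points).
        //The 0.5 default is an arbitrary assumption, tune it for your PDFs.
        public static bool IsSameLine(float previousY, float currentY, float tolerance = 0.5f) {
            return Math.Abs(previousY - currentY) <= tolerance;
        }
    }
    ```

    If you want to try it, the condition in RenderText could become !BaselineComparer.IsSameLine(lastBaseLine[Vector.I2], curBaseline[Vector.I2]).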
    

    To call it we'd do:

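            //workingFile is the path to the PDF to read
            PdfReader reader = new PdfReader(workingFile);
            TextAsParagraphsExtractionStrategy strategy = new TextAsParagraphsExtractionStrategy();
            //The returned string is ignored; GetResultantText() flushes the last line into the strategy's lists
            iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
            for (int i = 0; i < strategy.strings.Count; i++) {
                Console.WriteLine("Line {0,-5}: {1}", strategy.baselines[i], strategy.strings[i]);
            }
            reader.Close();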
    

    We're actually throwing away the return value of GetTextFromPage and instead inspecting the worker's baselines and strings list fields. The next step would be to compare the baselines and try to determine how to group lines together into paragraphs.
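
    As a rough sketch of that grouping step, one possible heuristic is to start a new paragraph whenever the gap between two consecutive baselines is noticeably larger than the normal line spacing. The ParagraphGrouper class, its Group method, and the normalGap and gapFactor parameters are hypothetical names made up for illustration. The code assumes the lines arrive in reading order from a single column on a single page, and it uses no iTextSharp types, so it can be fed the strings and baselines lists directly.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    public static class ParagraphGrouper {
        //Hypothetical helper: joins consecutive lines into paragraphs.
        //normalGap is the expected distance in points between baselines inside a paragraph.
        //A gap larger than normalGap * gapFactor is treated as a paragraph break.
        public static List<string> Group(IList<string> lines, IList<float> baselines, float normalGap, float gapFactor = 1.25f) {
            if (lines.Count != baselines.Count) {
                throw new ArgumentException("lines and baselines must have the same length");
            }
            List<string> paragraphs = new List<string>();
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < lines.Count; i++) {
                if (i > 0) {
                    //PDF Y coordinates usually decrease down the page, so compare absolute distances
                    float gap = Math.Abs(baselines[i - 1] - baselines[i]);
                    if (gap > normalGap * gapFactor) {
                        //Large gap: close the current paragraph and start a new one
                        paragraphs.Add(current.ToString());
                        current.Clear();
                    } else {
                        //Normal gap: same paragraph, separate the lines with a space
                        current.Append(' ');
                    }
                }
                current.Append(lines[i].Trim());
            }
            //Flush whatever is left in the buffer
            if (current.Length > 0) {
                paragraphs.Add(current.ToString());
            }
            return paragraphs;
        }
    }
    ```

    Called as ParagraphGrouper.Group(strategy.strings, strategy.baselines, 18f), where strategy is the TextAsParagraphsExtractionStrategy instance, this returns one string per detected paragraph. It only works when paragraph breaks really are taller than line breaks, which is not guaranteed.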

    I should note that not all paragraphs have spacing that's different from the spacing between individual lines of text. For instance, if you run the PDF created below through the code above, you'll see that every baseline is 18 points from the next, regardless of whether the line starts a new paragraph or not. If you open the PDF it creates in Acrobat and cover everything but the first letter of each line, you'll see that your eye can't even tell the difference between a line break and a paragraph break.

            using (FileStream fs = new FileStream(workingFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
                using (Document doc = new Document(PageSize.LETTER)) {
                    using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                        doc.Open();
                        doc.Add(new Paragraph("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna."));
                        doc.Add(new Paragraph("This"));
                        doc.Add(new Paragraph("Is"));
                        doc.Add(new Paragraph("A"));
                        doc.Add(new Paragraph("Test"));
                        doc.Close();
                    }
                }
            }
    