I am using iTextSharp in my C# WinForms application. I want to get a particular paragraph from a PDF file. Is this possible in iTextSharp?
Yes and no.
First the no. The PDF format has no concept of text structures such as paragraphs, sentences or even words; it just has runs of text. The fact that two runs of text sit near each other so that we read them as structured is a human interpretation. When you see something that looks like a three-line paragraph in a PDF, the program that generated the PDF actually chopped the text into three unrelated lines and then drew each line at specific x,y coordinates. Even worse, depending on what the designer wants, each line of text could be composed of smaller runs that are words or even single characters. So the instruction might be: draw "the cat in the hat" at 10,10
or it might be: draw "t" at 10,10, then draw "h" at 14,10, then draw "e" at 18,10
and so on. This is actually pretty common with PDFs from design-heavy programs like Adobe InDesign.
Now the yes. Actually, it's a maybe. If you are willing to put in a little work, you might be able to get iTextSharp to do what you are looking for. There is a class called PdfTextExtractor
that has a method called GetTextFromPage
that will get all of the raw text from a page. The last parameter to this method is an object that implements the ITextExtractionStrategy
interface. If you create your own class that implements this interface you can process each run of text and perform your own logic.
In this interface there's a method called RenderText
which gets called for every run of text. You'll be given an iTextSharp.text.pdf.parser.TextRenderInfo
object from which you can get the raw text of the run as well as other things such as the coordinates it starts at, the current font, etc. Since a visual line of text can be composed of multiple runs, you can use this method to compare the run's baseline (specifically the y coordinate of its starting point) to the previous run's to determine whether it is part of the same visual line.
Below is an example of an implementation of that interface:
using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;
public class TextAsParagraphsExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy {
    //Text buffer for the visual line currently being assembled
    private StringBuilder result = new StringBuilder();
    //Start point of the baseline of the last run we saw
    private Vector lastBaseLine;
    //Buffers of lines of text and their Y coordinates. NOTE: these should be exposed as properties instead of fields but are left as is for simplicity's sake
    public List<string> strings = new List<string>();
    public List<float> baselines = new List<float>();
    //This is called whenever a run of text is encountered
    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
        //This code assumes that if the baseline changes then we're on a new line
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
        //See if the baseline's Y coordinate (Vector.I2) has changed
        if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2])) {
            //See if we have text and not just whitespace
            if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                //Mark the previous line as done by adding it to our buffers
                this.baselines.Add(this.lastBaseLine[Vector.I2]);
                this.strings.Add(this.result.ToString());
            }
            //Reset our "line" buffer
            this.result.Clear();
        }
        //Append the current text to our line buffer
        this.result.Append(renderInfo.GetText());
        //Remember the baseline of this run for the next comparison
        this.lastBaseLine = curBaseline;
    }
    public string GetResultantText() {
        //One last time, see if there's anything left in the buffer
        if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
            this.baselines.Add(this.lastBaseLine[Vector.I2]);
            this.strings.Add(this.result.ToString());
            //Clear the buffer so that a second call doesn't add the last line twice
            this.result.Clear();
        }
        //We're not going to use this method to return a string. Instead, callers should inspect this class's strings and baselines fields afterwards.
        return null;
    }
    //Not needed, part of interface contract
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
To call it we'd do:
PdfReader reader = new PdfReader(workingFile);
TextAsParagraphsExtractionStrategy S = new TextAsParagraphsExtractionStrategy();
iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
for (int i = 0; i < S.strings.Count; i++) {
    Console.WriteLine("Line {0,-5}: {1}", S.baselines[i], S.strings[i]);
}
We're actually throwing away the value from GetTextFromPage
and instead inspecting the worker's baselines
and strings
list fields. The next step would be to compare the baselines and try to determine how to group lines together into paragraphs.
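As a rough sketch of that next step, the code below groups lines into paragraphs by looking at the gaps between consecutive baselines. None of this is part of iTextSharp: ParagraphGrouper, Group and gapFactor are hypothetical names made up for illustration. It assumes that the smallest gap on the page is the normal line spacing and that a paragraph break shows up as a noticeably larger gap, which will not hold for every PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class ParagraphGrouper {
    // strings and baselines are the two parallel lists filled by the extraction strategy.
    // ASSUMPTION: a gap larger than gapFactor times the smallest gap is a paragraph break.
    public static List<string> Group(IList<string> strings, IList<float> baselines, float gapFactor = 1.5f) {
        var paragraphs = new List<string>();
        if (strings.Count == 0) return paragraphs;

        // Treat the smallest non-zero gap between consecutive baselines as the normal line spacing
        float leading = float.MaxValue;
        for (int i = 1; i < baselines.Count; i++) {
            float gap = Math.Abs(baselines[i - 1] - baselines[i]);
            if (gap > 0 && gap < leading) leading = gap;
        }

        var current = new StringBuilder(strings[0]);
        for (int i = 1; i < strings.Count; i++) {
            float gap = Math.Abs(baselines[i - 1] - baselines[i]);
            if (gap > leading * gapFactor) {
                // Unusually large gap: close the current paragraph and start a new one
                paragraphs.Add(current.ToString());
                current.Clear();
            } else {
                // Normal gap: the line continues the current paragraph
                current.Append(' ');
            }
            current.Append(strings[i]);
        }
        paragraphs.Add(current.ToString());
        return paragraphs;
    }
}
```

You would call it as ParagraphGrouper.Group(S.strings, S.baselines) once GetTextFromPage has run. If every line on the page is spaced the same distance apart, no gap exceeds the threshold and everything comes back as a single paragraph, so treat this as a starting point rather than a reliable detector.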
I should note that not all paragraphs have spacing that differs from the spacing between individual lines of text. For instance, if you run the PDF created below through the code above, you'll see that every line of text is 18 points from the next, regardless of whether the line starts a new paragraph or not. If you open the PDF it creates in Acrobat and cover everything but the first letter of each line, you'll see that your eye can't even tell the difference between a line break and a paragraph break.
using (FileStream fs = new FileStream(workingFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (Document doc = new Document(PageSize.LETTER)) {
        using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();
            doc.Add(new Paragraph("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna."));
            doc.Add(new Paragraph("This"));
            doc.Add(new Paragraph("Is"));
            doc.Add(new Paragraph("A"));
            doc.Add(new Paragraph("Test"));
            doc.Close();
        }
    }
}