Question
I would like to parse an entire PDF character by character and get the ASCII value, the font, and the rectangle of each character in the PDF document, which I can later use to save as a bitmap. I tried using PdfTextExtractor.GetTextFromPage, but that returns the entire text as a string.
Answer 1:
The text extraction strategies bundled with iTextSharp (in particular the LocationTextExtractionStrategy used by default by the PdfTextExtractor.GetTextFromPage overload without a strategy argument) only allow direct access to the collected plain text, not to positions.
Chris Haas' MyLocationTextExtractionStrategy
In an older answer, Chris Haas presents the following extension of LocationTextExtractionStrategy:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]
        );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}
which makes use of this helper class:
//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;

    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}
This strategy makes the text chunks and their enclosing rectangles available in the public member List<RectAndText> myPoints, which you can access like this:
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}
For your task, to parse an entire PDF character by character and get the ASCII value, font, and rectangle of each character, this strategy falls short in only two respects:
- the text chunks returned like that may contain multiple characters;
- the font information is not provided.
Thus, we have to tweak it a bit:
A new CharLocationTextExtractionStrategy
Going beyond what the MyLocationTextExtractionStrategy class does, the CharLocationTextExtractionStrategy class splits the input by glyph and also provides the font name:
public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    //Hold each character's rectangle, text, and font name
    public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);

        //Split the chunk into one TextRenderInfo per character
        foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
        {
            //Get the bounding box for this single character
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );

            //Add this to our main collection
            this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
        }
    }
}

//Helper class that stores our rectangle, text, and font
public class RectAndTextAndFont
{
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public String Font;

    public RectAndTextAndFont(iTextSharp.text.Rectangle rect, String text, String font)
    {
        this.Rect = rect;
        this.Text = text;
        this.Font = font;
    }
}
Using this strategy like this
CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

using (var pdfReader = new PdfReader(testFile))
{
    PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
}

foreach (var p in strategy.myPoints)
{
    Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
}
you get the information per character, including the font name.
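The question also asks for the ASCII value of each character, which the strategies above do not return directly; they only return the text of each glyph as a string. The following is a minimal sketch of how that string could be turned into numeric values. The CharCodes class and its CodePoints method are hypothetical names introduced for illustration and are not part of iTextSharp. The values are Unicode code points, which coincide with ASCII only below 128, and the sketch assumes that a glyph's text may be empty or may hold more than one char (for example, a ligature or a surrogate pair).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
public static class CharCodes
{
    // Returns the Unicode code points contained in a glyph's text.
    // An empty or null string yields an empty list.
    public static List<int> CodePoints(string text)
    {
        var result = new List<int>();
        if (string.IsNullOrEmpty(text))
        {
            return result;
        }

        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // Two UTF-16 code units form one code point
                result.Add(char.ConvertToUtf32(text, i));
                i++; // skip the low surrogate
            }
            else
            {
                // Single UTF-16 code unit; below 128 this is the ASCII value
                result.Add((int)text[i]);
            }
        }
        return result;
    }
}
```

Inside the foreach loop shown earlier, CharCodes.CodePoints(p.Text) would then give the numeric values for each entry. To cover the entire document rather than page 1 only, the GetTextFromPage call can be repeated for page numbers 1 through pdfReader.NumberOfPages, using a new strategy instance per page so that each myPoints list stays tied to a single page.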
Source: https://stackoverflow.com/questions/34917572/itextsharp-extract-each-character-and-getrectangle