iTextSharp extract each character and getRectangle

Question

I would like to parse an entire PDF character by character and, for each character, get its ASCII value, its font, and its bounding rectangle on the page, which I can later use to save the character as a bitmap. I tried PdfTextExtractor.GetTextFromPage, but that only returns the text of the page as a single string.

Answer 1:

The text extraction strategies bundled with iTextSharp (in particular the LocationTextExtractionStrategy, which the PdfTextExtractor.GetTextFromPage overload without a strategy argument uses by default) only allow direct access to the collected plain text, not to the positions.

Chris Haas' `MyLocationTextExtractionStrategy`

In an older answer, @Chris Haas presents the following extension of the LocationTextExtractionStrategy:

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

which makes use of this helper class:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

This strategy makes the text chunks and their enclosing rectangles available in the public member List<RectAndText> myPoints, which you can access like this:

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

For your task (parsing an entire PDF character by character and getting the ASCII value, font, and rectangle of each character), this falls short in only two respects:

- the text chunks returned this way may contain multiple characters;
- the font information is not provided.

Thus, we have to tweak that a bit:

A new `CharLocationTextExtractionStrategy`

Going beyond what MyLocationTextExtractionStrategy does, the CharLocationTextExtractionStrategy splits each text chunk into its individual glyphs and also records the font name:

public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    //Hold each coordinate
    public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);

        foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
        {
            //Get the bounding box for this single glyph
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                                                    bottomLeft[Vector.I1],
                                                    bottomLeft[Vector.I2],
                                                    topRight[Vector.I1],
                                                    topRight[Vector.I2]
                                                    );

            //Add this to our main collection
            this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
        }
    }
}

//Helper class that stores our rectangle, text, and font
public class RectAndTextAndFont
{
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public String Font;
    public RectAndTextAndFont(iTextSharp.text.Rectangle rect, String text, String font)
    {
        this.Rect = rect;
        this.Text = text;
        this.Font = font;
    }
}

Using this strategy like this

CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

using (var pdfReader = new PdfReader(testFile))
{
    PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
}

foreach (var p in strategy.myPoints)
{
    Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
}

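you get the information character by character, including the font name. The example only parses page 1; to cover the entire document, repeat the GetTextFromPage call inside the using block for every page from 1 to pdfReader.NumberOfPages, with a fresh strategy instance per page so the results of different pages stay separate.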
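The question also asks for the ASCII value of each character and for a rectangle that can be used to cut the character out of a bitmap. Neither is covered by the strategy itself, so here is a minimal sketch of both steps. The class name GlyphBoxHelper and its two methods are hypothetical helpers, not part of iTextSharp. The sketch assumes an unrotated page whose lower-left corner is at the origin, and a page bitmap rendered at a known resolution by some other tool. PDF user space uses 1/72 inch units by default, with the origin at the bottom left and y growing upward, whereas bitmaps have the origin at the top left, so the y axis has to be flipped.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class GlyphBoxHelper
{
    // Converts a rectangle in PDF user space (points, origin bottom-left)
    // into a pixel box { x, y, width, height } with the origin top-left.
    // Assumes an unrotated page whose lower-left corner is at (0, 0).
    public static int[] ToPixelBox(float left, float bottom, float right, float top,
                                   float pageHeight, float dpi)
    {
        float scale = dpi / 72f; // 72 points per inch

        // Round outward so the box never cuts into the glyph
        int x = (int)Math.Floor(left * scale);
        int y = (int)Math.Floor((pageHeight - top) * scale);
        int x2 = (int)Math.Ceiling(right * scale);
        int y2 = (int)Math.Ceiling((pageHeight - bottom) * scale);

        return new[] { x, y, x2 - x, y2 - y };
    }

    // Returns the Unicode code point of the first character in the text,
    // or -1 for null or empty text. For plain ASCII characters this is
    // the ASCII value.
    public static int FirstCodePoint(string text)
    {
        return string.IsNullOrEmpty(text) ? -1 : char.ConvertToUtf32(text, 0);
    }
}
```

With the strategy above, this could be called as GlyphBoxHelper.ToPixelBox(p.Rect.Left, p.Rect.Bottom, p.Rect.Right, p.Rect.Top, pageHeight, dpi), where pageHeight would come from something like pdfReader.GetPageSize(1).Height, and as GlyphBoxHelper.FirstCodePoint(p.Text). Keep in mind that the extracted text is Unicode, so characters outside the ASCII range yield values above 127, and that the ascent and descent lines describe the font's nominal box rather than the exact ink of the glyph.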

Source: https://stackoverflow.com/questions/34917572/itextsharp-extract-each-character-and-getrectangle

Tags

itextsharp

pdf-extraction
