iTextSharp extract each character and getRectangle

Submitted by 夏天 on 2019-12-25 03:38:27

Question


I would like to parse an entire PDF character by character and get the ASCII value, the font, and the Rectangle of each character in the PDF document, which I can later use to save the character as a bitmap. I tried PdfTextExtractor.GetTextFromPage, but that returns all of the text as a single string.


Answer 1:


The text extraction strategies bundled with iTextSharp (in particular the LocationTextExtractionStrategy, which the PdfTextExtractor.GetTextFromPage overload without a strategy argument uses by default) only allow direct access to the collected plain text, not to the positions.

Chris Haas' MyLocationTextExtractionStrategy

In an older answer, Chris Haas presents the following extension of LocationTextExtractionStrategy:

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

It makes use of this helper class:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

This strategy makes the text chunks and their enclosing rectangles available in the public member List<RectAndText> myPoints, which you can access like this:

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

For your task (parsing an entire PDF character by character and getting the ASCII value, the font, and the Rectangle of each character), only two details are missing here:

  • the text chunks returned like that may contain multiple characters
  • the font information is not provided.

Thus, we have to tweak that a bit:

A new CharLocationTextExtractionStrategy

Going beyond what MyLocationTextExtractionStrategy does, CharLocationTextExtractionStrategy splits each text chunk into its glyphs and also records the font name:

public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    //Hold each coordinate
    public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);

        foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
        {
            //Get the bounding box for this single character
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                                                    bottomLeft[Vector.I1],
                                                    bottomLeft[Vector.I2],
                                                    topRight[Vector.I1],
                                                    topRight[Vector.I2]
                                                    );

            //Add this to our main collection
            this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
        }
    }
}

//Helper class that stores our rectangle, text, and font
public class RectAndTextAndFont
{
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public String Font;
    public RectAndTextAndFont(iTextSharp.text.Rectangle rect, String text, String font)
    {
        this.Rect = rect;
        this.Text = text;
        this.Font = font;
    }
}

Using this strategy like this

CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

using (var pdfReader = new PdfReader(testFile))
{
    PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
}

foreach (var p in strategy.myPoints)
{
    Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
}

you get the information character by character, including the font name.
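The question also asks for the whole document and for the character value, while the usage above only reads page 1 and prints the text. A minimal sketch of how that could be covered, assuming the CharLocationTextExtractionStrategy and RectAndTextAndFont classes from this answer are compiled alongside it. The names CharDump, CodePoints and DumpAllPages are made up for this illustration and are not part of iTextSharp:

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class CharDump
{
    // Hypothetical helper: Unicode code points of the text reported for one glyph.
    public static List<int> CodePoints(string text)
    {
        var result = new List<int>();
        if (string.IsNullOrEmpty(text)) return result;
        for (int i = 0; i < text.Length; i++)
        {
            bool isPair = char.IsHighSurrogate(text[i])
                          && i + 1 < text.Length
                          && char.IsLowSurrogate(text[i + 1]);
            if (isPair)
            {
                result.Add(char.ConvertToUtf32(text[i], text[i + 1]));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                result.Add(text[i]); // char widens implicitly to int
            }
        }
        return result;
    }

    // Hypothetical helper: prints every glyph of every page of the file.
    public static void DumpAllPages(string path)
    {
        using (var pdfReader = new PdfReader(path))
        {
            // Page numbers are 1-based, as in the GetTextFromPage calls above
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Fresh strategy per page so myPoints holds only this page's glyphs
                var strategy = new CharLocationTextExtractionStrategy();
                PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                foreach (var p in strategy.myPoints)
                {
                    Console.WriteLine(string.Format(
                        "page {0}: <{1}> codes [{2}] in {3}, box ({4}, {5}) to ({6}, {7})",
                        page, p.Text, string.Join(",", CodePoints(p.Text)), p.Font,
                        p.Rect.Left, p.Rect.Bottom, p.Rect.Right, p.Rect.Top));
                }
            }
        }
    }
}
```

Only code values below 128 are ASCII. Anything else is a Unicode code point, and the text reported for a single glyph may be empty or longer than one character, which is why the helper returns a list. The rectangle coordinates are in PDF user space (by default 1/72 inch per unit, with the y axis pointing upward), so they would still need to be scaled and flipped vertically before being used as pixel coordinates in a bitmap.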



Source: https://stackoverflow.com/questions/34917572/itextsharp-extract-each-character-and-getrectangle
