Get text occurrences contained in a specified area with iTextSharp

小鲜肉 2021-02-10 00:12

Is it possible, using iTextSharp, to get all text occurrences contained in a specified area of a PDF document?

2 answers
  •  离开以前
    2021-02-10 00:43

    @BrunoLowagie gives an excellent answer, but something I really struggled with was getting the actual coordinates to use. I started out by using Cursor Coordinates in Adobe Acrobat Pro.

    From there I could read the coordinates in inches and convert them to DTP points (PostScript points) by multiplying each value by 72.

    However, something was still not right: the Y value seemed way off. I then noticed that Adobe Acrobat measures coordinates in this view from the top left, whereas PDF coordinates, which iTextSharp expects, are measured from the bottom left. This means that Y has to be recalculated by subtracting it from the page height.

    I solved this in code like this:

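    using System.util;                 // RectangleJ (ships with iTextSharp)
    using iTextSharp.text;             // Rectangle
    using iTextSharp.text.pdf;         // PdfReader
    using iTextSharp.text.pdf.parser;  // text extraction classes

    // RectangleJ takes x, y, width, height; the values below were read in inches.
    var rect = new RectangleJ(GetPostScriptPoints(4.19f),
        GetPostScriptPoints(GetInverseCoordinateInInches(pdfReader, 1, 1.42f)),
        GetPostScriptPoints(3.5f), GetPostScriptPoints(0.39f));

    // Only text that falls inside rect is passed on to the extraction strategy.
    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filter);
    var output = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);

    // 1 inch = 72 PostScript points.
    private float GetPostScriptPoints(float inch)
    {
        return inch * 72;
    }

    // Converts a Y measured from the top of the page into a Y measured from the bottom.
    private float GetInverseCoordinateInInches(PdfReader pdfReader, int pageIndex, float coordinateInInches)
    {
        Rectangle mediabox = pdfReader.GetPageSize(pageIndex);
        return mediabox.Height / 72 - coordinateInInches;
    }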
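
    The conversion can also be written as one self-contained helper. This is only a sketch: the class and method names are hypothetical, and it assumes the Y value read from Acrobat is the top edge of the box, so it subtracts the box height as well to arrive at the lower edge. The snippet above does not subtract the height, so check which edge your measurement refers to before relying on either version.

```csharp
public static class AcrobatCoordinates
{
    // 1 inch = 72 PostScript points.
    private const float PointsPerInch = 72f;

    // Hypothetical helper: takes a box measured in inches from the top left
    // (x, top edge, width, height) plus the page height in points, and returns
    // { x, y, width, height } in points measured from the bottom left.
    public static float[] ToPdfPoints(float xInches, float topInches,
        float widthInches, float heightInches, float pageHeightPoints)
    {
        float x = xInches * PointsPerInch;
        float width = widthInches * PointsPerInch;
        float height = heightInches * PointsPerInch;

        // Assumption: topInches is the top edge of the box, so the lower edge
        // sits one box height further down the page.
        float y = pageHeightPoints - topInches * PointsPerInch - height;

        return new[] { x, y, width, height };
    }
}
```

    The four returned values can be passed straight to new RectangleJ(x, y, width, height), with the page height taken from pdfReader.GetPageSize(page).Height.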
    

    This worked, but I think it looks a little messy. I then used the Prepare Form tool in Adobe Acrobat Pro, where the Y coordinate showed up correctly in Text Field Properties. It could also show the box dimensions in points right away.

    This means I could write code like this instead:

    var rect = new RectangleJ(301.68f, 738f, 252f, 28.08f);
    
    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filter);
    var output = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
    

    This was a lot cleaner and faster, so this was the way I chose to do it in the end.

    See this answer if you would like to get a value from a specific location for every page in the document:

    https://stackoverflow.com/a/20959388/3850405
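
    As a rough idea of what reading the same region on every page could look like, here is a minimal sketch. It is not taken from the linked answer: the class name, method name and pdfPath parameter are hypothetical, and it assumes the region sits at the same coordinates on every page.

```csharp
using System.Collections.Generic;
using System.util;                 // RectangleJ (ships with iTextSharp)
using iTextSharp.text.pdf;         // PdfReader
using iTextSharp.text.pdf.parser;  // text extraction classes

public static class RegionTextReader
{
    // Hypothetical helper: returns the text found inside region, one entry per page.
    public static List<string> ReadRegionFromAllPages(string pdfPath, RectangleJ region)
    {
        var results = new List<string>();
        var pdfReader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Build a fresh strategy per page so that text collected on
                // earlier pages does not carry over.
                RenderFilter[] filter = { new RegionTextRenderFilter(region) };
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(), filter);
                results.Add(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            pdfReader.Close();
        }
        return results;
    }
}
```

    It could be called as RegionTextReader.ReadRegionFromAllPages(path, new RectangleJ(301.68f, 738f, 252f, 28.08f)), reusing the rectangle from the second snippet above.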
