Extracting text from a rectangle using iText (.NET) gives me the entire line

Submitted by …衆ロ難τιáo~ on 2020-12-06 19:19:51

Question


Below is the code (using iText for .NET, version 7.0.4.0) that I am using to extract text from a PDF. In my testing it works well for most PDFs, extracting only the content within the rectangle. For a few of them, however, it returns the entire line. I know that the filter passes through every text snippet that intersects the rectangle (so part of the text may lie outside it; iText doesn't cut text snippets into pieces).

But I want to understand which parameter in the PDF iText uses to split the text into snippets.

        var reader = new PdfReader( filePath );
        PdfDocument pdfDoc = new PdfDocument( reader );

        var addressRect = new Rectangle( 33, 190, 70, 42 ); // x, y, width, height

        var addressRegionFilter = new TextRegionEventFilter( addressRect );
        var filterListener = new FilteredTextEventListener( new LocationTextExtractionStrategy(), addressRegionFilter );
        var addressText = PdfTextExtractor.GetTextFromPage( pdfDoc.GetPage( 1 ), filterListener );

        pdfDoc.Close();

Answer 1:


This should do the trick.

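import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.CharacterRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.ITextExtractionStrategy;

import java.util.Set;

class RectangleTextExtractionStrategy implements ITextExtractionStrategy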
{

    private ITextExtractionStrategy innerStrategy = null;
    private Rectangle rectangle;

    public RectangleTextExtractionStrategy(ITextExtractionStrategy strategy, Rectangle rectangle)
    {
        this.innerStrategy = strategy;
        this.rectangle = rectangle;
    }

    @Override
    public String getResultantText() {
        return innerStrategy.getResultantText();
    }

    @Override
    public void eventOccurred(IEventData iEventData, EventType eventType) {
        if(eventType != EventType.RENDER_TEXT)
            return;
        TextRenderInfo tri = (TextRenderInfo) iEventData;
        for(TextRenderInfo subTri : tri.getCharacterRenderInfos())
        {
            Rectangle r2 = new CharacterRenderInfo(subTri).getBoundingBox();
            if(intersects(r2))
               innerStrategy.eventOccurred(subTri, EventType.RENDER_TEXT);
        }
    }

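    private boolean intersects(Rectangle other)
    {
        // Axis-aligned overlap test between a character's box and the search region.
        return other.getLeft() < rectangle.getRight()
            && other.getRight() > rectangle.getLeft()
            && other.getBottom() < rectangle.getTop()
            && other.getTop() > rectangle.getBottom();
    }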

    @Override
    public Set<EventType> getSupportedEvents() {
        return innerStrategy.getSupportedEvents();
    }
}

The idea here is to split all incoming TextRenderInfo objects into the corresponding events for their characters. Then (if they are in the search region) we delegate the call to another ITextExtractionStrategy.
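
To make the difference between filtering whole chunks and filtering individual characters concrete, here is a minimal sketch in plain Java that does not use iText at all. The names Box, Glyph, layout, filterByChunk and filterByCharacter are hypothetical and exist only for this illustration, and the fixed-width glyph layout is an assumption chosen to keep the geometry simple.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkVsCharFilterSketch {

    /** Hypothetical axis-aligned box; stands in for a PDF rectangle. */
    static final class Box {
        final float left, bottom, right, top;

        Box(float left, float bottom, float right, float top) {
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }

        /** True when the two boxes share some interior area. */
        boolean intersects(Box o) {
            return left < o.right && right > o.left && bottom < o.top && top > o.bottom;
        }
    }

    /** Hypothetical glyph: one character plus its bounding box. */
    static final class Glyph {
        final char ch;
        final Box box;

        Glyph(char ch, Box box) {
            this.ch = ch;
            this.box = box;
        }
    }

    /** Lays out text as ONE chunk of fixed-width glyphs (an assumption for simplicity). */
    static List<Glyph> layout(String text, float x, float y, float glyphWidth, float glyphHeight) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            float left = x + i * glyphWidth;
            glyphs.add(new Glyph(text.charAt(i), new Box(left, y, left + glyphWidth, y + glyphHeight)));
        }
        return glyphs;
    }

    /** Chunk-level filtering: the whole chunk is kept if its overall box touches the region. */
    static String filterByChunk(List<Glyph> chunk, Box region) {
        if (chunk.isEmpty()) {
            return "";
        }
        Box first = chunk.get(0).box;
        Box last = chunk.get(chunk.size() - 1).box;
        Box whole = new Box(first.left, first.bottom, last.right, last.top);
        if (!whole.intersects(region)) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : chunk) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** Character-level filtering: each glyph is tested against the region on its own. */
    static String filterByCharacter(List<Glyph> chunk, Box region) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : chunk) {
            if (g.box.intersects(region)) {
                sb.append(g.ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Box region = new Box(0f, 0f, 50f, 12f);
        List<Glyph> chunk = layout("HELLO WORLD", 0f, 0f, 10f, 12f);
        System.out.println(filterByChunk(chunk, region));     // HELLO WORLD
        System.out.println(filterByCharacter(chunk, region)); // HELLO
    }
}
```

The chunk-level variant mirrors the behaviour described in the question: a chunk that merely touches the region comes back whole. The character-level variant corresponds to the approach in the answer, at the cost of one intersection test per character.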



Source: https://stackoverflow.com/questions/46396905/extracting-text-from-a-rectangle-using-itext-net-does-give-me-the-entire-li
