Reading text from PDF in .NET

说谎 2021-01-13 08:14

I am trying to read text from a PDF into a string using the iTextSharp library.

iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(@         


        
1 Answer
  • 2021-01-13 08:57

    The content stream of a PDF has no notion of "words", so iText(Sharp)'s text extraction implementation uses heuristics to decide how to group characters into words. When the distance between two characters is larger than half the width of a space in the current font, whitespace is inserted.

    Most likely, the text that gets extracted without whitespace has distances between the words that are smaller than "spacewidth / 2".

    In SimpleTextExtractionStrategy.RenderText():

    if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
        AppendTextChunk(' ');
    }
    

    You can extend SimpleTextExtractionStrategy and override RenderText() to use a different threshold.

    LocationTextExtractionStrategy makes this more convenient. You only need to override IsChunkAtWordBoundary():

    protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
        float dist = chunk.DistanceFromEndOf(previousChunk);
        if (dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
            return true;

        return false;
    }
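As a rough sketch of what such an override might look like, the subclass below replaces the fixed "/ 2" with a configurable factor. It assumes an iTextSharp 5.x release in which IsChunkAtWordBoundary is declared virtual and TextChunk exposes DistanceFromEndOf and CharSpaceWidth as in the snippet above; check this against the version you have. The names TunableLocationStrategy, PdfTextDemo, ExtractPage and the factor parameter are made up for illustration.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical subclass; the class name and "factor" are illustrative only.
public class TunableLocationStrategy : LocationTextExtractionStrategy
{
    private readonly float factor;

    // factor = 0.5f reproduces the "spacewidth / 2" rule quoted above.
    public TunableLocationStrategy(float factor)
    {
        this.factor = factor;
    }

    // Assumes the base method is virtual in the installed iTextSharp version.
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        float dist = chunk.DistanceFromEndOf(previousChunk);
        // Same test as above, with the fixed 1/2 replaced by the factor.
        return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth * factor;
    }
}

public static class PdfTextDemo
{
    // Extracts the text of one page using the tunable strategy.
    public static string ExtractPage(string path, int pageNumber, float factor)
    {
        var reader = new PdfReader(path);
        try
        {
            return PdfTextExtractor.GetTextFromPage(
                reader, pageNumber, new TunableLocationStrategy(factor));
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling PdfTextDemo.ExtractPage with a few different factors on the same page is one way to compare results. A fresh strategy instance is created per call because the strategy accumulates the chunks it has seen.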
    

    You'll have to experiment a bit to get good results for your PDFs. "spacewidth / 2" is apparently too large in your case. But if you adjust it to be too small, you'll get false positives: whitespace will be inserted within words.
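To make the trade-off concrete, here is a small library-free sketch of the same gap test. It does not use iTextSharp: Glyph, SpacingDemo, Join and Sample are hypothetical names, and the coordinates are invented so that two words sit closer together than half a space width.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; not an iTextSharp type.
public sealed class Glyph
{
    public string Text { get; }
    public float Start { get; }  // x coordinate where the glyph begins
    public float End { get; }    // x coordinate where the glyph ends

    public Glyph(string text, float start, float end)
    {
        Text = text;
        Start = start;
        End = end;
    }
}

public static class SpacingDemo
{
    // Joins glyphs left to right, inserting a space whenever the gap to the
    // previous glyph exceeds spaceWidth * factor (0.5 mirrors "spacewidth / 2").
    public static string Join(IReadOnlyList<Glyph> glyphs, float spaceWidth, float factor)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < glyphs.Count; i++)
        {
            if (i > 0)
            {
                float gap = glyphs[i].Start - glyphs[i - 1].End;
                if (gap > spaceWidth * factor)
                    sb.Append(' ');
            }
            sb.Append(glyphs[i].Text);
        }
        return sb.ToString();
    }

    // Invented layout: "Hi" and "yo" separated by a tight 2-unit word gap,
    // with 0.5-unit gaps between the letters inside each word.
    public static List<Glyph> Sample()
    {
        return new List<Glyph>
        {
            new Glyph("H", 0f, 5f),
            new Glyph("i", 5.5f, 8f),
            new Glyph("y", 10f, 15f),
            new Glyph("o", 15.5f, 20f),
        };
    }
}
```

With a space width of 5, SpacingDemo.Join(SpacingDemo.Sample(), 5f, 0.5f) returns "Hiyo" because the 2-unit word gap is below the 2.5 threshold. A factor of 0.25f gives "Hi yo", and a factor of 0.05f gives "H i y o", which is the false-positive case described above.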
