I am trying to read text from a PDF into a string using the iTextSharp library.
iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(@
In the content stream of a PDF there is no notion of "words", so iText(Sharp)'s text extraction implementation uses heuristics to decide how to group characters into words. When the distance between two characters is larger than half the width of a space in the current font, whitespace is inserted.
Most likely, the text that gets extracted without whitespace has distances between the words that are smaller than "spacewidth / 2".
In SimpleTextExtractionStrategy.RenderText():
    if (spacing > renderInfo.GetSingleSpaceWidth() / 2f) {
        AppendTextChunk(' ');
    }
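To see why two words can end up glued together, here is a minimal, library-free sketch of that kind of check. The `Glyph` type, the `SpaceHeuristicDemo.Join` helper, and the `factor` parameter are made up for illustration and are not part of iTextSharp; only the "gap larger than a fraction of the space width" comparison mirrors the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; not an iTextSharp type.
public sealed class Glyph
{
    public string Text { get; }
    public float Start { get; }  // x-coordinate where the text begins
    public float End { get; }    // x-coordinate where the text ends

    public Glyph(string text, float start, float end)
    {
        Text = text;
        Start = start;
        End = end;
    }
}

public static class SpaceHeuristicDemo
{
    // Concatenates the chunks in order and inserts a space whenever the gap
    // to the previous chunk is larger than spaceWidth * factor.
    public static string Join(IReadOnlyList<Glyph> chunks, float spaceWidth, float factor)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chunks.Count; i++)
        {
            if (i > 0)
            {
                float gap = chunks[i].Start - chunks[i - 1].End;
                if (gap > spaceWidth * factor)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(chunks[i].Text);
        }
        return sb.ToString();
    }
}
```

With a space width of 3 and a gap of 1 between "Hello" and "World", a factor of 0.5 gives a threshold of 1.5, so no space is inserted. Lowering the factor to 0.25 gives a threshold of 0.75, and the space appears.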
You can extend SimpleTextExtractionStrategy and adjust RenderText().
In LocationTextExtractionStrategy it is more convenient. You only need to override IsChunkAtWordBoundary():
    protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
        float dist = chunk.DistanceFromEndOf(previousChunk);
        if (dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
            return true;
        return false;
    }
You'll have to experiment a bit to get good results for your PDFs. "spacewidth / 2" is apparently too large in your case. But if you adjust it to be too small, you'll get false positives: whitespace will be inserted within words.
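To make that experimenting easier, one option is a subclass that takes the threshold as a constructor argument. The sketch below assumes an iTextSharp 5.x version in which IsChunkAtWordBoundary() is virtual and TextChunk exposes DistanceFromEndOf() and CharSpaceWidth as shown above; the class name `TunableLocationTextExtractionStrategy`, the `spaceFactor` parameter, and the `ExtractionExample.ExtractPage` helper are made up for illustration.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical subclass: same check as the default implementation quoted above,
// but with a configurable fraction of the space width instead of the fixed 1/2.
public class TunableLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    private readonly float spaceFactor;

    public TunableLocationTextExtractionStrategy(float spaceFactor)
    {
        this.spaceFactor = spaceFactor;
    }

    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        float dist = chunk.DistanceFromEndOf(previousChunk);
        return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth * spaceFactor;
    }
}

public static class ExtractionExample
{
    // Extracts the text of one page using the tunable strategy.
    public static string ExtractPage(string path, int pageNumber, float spaceFactor)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            var strategy = new TunableLocationTextExtractionStrategy(spaceFactor);
            return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

You could then call ExtractPage with 0.5f, 0.3f, 0.2f and so on for the same page and compare the output until words are separated without being split internally.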