Reading text from PDF in .NET

I am trying to read text from a PDF into a string using the iTextSharp library.

iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(@


                      
1 Answer

臣服心动, 2021-01-13 08:57

The content stream of a PDF has no notion of "words", so the text extraction implementation in iText(Sharp) uses heuristics to decide how to group characters into words. When the distance between two characters is larger than half the width of a space in the current font, whitespace is inserted.

Most likely, the text that gets extracted without whitespace has distances between the words that are smaller than "spacewidth / 2".
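
To make the heuristic concrete, here is a minimal, library-free sketch of the same idea. It is not iTextSharp code: the `Glyph` type, the `GapHeuristic.Join` method and the `factor` parameter are hypothetical names introduced only for illustration, and the model is simplified to glyphs on a single horizontal line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical model of one positioned glyph: the character and the
// horizontal range it occupies on the line.
public struct Glyph
{
    public char Ch;
    public float StartX;
    public float EndX;

    public Glyph(char ch, float startX, float endX)
    {
        Ch = ch;
        StartX = startX;
        EndX = endX;
    }
}

public static class GapHeuristic
{
    // Joins glyphs in order, inserting a space whenever the gap to the
    // previous glyph exceeds spaceWidth * factor. A factor of 0.5
    // corresponds to the "spacewidth / 2" rule described above.
    public static string Join(IList<Glyph> glyphs, float spaceWidth, float factor = 0.5f)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < glyphs.Count; i++)
        {
            if (i > 0)
            {
                float gap = glyphs[i].StartX - glyphs[i - 1].EndX;
                if (gap > spaceWidth * factor)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(glyphs[i].Ch);
        }
        return sb.ToString();
    }
}
```

With a space width of 3 and the glyphs H [0, 5], i [5.25, 7], y [8, 11], o [11.25, 14], the gaps are 0.25, 1.0 and 0.25. `Join(glyphs, 3f)` yields "Hiyo" because the word gap of 1.0 is below the threshold of 1.5, `Join(glyphs, 3f, 0.25f)` yields "Hi yo", and `Join(glyphs, 3f, 0.05f)` yields "H i y o", since even the gaps between letters now exceed the threshold.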

In SimpleTextExtractionStrategy.RenderText():

if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
    AppendTextChunk(' ');
}


You can extend SimpleTextExtractionStrategy and override RenderText() to use a different threshold.

With LocationTextExtractionStrategy this is more convenient. You only need to override IsChunkAtWordBoundary():

protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if (dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;

    return false;
}
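
As a rough sketch of what such an override could look like, the subclass below replaces the fixed 1/2 with a tunable factor. It assumes an iTextSharp 5.x build in which IsChunkAtWordBoundary is declared protected virtual and TextChunk exposes DistanceFromEndOf and CharSpaceWidth. Check this against the version you are using. The names `TunableLocationStrategy`, `PdfTextReader`, `ReadAllText` and `factor` are hypothetical.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Sketch only: assumes IsChunkAtWordBoundary is overridable in the
// iTextSharp version in use. Class and parameter names are hypothetical.
public class TunableLocationStrategy : LocationTextExtractionStrategy
{
    private readonly float factor;

    public TunableLocationStrategy(float factor)
    {
        this.factor = factor;
    }

    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        float dist = chunk.DistanceFromEndOf(previousChunk);
        // Same test as the library code, with "/ 2.0f" replaced by a tunable factor.
        return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth * factor;
    }
}

public static class PdfTextReader
{
    // Extracts the text of every page using the given word-boundary factor.
    public static string ReadAllText(string path, float factor)
    {
        var reader = new PdfReader(path);
        try
        {
            var sb = new StringBuilder();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so chunks collected on earlier
                // pages are not emitted again.
                var strategy = new TunableLocationStrategy(factor);
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
            return sb.ToString();
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling `PdfTextReader.ReadAllText(path, 0.5f)` should behave like the stock strategy under these assumptions. Smaller values such as 0.3f insert spaces more readily.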


You'll have to experiment a bit to get good results for your PDFs. "spacewidth / 2" is apparently too large in your case. But if you make the threshold too small, you'll get false positives: whitespace will be inserted within words.
                                                                        
                                                        
            
            
              
                