How to comma-separate words when using the PyPDF2 library

不打扰是莪最后的温柔 submitted on 2019-12-13 04:34:40

Question


I'm converting a PDF to text using PyPDF2, and in the output some words run together. The code is shown below:

import PyPDF2
import textract

filename = 'CS1.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""

# Extract the text of every page and append it to one string
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    print(pageObj)
    text += pageObj.extractText()

# If PyPDF2 found no text at all, fall back to OCR via textract
if text == "":
    text = textract.process('/home/ayush/Ayush/1june/pdf_to_text/CS1.pdf', method='tesseract', language='eng')
print(text)

Output:

Topursuegraduatestudiesincomputerscienceandengineering

How can I get the following instead?

To,pursue,graduate,studies,in,computer,science,and,engineering


Answer 1:


Please try adding a print call right after the extraction line:

text += pageObj.extractText()
print(pageObj.extractText())

How does the text look at that stage before the concatenation?

I might have found the reason. Download iText RUPS to inspect the PDF. This tool shows how the content is rendered and placed on the page.

Navigate to Stream

In the lower right corner you can read

I am not familiar with the PDF spec, but this answer states

These numbers adjust the respective text position by that amount. Numbers are expressed in thousandths of a unit of text space. According to the official PDF spec, this "amount shall be subtracted from the current horizontal or vertical coordinate". A positive number therefore moves the next string to the left when writing horizontally. A negative number moves it to the right.

My suspicion is that PyPDF2 does not interpret these numbers as spaces. Doing so is probably not that easy, because you have to know how large an offset corresponds to the width of a space character.
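To illustrate the idea, here is a minimal sketch in plain Python of how such an array of strings and numeric adjustments could be turned into separated words. It does not use PyPDF2 at all. The function name join_tj_array, the threshold of -200, and the sample values are assumptions made up for this example; a real PDF would need a threshold tuned to its font and font size.

```python
def join_tj_array(elements, gap_threshold=-200, separator=","):
    """Join the strings of a text-showing array into separated words.

    elements      -- list mixing strings and numeric position adjustments
    gap_threshold -- hypothetical cut-off: adjustments below it (a large
                     move to the right) are treated as a word gap
    """
    words = []
    current = ""
    for element in elements:
        if isinstance(element, str):
            # String fragments belong to the word being built
            current += element
        elif element < gap_threshold:
            # Large negative number: big shift to the right, assume a space
            if current:
                words.append(current)
            current = ""
        # Any other number is treated as kerning and ignored
    if current:
        words.append(current)
    return separator.join(words)


# Made-up sample data, not taken from the PDF in the question
sample = ["To", -250, "pur", 15, "sue", -250, "graduate", -250, "studies"]
print(join_tj_array(sample))  # To,pursue,graduate,studies
```

The small positive value between "pur" and "sue" stands in for kerning inside a word, which is why a simple "any number means space" rule would not be enough.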

I had a quick look at other PDFs, and text that has actual spaces between the words instead of numbers is read correctly. Please try that.
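Once the extracted text does contain spaces, getting the comma-separated form asked for in the question is a matter of splitting on whitespace and joining with commas. A small sketch, where comma_separate is a hypothetical helper name and the input string is typed by hand rather than extracted from a PDF:

```python
def comma_separate(text):
    """Split text on any run of whitespace and join the words with commas."""
    return ",".join(text.split())


extracted = "To pursue graduate studies in computer science and engineering"
print(comma_separate(extracted))
# To,pursue,graduate,studies,in,computer,science,and,engineering
```

Calling split() without arguments also collapses newlines and repeated spaces, which tend to show up in text extracted from PDFs.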

If this is the problem, your next move could be to iterate directly over the elements as shown in iText RUPS. It is a bit cumbersome but possible. You can find examples for PyPDF2.



Source: https://stackoverflow.com/questions/52605515/how-to-comma-separate-words-when-using-pypdf2-library
