Question
I'm converting a PDF to text using PyPDF2, and in the extracted text the words run together. The code is shown below:
import PyPDF2
import textract

filename = 'CS1.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
# Extract the text of every page and concatenate it.
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    print(pageObj)
    text += pageObj.extractText()
# If PyPDF2 extracted nothing, fall back to OCR via textract.
if text == "":
    text = textract.process('/home/ayush/Ayush/1june/pdf_to_text/CS1.pdf', method='tesseract', language='eng')
print(text)
Output:
Topursuegraduatestudiesincomputerscienceandengineering
How can I get the following instead?
To,pursue,graduate,studies,in,computer,science,and,engineering
Answer 1:
Please try adding a print right after the concatenation:
text += pageObj.extractText()
print(pageObj.extractText())
How does the text look at that stage before the concatenation?
I might have found the reason. Download iText RUPS to inspect the PDF. This tool shows how the content is rendered and placed on the page.
Navigate to Stream
In the lower right corner you can read
I am not familiar with the PDF spec, but this answer states:
"These numbers adjust the respective text position by that amount. Numbers are expressed in thousandths of a unit of text space. According to the official PDF spec, this 'amount shall be subtracted from the current horizontal or vertical coordinate'. A positive number therefore moves the next string to the left when writing horizontally. A negative number moves it to the right."
My suspicion is that PyPDF2 does not interpret a number as space. This is probably not that easy as you have to know how many pixels equal a space character.
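To illustrate the idea, here is a minimal sketch of treating a large adjustment as a space. It works on a plain Python list that stands in for the strings and numbers of one text array. The function name tj_array_to_text, the sample data, and the threshold of -200 are all assumptions made for illustration; the threshold in particular is a guess that would need tuning per font, and it does not come from the PDF spec or from PyPDF2.

```python
def tj_array_to_text(items, gap_threshold=-200):
    """Join the strings of a text array, adding a space at large rightward shifts.

    gap_threshold is an assumed value in thousandths of a text-space unit.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif item <= gap_threshold and parts and not parts[-1].endswith(" "):
            # A strongly negative number moves the next string to the right,
            # which this sketch treats as a word gap.
            parts.append(" ")
        # Smaller adjustments are treated as kerning and ignored.
    return "".join(parts)


# Hypothetical array: small numbers are kerning, large negative ones are gaps.
sample = ["To", -280, "pur", 15, "sue", -280, "gr", -10, "aduate", -280, "studies"]
text = tj_array_to_text(sample)
print(text)
print(",".join(text.split()))
```

Once the spaces are back, the comma-separated form asked for in the question is just a split and join on whitespace.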
I had a quick look at other PDFs, and text that has spaces instead of numbers in between is read correctly. Please try that.
If this is the problem, your next move could be to iterate over the elements directly, as shown in iText RUPS. It is a bit cumbersome but possible. You can find examples for PyPDF2.
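As a rough sketch of what iterating over the elements could look like, the function below walks a list of (operands, operator) pairs and rebuilds the text, inserting a space at large negative adjustments. The pair layout with byte-string operator names is an assumption modelled on what older PyPDF2 versions appear to expose for a page's content stream, so check it against your installed version. The function name text_from_operations, the sample list, and the -200 threshold are hypothetical.

```python
def text_from_operations(operations, gap_threshold=-200):
    """Rebuild text from (operands, operator) pairs of a content stream."""
    out = []
    for operands, operator in operations:
        if operator == b"Tj":
            # Show a single string.
            out.append(operands[0])
        elif operator == b"TJ":
            # Show an array of strings and position adjustments.
            for item in operands[0]:
                if isinstance(item, str):
                    out.append(item)
                elif item <= gap_threshold:
                    out.append(" ")
        elif operator in (b"T*", b"ET"):
            # Treat a line move or the end of a text object as a line break.
            out.append("\n")
    return "".join(out)


# Hypothetical operations, standing in for a parsed content stream.
ops = [
    ([], b"BT"),
    ([["To", -280, "pursue", -280, "graduate"]], b"TJ"),
    ([], b"T*"),
    (["studies"], b"Tj"),
    ([], b"ET"),
]
text = text_from_operations(ops)
print(",".join(text.split()))
```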
Source: https://stackoverflow.com/questions/52605515/how-to-comma-separate-words-when-using-pypdf2-library