Question
When I run the following code:
from PyPDF2 import PdfFileMerger
merge = PdfFileMerger()
for newFile in nlst:
    merge.append(newFile)
merge.write('newFile.pdf')
I get the following error:
raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found
Can anybody tell me what happened? Thanks.
Answer 1:
PDF is a file format in which a parser normally starts by reading some global information located at the end of the file. The very end of the document must contain a line with the content
%%EOF
This marker tells the PDF parser that the document ends here and that the global information it needs (a startxref section) should come just before it.
My guess is that the error message means one of the input documents was truncated and is missing this %%EOF marker.
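To find out which input is affected, one option is to look for the marker near the end of each file before merging. The following is a minimal sketch using only the standard library; the function name has_eof_marker and the default tail_size of 1024 bytes are assumptions made for illustration, not part of PyPDF2.

```python
import os

def has_eof_marker(path, tail_size=1024):
    """Return True if b'%%EOF' appears in the last tail_size bytes of the file."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        # Jump to the start of the tail (or to the start of the file if it is short).
        f.seek(max(0, size - tail_size))
        return b'%%EOF' in f.read()
```

Calling this for every path in nlst before merge.append should show which file is truncated. It only checks that the marker is present near the end, not that the PDF is otherwise valid.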
Answer 2:
I also ran into this problem and found a solution.
First, Python reads and writes PDF files in binary mode, 'rb' or 'wb'.
An unexpected end-of-file (EOF) error in Python source code occurs when there is an open parenthesis somewhere on a line but no matching closing parenthesis: Python reaches the end of the file while looking for the closing parenthesis.
Here is one solution:
Close the file that you opened earlier:
newfile.close()
Check whether the same PDF is still open under another variable, and close that too:
Same_file_with_another_variable.close()
Now open it only once and use it, and you are good to go.
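One way to avoid forgetting a close() call is to open files in a with block, which closes the handle automatically. This is a small sketch of that idea with plain file objects; the helper names write_pdf_bytes and read_pdf_bytes are hypothetical and only illustrate the pattern.

```python
def write_pdf_bytes(path, data):
    """Write data to path in binary mode; the with block closes the handle."""
    with open(path, 'wb') as f:
        f.write(data)
    # The handle is closed here, so buffered bytes (including a trailing
    # %%EOF line) have been flushed before anyone reads the file.

def read_pdf_bytes(path):
    """Read the whole file in binary mode and close the handle afterwards."""
    with open(path, 'rb') as f:
        return f.read()
```

If a PDF is written through a handle that is never closed or flushed, a later reader may see a file whose tail is missing, which is one way to end up without the %%EOF line.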
Answer 3:
One simple solution for this problem (EOF marker not found): open your .pdf file in another application (I used LibreOffice Draw on Ubuntu 18.04), then export the file as .pdf. With the exported .pdf file the problem no longer occurs.
Answer 4:
After encountering this problem with camelot and PyPDF2, I did some digging and solved it.
The end-of-file marker '%%EOF' is meant to be the very last line, but some PDF files have a huge chunk of JavaScript after this line, so the reader cannot find the EOF.
Illustration of what the EOF plus JavaScript looks like if you open the file in binary mode and read its lines:
b'>>\r\n',
b'startxref\r\n',
b'275824\r\n',
b'%%EOF\r\n',
b'\n',
b'\n',
b'<script type="text/javascript">\n',
b'\twindow.parent.focus();\n',
b'</script><!DOCTYPE html>\n',
b'\n',
b'\n',
b'\n',
So you just need to truncate the file before the JavaScript begins.
Solution:
import PyPDF2

def reset_eof_of_pdf_return_stream(pdf_stream_in: list):
    # find the line position of the last EOF marker, searching from the end
    for i, x in enumerate(pdf_stream_in[::-1]):
        if b'%%EOF' in x:
            actual_line = len(pdf_stream_in) - i
            print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
            break
    else:
        # no line contained the marker, so there is nothing to truncate to
        raise ValueError('no %%EOF marker found')
    # return the list up to and including the EOF line
    return pdf_stream_in[:actual_line]

# open the file for reading in binary mode
with open('data/XXX.pdf', 'rb') as p:
    txt = p.readlines()

# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)

# write to new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
    f.writelines(txtx)

fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')
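The same truncation can also be done on the raw bytes instead of on a list of lines. The sketch below is an alternative under that assumption; truncate_after_last_eof is a hypothetical helper name, and it assumes the trailing junk does not itself contain the text %%EOF.

```python
def truncate_after_last_eof(data: bytes) -> bytes:
    """Return data cut off right after the last b'%%EOF', plus a newline."""
    marker = b'%%EOF'
    pos = data.rfind(marker)  # index of the last occurrence, or -1
    if pos == -1:
        raise ValueError('no %%EOF marker found')
    # Keep everything up to and including the marker, then end the line.
    return data[:pos + len(marker)] + b'\n'
```

Reading the file with open(..., 'rb'), passing the bytes through this function, and writing the result with open(..., 'wb') gives a file that ends at the marker.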
Source: https://stackoverflow.com/questions/45390608/eof-marker-not-found-while-use-pypdf2-merge-pdf-file-in-python