EOF marker not found while using PyPDF2 to merge PDF files in Python

Question

When I use the following code

from PyPDF2 import PdfFileMerger

merge = PdfFileMerger()

for newFile in nlst:
    merge.append(newFile)
merge.write('newFile.pdf')

I get the following error:

raise utils.PdfReadError("EOF marker not found")

PyPDF2.utils.PdfReadError: EOF marker not found

Could anybody tell me what went wrong? Thanks.

Answer 1:

PDF is a file format in which a parser normally starts by reading some global information located at the end of the file. The very last line of the document must contain

%%EOF

This marker tells the PDF parser that the document ends here and that the global information it needs (the startxref section) should come just before it.

I guess the error message you see means that one of the input documents was truncated and is missing this %%EOF marker.
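To find out which input file is the culprit, you can scan the tail of each file for the marker before merging. The following is only a rough sketch using the standard library; the helper names `has_eof_marker` and `find_files_missing_eof` and the 1024-byte window are assumptions for illustration, not part of PyPDF2.

```python
import os


def has_eof_marker(path, tail_bytes=1024):
    """Return True if b'%%EOF' appears in the last `tail_bytes` bytes of the file."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        # Jump close to the end; the marker should be in the final lines.
        f.seek(max(0, size - tail_bytes))
        return b'%%EOF' in f.read()


def find_files_missing_eof(paths):
    """Return the subset of `paths` whose tail does not contain the marker."""
    return [p for p in paths if not has_eof_marker(p)]
```

Calling `find_files_missing_eof(nlst)` before the merge loop would list the files that are likely truncated or otherwise damaged.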

Answer 2:

I've also got that problem and got a solution.

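First, Python should open PDF files in binary mode: 'rb' for reading and 'wb' for writing.

Note that this error is not Python's own "unexpected EOF" syntax error, which occurs when a line has an open parenthesis without a matching closing one and Python reaches the end of the source file while looking for it. Here the missing marker is in the PDF data itself, which can happen, for example, when the file is read while another handle still has it open for writing and its contents have not been fully flushed to disk.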

Here is one solution:

Close the file that you opened earlier:

newfile.close()

Check whether the same PDF is also open under another variable, and close that as well:

Same_file_with_another_variable.close()

Now open it only once and use it, and you are good to go.
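A simple way to make sure a handle is always closed is a `with` block. This is a minimal sketch using only the standard library; the helper name `write_then_read` is hypothetical, and the bytes passed in stand for whatever your program writes.

```python
def write_then_read(path, data):
    """Write `data` to `path`, close the handle, then read the file back."""
    # Leaving the with block closes the write handle and flushes the data to disk.
    with open(path, 'wb') as out:
        out.write(data)

    # Only now is the file reopened, once, for reading.
    with open(path, 'rb') as src:
        return src.read()
```

The same pattern applies when one step of a script writes a PDF and a later step passes it to the merger: finish and close the write before reading.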

Answer 3:

One simple solution for this problem (EOF marker not found): open your .pdf file in another application (I used LibreOffice Draw on Ubuntu 18.04), then export the file as .pdf again. With the exported .pdf file, the problem no longer occurs.

Answer 4:

After encountering this problem using camelot and PyPDF2, I did some digging and have solved the problem.

The end-of-file marker '%%EOF' is meant to be the very last line, but some PDF files have a huge chunk of JavaScript appended after this line, and the reader cannot find the EOF.

Illustration of what the EOF plus javascript looks like if you open it:

 b'>>\r\n',
 b'startxref\r\n',
 b'275824\r\n',
 b'%%EOF\r\n',
 b'\n',
 b'\n',
 b'<script type="text/javascript">\n',
 b'\twindow.parent.focus();\n',
 b'</script><!DOCTYPE html>\n',
 b'\n',
 b'\n',
 b'\n',

So you just need to truncate the file before the JavaScript begins.

Solution:

import PyPDF2


def reset_eof_of_pdf_return_stream(pdf_stream_in: list):
    # fall back to the whole file if no EOF marker is found
    actual_line = len(pdf_stream_in)

    # find the line position of the EOF, scanning from the end of the file
    for i, x in enumerate(pdf_stream_in[::-1]):
        if b'%%EOF' in x:
            actual_line = len(pdf_stream_in) - i
            print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
            break

    # return the list up to and including the EOF line
    return pdf_stream_in[:actual_line]

# open the file for reading in binary mode
with open('data/XXX.pdf', 'rb') as p:
    txt = p.readlines()

# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)

# write to new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
    f.writelines(txtx)

fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')
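The same idea can also be sketched on the raw bytes rather than on a list of lines, which avoids depending on how the file happens to be split. The function name `truncate_after_eof` is an assumption for illustration. Like the line-based version, it keeps everything up to the last occurrence of the marker, so it would not help if the trailing junk itself contained the text %%EOF.

```python
def truncate_after_eof(data: bytes) -> bytes:
    """Drop anything that follows the last b'%%EOF' marker in `data`."""
    marker = b'%%EOF'
    idx = data.rfind(marker)
    if idx == -1:
        # No marker at all: nothing can be repaired by truncation.
        return data
    # Keep the marker itself and end the file with a single newline.
    return data[:idx + len(marker)] + b'\n'
```

You would read the file with `open(path, 'rb').read()`, pass the bytes through this function, and write the result to a new file in 'wb' mode before handing it to the reader.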

Source: https://stackoverflow.com/questions/45390608/eof-marker-not-found-while-use-pypdf2-merge-pdf-file-in-python

Tags

python

pdf

pypdf2