How to unlock a “secured” (read-protected) PDF in Python?

倾然丶 夕夏残阳落幕 提交于 2019-12-02 20:52:23
Jaza

As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable to True isn't going to help you.

Per this thread:

Does a library exist to remove passwords from PDFs programmatically?

I would recommend removing the read-protection with a command-line tool such as qpdf (easily installable, e.g. on Ubuntu use apt-get install qpdf if you don't have it already):

qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf

Then open the unlocked file with pdfminer and do your stuff.

For a pure-Python solution, you can try using PyPDF2 and its .decrypt() method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf - see:

https://github.com/mstamy2/PyPDF2/issues/53

I had some issues trying to get qpdf to behave in my program. I found a useful library, pikepdf, that is based on qpdf and automatically converts pdfs to be extractable.

The code to use this is pretty straightforward:

import pikepdf

pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')

In my case there was no password, but simply setting check_extractable=False circumvented the PDFTextExtractionNotAllowed exception for a problematic file (that opened fine in other viewers).

The 'check_extractable=True' argument is by design. Some PDFs explicitly disallow to extract text, and PDFMiner follows the directive. You can override it (giving check_extractable=False), but do it at your own risk.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!