问题
In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1
ab0>
When I open this pdf with Acrobat Pro it turns out it is secured (or "read protected"). From this link however, I read that there's a multitude of services which can disable this read-protection easily (for example pdfunlock.com. When diving into the source of pdfminer, I see that the error above is generated on these lines.
if check_extractable and not doc.is_extractable:
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
Since there's a multitude of services which can disable this read-protection within a second, I presume it is really easy to do. It seems that .is_extractable
is a simple attribute of the doc
, but I don't think it is as simple as changing .is_extractable
to True..
Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!
================================================
Below you will find the code with which I currently extract the text from non-read protected.
def getTextFromPDF(rawFile):
resourceManager = PDFResourceManager(caching=True)
outfp = StringIO()
device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
interpreter = PDFPageInterpreter(resourceManager, device)
fileData = StringIO()
fileData.write(rawFile)
for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
interpreter.process_page(page)
fileData.close()
device.close()
result = outfp.getvalue()
outfp.close()
return result
回答1:
As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable
to True
isn't going to help you.
Per this thread:
Does a library exist to remove passwords from PDFs programmatically?
I would recommend removing the read-protection with a command-line tool such as qpdf
(easily installable, e.g. on Ubuntu use apt-get install qpdf
if you don't have it already):
qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf
Then open the unlocked file with pdfminer
and do your stuff.
For a pure-Python solution, you can try using PyPDF2
and its .decrypt()
method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf
- see:
https://github.com/mstamy2/PyPDF2/issues/53
回答2:
I had some issues trying to get qpdf to behave in my program. I found a useful library, pikepdf, that is based on qpdf and automatically converts pdfs to be extractable.
The code to use this is pretty straightforward:
import pikepdf
pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')
回答3:
In my case there was no password, but simply setting check_extractable=False
circumvented the PDFTextExtractionNotAllowed
exception for a problematic file (that opened fine in other viewers).
回答4:
The 'check_extractable=True' argument is by design. Some PDFs explicitly disallow to extract text, and PDFMiner follows the directive. You can override it (giving check_extractable=False), but do it at your own risk.
来源:https://stackoverflow.com/questions/28192977/how-to-unlock-a-secured-read-protected-pdf-in-python