My stuff: python 2.6 64 bit (with pyPdf-1.13.win32.exe installed). Wing IDE. Windows 7 64 bit.
I got the following error:
NotImplementedError: unsupported filter /LZWDecode
When I ran the following code:
from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

path = 'C:\\Users\\Homer\\Documents\\' # This is where I put my pdfs
filelist = os.listdir(path)
has_text_list = []
does_not_have_text_list = []

for pdf_name in filelist:
    pdf_file_with_directory = os.path.join(path, pdf_name)
    pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText() #this is the line what done it
        does_it_have_text = re.findall(r'\w{2,}', content)
        if does_it_have_text == []:
            does_not_have_text_list.append(pdf_name)
            print pdf_name
        else:
            has_text_list.append(pdf_name)
print does_not_have_text_list
Here's a little background. The path is full of pdfs. Some were saved from text documents using the Adobe pdf printer (at least I think that's how they did it). And some were scanned as images. I wanted to separate them and OCR the ones that are images (the non-image ones are perfect and ought not to be messed with).
I asked here a few days ago how to do that:
The only response I got was in VB, and I only speaky the python. So I figured I would try to write an answer to my own question. My strategy (reflected in the code above) is like this. If it's just an image, then that regular expression will return an empty list. If it has text, the regular expression (which matches any word with 2 or more alphanumeric characters) will return a list populated with stuff like u'word' (in python, I think that's a unicode string).
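To make that concrete, here is a small sketch of how the pattern behaves on its own, without pyPdf. The two sample strings are made up for illustration; they stand in for whatever extractText() would return for a typed page and for an image-only page.

```python
import re

WORD_PATTERN = r'\w{2,}'  # two or more consecutive word characters

# Made-up stand-ins for extracted page text (not real pyPdf output).
text_page = u'Invoice 42 total due'   # a page that contains real text
image_page = u''                      # an image-only page yields no text

words_found = re.findall(WORD_PATTERN, text_page)
nothing_found = re.findall(WORD_PATTERN, image_page)

print(words_found)    # a non-empty list means "has text"
print(nothing_found)  # an empty list means "probably just an image"
```

Note that \w also matches digits and underscores, so a stray "42" counts as a word here. If a scanned page happens to yield a few junk characters, this test could report text where there is none.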
So the code should work, and we can take the first step to finish off that other thread using open source software (separating the ocrd from imaged pdfs), but I don't know how to deal with this filter error and googling wasn't helpful. So if anyone knows, would be quite helpful.
I don't really know how to use this stuff. I'm not sure what filter means in pyPdf speak. I think it's saying that it can't really read the pdf or something, even though it's ocrd. Funnily, I put one of the non-ocrd and one of the ocrd pdfs in the same folder as a python file and this worked on just the one without the for loop, so I don't know why doing them with the for loop created the filter error. I'll post the single code below. THX.
from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

pdf = pyPdf.PdfFileReader(open('my_ocrd_file.pdf', 'rb'))
has_text_list = []
does_not_have_text_list = []

for i in range(0, pdf.getNumPages()):
    content = pdf.getPage(i).extractText()
    does_it_have_text = re.findall(r'\w{2,}', content)
    print does_it_have_text
and it prints stuff, so I don't know why I get a filter error on one and not the other. When I run this code against the other file in the directory (the one that's NOT ocrd), the output is an empty list on one line and an empty list on the next, like so:
[]
[]
So I don't guess it's a filter problem with the non-ocrd pdfs either. This is like over my head and I need some help here.
Edit:
Google search found this, but I don't know what to make of it:
Replace pyPdf's filter.py with http://vaitls.com/treas/pdf/pyPdf/filters.py in your pyPdf source folder. That worked for me.
LZW is a compression format used in GIFs and sometimes in PDFs. If you look at the filters available in pyPdf.filters you'll see that LZW is not there, hence the NotImplementedError.
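Until an LZW-capable filter is in place, one way to keep the loop from dying on the first such file is to catch the NotImplementedError per file and set those files aside. This is only a minimal sketch: `extract_page_texts` is a hypothetical callable (not part of pyPdf) that returns the text of each page of a file, and `fake_extract` is a stand-in so the example runs without pyPdf installed.

```python
import re

WORD_RE = re.compile(r'\w{2,}')  # same pattern as in the question


def classify_pdfs(pdf_names, extract_page_texts):
    """Split names into (has_text, no_text, unreadable).

    extract_page_texts is a hypothetical callable: given a file name it
    returns one string per page, and it may raise NotImplementedError
    when a page uses a filter the PDF library cannot decode.
    """
    has_text, no_text, unreadable = [], [], []
    for name in pdf_names:
        try:
            pages = list(extract_page_texts(name))
        except NotImplementedError as err:
            # Remember the file and the reason, then move on to the next one.
            unreadable.append((name, str(err)))
            continue
        # Classify once per file: text on any page counts as "has text".
        if any(WORD_RE.search(text) for text in pages):
            has_text.append(name)
        else:
            no_text.append(name)
    return has_text, no_text, unreadable


def fake_extract(name):
    # Stand-in extractor with made-up file names and contents.
    if name == 'lzw.pdf':
        raise NotImplementedError('unsupported filter /LZWDecode')
    return {'typed.pdf': [u'Hello world'], 'scan.pdf': [u'', u'']}[name]


result = classify_pdfs(['typed.pdf', 'scan.pdf', 'lzw.pdf'], fake_extract)
print(result)
```

With pyPdf, the callable could wrap the same PdfFileReader(...), getNumPages() and getPage(i).extractText() calls used in the question. Unlike the question's loop, this version records each file once rather than once per page, so a file cannot end up in a list several times.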
The link you posted is to code in a subversion repository where someone has implemented an LZW filter.
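To give an idea of what such a filter has to do, here is a rough, self-contained sketch of LZW decoding as PDF uses it. It is not the code from the linked repository and not part of pyPdf. It assumes the defaults described in the PDF Reference: codes of 9 to 12 bits packed high bit first, clear code 256, end-of-data code 257, the default "early change" behaviour, and no predictor.

```python
def lzw_decode(data):
    """Decode a PDF-style LZW stream (sketch; assumes EarlyChange=1)."""
    CLEAR, EOD, MAX_WIDTH = 256, 257, 12

    def fresh_table():
        # Codes 0-255 are single bytes; 256 and 257 are reserved markers.
        return [bytearray([i]) for i in range(256)] + [None, None]

    table, width, prev = fresh_table(), 9, None
    out = bytearray()
    bitbuf, bitcount = 0, 0  # unread bits and how many there are

    for byte in bytearray(data):
        bitbuf = (bitbuf << 8) | byte
        bitcount += 8
        while bitcount >= width:
            # Take the next code from the high end of the bit buffer.
            bitcount -= width
            code = bitbuf >> bitcount
            bitbuf &= (1 << bitcount) - 1
            if code == CLEAR:
                table, width, prev = fresh_table(), 9, None
                continue
            if code == EOD:
                return bytes(out)
            if code < len(table):
                entry = table[code]
                if prev is not None and len(table) < 4096:
                    table.append(prev + entry[:1])
            elif code == len(table) and prev is not None:
                # Code not in the table yet: previous string + its first byte.
                entry = prev + prev[:1]
                table.append(entry)
            else:
                raise ValueError('corrupt LZW data: code %d' % code)
            out += entry
            prev = entry
            # "Early change": widen one code before the table would need it.
            if len(table) + 1 >= (1 << width) and width < MAX_WIDTH:
                width += 1
    return bytes(out)


# Short worked example from the PDF Reference's LZWDecode section.
sample = bytearray([0x80, 0x0B, 0x60, 0x50, 0x22, 0x0C, 0x0C, 0x85, 0x01])
assert lzw_decode(sample) == b'-----A---B'
```

A real filter also has to honour the stream's decode parameters (for example a predictor or EarlyChange set to 0), which this sketch ignores.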
Source: https://stackoverflow.com/questions/6053064/python-pypdf-adobe-pdf-ocr-error-unsupported-filter-lzwdecode