pypdf

Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode

爱⌒轻易说出口 submitted on 2019-12-05 14:48:49
My setup: Python 2.6 64-bit (with pyPdf-1.13.win32.exe installed), Wing IDE, Windows 7 64-bit. I got the following error:

    NotImplementedError: unsupported filter /LZWDecode

when I ran the following code:

    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    path = 'C:\\Users\\Homer\\Documents\\'  # This is where I put my pdfs
    filelist = os.listdir(path)
    has_text_list = []
    does_not_have_text_list = []
    for pdf_name in filelist:
        pdf_file_with_directory = os.path.join(path, pdf_name)
        pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
        for i in range(0, pdf.getNumPages(

Python text extraction does not work on some pdfs

梦想与她 submitted on 2019-12-05 07:39:25
Question: I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF. My code looks like this:

    url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
    #url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
    f = urlopen(Request(url)).read()
    fileInput = StringIO(f)
    pdf = PyPDF2.PdfFileReader(fileInput)
    print pdf.getNumPages()
    print pdf.getDocumentInfo()
    print pdf.getPage(1).extractText()

I am

How to open a generated PDF file in browser?

耗尽温柔 submitted on 2019-12-04 22:16:26
Question: I have written a PDF merger that merges an original file with a watermark. What I want to do now is open the 'document-output.pdf' file in the browser from a Django view. I have already checked Django's related articles, but my approach is somewhat different: I don't create the PDF object directly using the response object as its "file", so I am kind of lost. How can I do this in a Django view?

    from pyPdf import PdfFileWriter, PdfFileReader
    from reportlab.pdfgen.canvas import Canvas
    from

EOF marker not found - How to fix in PyPDF and PyPDF2?

眉间皱痕 submitted on 2019-12-04 19:08:11
Question: I'm attempting to combine a few PDF files into a single PDF file using Python. I've tried both PyPDF and PyPDF2; on some files, they both throw the same error:

    PdfReadError: EOF marker not found

Here's my code (page_files is a list of PDF file paths to combine):

    # use pypdf to combine pdf pages
    output = PdfFileWriter()
    for pf in page_files:
        filestream = file(pf, "rb")
        pdf = PdfFileReader(filestream)
        for num in range(pdf.getNumPages()):
            output.addPage(pdf.getPage(num))
    # write final file

How to merge two landscape pdf pages using pyPdf

落花浮王杯 submitted on 2019-12-04 08:22:55
I'm having trouble merging two PDF files with pyPdf. When I run the following code, the watermark (page1) looks fine, but page2 has been rotated 90 degrees clockwise. Any ideas what's going on?

    from pyPdf import PdfFileWriter, PdfFileReader

    # PDF1: A4 Landscape page created in photoshop using PdfCreator,
    input1 = PdfFileReader(file("base.pdf", "rb"))
    page1 = input1.getPage(0)

    # PDF2: A4 Landscape page, text only, created using Pisa (www.xhtml2pdf.com)
    input2 = PdfFileReader(file("text.pdf", "rb"))
    page2 = input2.getPage(0)

    # Merge
    page1.mergePage(page2)

    # Output
    output = PdfFileWriter()

python and pyPdf - how to extract text from the pages so that there are spaces between lines

Deadly submitted on 2019-12-04 07:16:30
Currently, if I create a page object for a PDF page with pyPdf and call extractText(), the lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world", the text returned from extractText() is "helloworld" instead of "hello world". Does anyone know how to fix this, or have suggestions for a workaround? I really need the text to have spaces between the lines because I'm doing text mining on this PDF text, and the missing spaces between lines kill it.

This is a common problem with PDF parsing. You can also expect trailing

How to close pyPDF “PdfFileReader” Class file handle

冷暖自知 submitted on 2019-12-04 00:59:06
Question: This should be a very simple question, but I couldn't find an answer with a Google search: how do I close the file handle opened by the pyPDF "PdfFileReader" class? Here is a snippet:

    import os.path
    from pyPdf import PdfFileReader

    fname = 'my.pdf'
    input = PdfFileReader(file(fname, "rb"))
    os.rename(fname, 'my_renamed.pdf')

which raises error [32]. Thanks.

Answer 1: The operating system is preventing a file from being renamed while something else has it open. This is a Good Thing (tm). Python's with statement will
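The answer above is cut off, so the following is only a minimal sketch of the general idea it points at: open the file in a with block so the handle is closed before the rename. It uses only the standard library. The helper name read_then_rename and its read parameter are hypothetical and stand in for whatever needs the open handle (for example, constructing a PdfFileReader and pulling out the values you need).

```python
import os
import tempfile


def read_then_rename(src, dst, read=lambda fh: fh.read()):
    """Run `read` on the open file, close it, then rename src to dst.

    `read` is a hypothetical placeholder for the work that needs the
    handle, e.g. lambda fh: PdfFileReader(fh).getNumPages().
    """
    with open(src, "rb") as fh:
        result = read(fh)
    # The with block has closed fh, so the rename is no longer blocked
    # by our own open handle.
    os.rename(src, dst)
    return result


if __name__ == "__main__":
    # Demo on a throwaway file so the sketch runs without any PDF library.
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "my.pdf")
    dst = os.path.join(workdir, "my_renamed.pdf")
    with open(src, "wb") as fh:
        fh.write(b"%PDF-1.4\n%%EOF\n")
    read_then_rename(src, dst)
    print(os.path.exists(dst))
```

One caveat, stated as an assumption about pyPdf's behaviour rather than something the snippet confirms: the reader appears to load page data lazily from the stream, so anything you need from it should be extracted inside the with block, before the handle is closed.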

Python text extraction does not work on some pdfs

大兔子大兔子 submitted on 2019-12-03 21:25:44
I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF. My code looks like this:

    url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
    #url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
    f = urlopen(Request(url)).read()
    fileInput = StringIO(f)
    pdf = PyPDF2.PdfFileReader(fileInput)
    print pdf.getNumPages()
    print pdf.getDocumentInfo()
    print pdf.getPage(1).extractText()

I am able to successfully extract text for the first link. But if I use the same program for the second PDF, I do

EOF marker not found - How to fix in PyPDF and PyPDF2?

随声附和 submitted on 2019-12-03 13:31:53
I'm attempting to combine a few PDF files into a single PDF file using Python. I've tried both PyPDF and PyPDF2; on some files, they both throw the same error:

    PdfReadError: EOF marker not found

Here's my code (page_files is a list of PDF file paths to combine):

    # use pypdf to combine pdf pages
    output = PdfFileWriter()
    for pf in page_files:
        filestream = file(pf, "rb")
        pdf = PdfFileReader(filestream)
        for num in range(pdf.getNumPages()):
            output.addPage(pdf.getPage(num))

    # write final file
    outputStream = file(pdf_full_path, "wb")
    output.write(outputStream)
    outputStream.close()

I've read a few
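The snippet is cut off before any fix is given, so what follows is only a diagnostic sketch, not the answer from the thread. It assumes the error comes from input files whose tail does not contain the `%%EOF` marker that a PDF is expected to end with, and it sorts the paths into those that have the marker and those that do not. The function names and the 1024-byte search window are assumptions made for illustration.

```python
import os


def has_eof_marker(path, tail_size=1024):
    """Return True if b'%%EOF' appears in the last tail_size bytes of the file."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        # Read only the tail; small files are read from the start.
        fh.seek(max(0, size - tail_size))
        return b"%%EOF" in fh.read()


def split_by_eof_marker(paths):
    """Split paths into (with_marker, without_marker), preserving order."""
    with_marker, without_marker = [], []
    for path in paths:
        if has_eof_marker(path):
            with_marker.append(path)
        else:
            without_marker.append(path)
    return with_marker, without_marker
```

Running split_by_eof_marker(page_files) before the merge loop would show which inputs are likely to trigger the error, so they can be inspected or regenerated separately. A file that passes this check can still be malformed in other ways.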

Highlight text in a PDF with Python [closed]

喜夏-厌秋 submitted on 2019-12-03 03:32:08
I'm working on a custom search engine for my PDF data corpus. I have a transformation layer that can dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view that returns the search results listing. Now I'd like to add a highlighting feature on the original PDF for the lines where the search terms appeared. Yes, I'm willing to modify the PDF files if necessary. Is there any way to highlight text inside a PDF file? Can PDFMiner, PyPDF2, or another Python library do that? ... or can you recommend another, maybe external, service for it? spacevillain