pypdf

pypdf python tool

一个人想着一个人 submitted on 2019-12-22 22:53:07
Question: Using the pypdf Python module, how can I read the following PDF file? http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf

    # -*- coding: utf-8 -*-
    from pyPdf import PdfFileWriter, PdfFileReader
    import pyPdf

    def getPDFContent(path):
        content = ""
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        for i in range(0, pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
        # Collapse whitespace
        content = " ".join
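The question's code targets the legacy pyPdf API on Python 2. As a rough sketch only, the same idea could look like this with the current `pypdf` package on Python 3, assuming that package is installed; the function name `get_pdf_content` is an illustrative choice, not something from the original post. Whether Devanagari text comes out readable likely depends on how the fonts are embedded in the file, so this is not a guaranteed fix.

```python
def get_pdf_content(path):
    """Return the text of every page, with whitespace collapsed to single spaces."""
    # Imported lazily (an assumption of this sketch) so that a missing
    # pypdf installation surfaces here with a clear ImportError.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() may yield an empty string for pages without a text layer.
    parts = [page.extract_text() or "" for page in reader.pages]
    # Collapse all runs of whitespace, as the original snippet set out to do.
    return " ".join("\n".join(parts).split())
```

Calling `get_pdf_content("Hindi_Book.pdf")` on a local copy of the file would return one long string.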

How can I extract a JavaScript from a PDF file with a command line tool?

China☆狼群 submitted on 2019-12-20 14:07:56
Question: How can I extract a JavaScript object from a PDF file using a command line tool? I am trying to build a GUI in Python around this function. I found these two modules, pyPdf2 and pyPdf, but couldn't get either to run.

Answer 1: When you deal with JavaScript in PDFs, you have to be aware of two cases (which you cannot necessarily distinguish in advance, before closely investigating the file in question):

1. "Harmless" JavaScript
2. Malicious JavaScript

Case 1: Harmless, "useful", "open" JavaScript

The OP gave a
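As a first triage step before any real extraction, one could check whether a file mentions JavaScript at all. The sketch below is a heuristic, not the answerer's method: it counts the `/JavaScript` and `/JS` name tokens in the raw bytes using only the standard library. It assumes the names appear unescaped and outside compressed object streams, so a count of zero does not prove the file is free of scripts. The names `JS_MARKERS` and `count_js_markers` are made up for this illustration.

```python
import re

# PDF name tokens that commonly introduce JavaScript actions.
JS_MARKERS = (b"/JavaScript", b"/JS")


def count_js_markers(pdf_bytes):
    """Count occurrences of each marker in the raw bytes of a PDF."""
    counts = {}
    for marker in JS_MARKERS:
        # The lookahead stops "/JS" from matching longer names such as "/JSON".
        pattern = re.escape(marker) + rb"(?![A-Za-z0-9])"
        counts[marker.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return counts
```

To use it on a file, read the file in binary mode and pass the bytes: `count_js_markers(open(path, "rb").read())`.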

How to install poppler in ubuntu 15.04?

匆匆过客 submitted on 2019-12-20 12:26:53
Question: Poppler is a PDF rendering library based on the xpdf-3.0 code base. I have already downloaded the tar.xz file from the official site, http://poppler.freedesktop.org/, but I do not know what to do with it. Is there a command to install or run it? P.S. I am new to Linux, so I don't know a lot about it yet.

Answer 1: What you downloaded from the Poppler site is source code, and you may not be expert enough to install it yourself. For such situations, Ubuntu and other Linux distros manage packages of
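Before building anything from source, it may be worth checking whether Poppler's command-line tools are already present. The sketch below looks for two of them, `pdftotext` and `pdfinfo`, on the PATH (on Ubuntu these ship in the `poppler-utils` package). The function name `find_poppler_tools` is a hypothetical helper for this illustration.

```python
import shutil

# Two of the command-line programs that ship with Poppler's utilities.
POPPLER_TOOLS = ("pdftotext", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}
```

If every value in the returned dictionary is `None`, the tools are not installed or not on the PATH.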

Cannot install PyPdf 2 module

♀尐吖头ヾ submitted on 2019-12-18 14:53:12
Question: I am trying to install the PyPdf2 module. I downloaded the zip and unzipped it, then ran python setup.py build and python setup.py install, but it does not seem to have been installed. When I try to import it from a Python script, I get an ImportError:

    import pyPdf
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named pyPdf

Any help please. I'm using Python 2.7 under Windows XP.

Answer 1: It appears the README file for PyPDF2 is incorrect. It suggests
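Python module names are case-sensitive, and the PyPDF2 distribution is imported as `PyPDF2`, not `pyPdf`, so an install can succeed while `import pyPdf` still fails. A small diagnostic sketch, using only the standard library, can show which of the historical import names is actually available; `CANDIDATES` and `installed_pdf_modules` are names invented for this example.

```python
import importlib.util

# Import names used by successive generations of the library.
CANDIDATES = ("pypdf", "PyPDF2", "pyPdf")


def installed_pdf_modules(candidates=CANDIDATES):
    """Return the candidate import names that Python can actually locate."""
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]
```

An empty list means none of the three names can be imported by the interpreter that ran the check, which usually points to the package having been installed for a different Python.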

How do I install pyPDF2 module using windows?

无人久伴 submitted on 2019-12-18 10:44:36
Question: As a newbie, I am having difficulty installing the pyPDF2 module. I have downloaded it. Where and how do I install it (setup.py) so I can use the module in the Python interpreter?

Answer 1: To install from a setup.py file under Windows, you can use the command line:

1. Hit the Windows key.
2. Type cmd.
3. Execute the command line (black window).
4. Type cd C:\Users\User\Downloads\pyPDF2 to go into the directory where setup.py is (this is where it would be on my machine if I downloaded it). The path can be copied from the Explorer window.
5. Type

pyPdf ignores newlines in PDF file

◇◆丶佛笑我妖孽 submitted on 2019-12-18 03:43:48
Question: I'm trying to extract each page of a PDF as a string:

    import pyPdf

    pages = []
    pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb'))
    for i in range(0, pdf.getNumPages()):
        this_page = pdf.getPage(i).extractText() + "\n"
        this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split())
        pages.append(this_page.encode("ascii", "xmlcharrefreplace"))
    for page in pages:
        print '*' * 80
        print page

But this script ignores newline characters, leaving me with messy strings like information concerning
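One thing worth noting about the snippet itself: `" ".join(... .split())` replaces every run of whitespace, newlines included, with a single space, so any line breaks the extractor did return are discarded at that step. A minimal sketch of a cleanup that keeps line breaks follows. It assumes the extraction call returns newlines in the first place, which the question suggests pyPdf may not; `normalize_page_text` is a hypothetical helper name.

```python
def normalize_page_text(text):
    """Collapse runs of spaces within each line while keeping line breaks."""
    # Treat non-breaking spaces like ordinary spaces, as the question does.
    lines = text.replace("\xa0", " ").splitlines()
    cleaned = (" ".join(line.split()) for line in lines)
    # Drop lines that are empty after cleaning.
    return "\n".join(line for line in cleaned if line)
```

For example, `normalize_page_text("a\xa0 b  \n\n  c   d \n")` gives the two lines `a b` and `c d`.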

Detect and alter strings in PDFs

社会主义新天地 submitted on 2019-12-17 20:47:24
Question: I want to be able to detect a pattern in a PDF and somehow flag it. For instance, in this PDF, there's the string *2. I want to be able to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (like highlighting them in yellow or adding a symbol in the margin). I would prefer to do this in Python, but I'm open to other languages. So far, I've been able to use pyPdf to read the PDF's text. I can use a regex to detect the pattern. But I haven't been able
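The detection half of the problem, finding every `*[integer]` in already-extracted text, can be sketched with the standard `re` module. This covers only the matching step, not the highlighting inside the PDF; the names `STAR_INT` and `find_marks` are illustrative.

```python
import re

# A literal asterisk immediately followed by one or more digits, e.g. "*2".
STAR_INT = re.compile(r"\*\d+")


def find_marks(text):
    """Return (start_offset, matched_text) for every *[integer] in text."""
    return [(m.start(), m.group()) for m in STAR_INT.finditer(text)]
```

The offsets refer to positions in the extracted string, not to coordinates on the page, so they would still have to be mapped back to page positions before anything can be drawn.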

Cropping pages of a .pdf file

最后都变了- submitted on 2019-12-17 10:29:21
Question: I was wondering if anyone has experience working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size. After a quick Google search I found the pyPdf library for Python, but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object, the results were not what I had expected and appeared to be quite random. Has anyone had any experience with this? Code examples would be much appreciated, preferably
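Seemingly random crop results often come down to the coordinate system: PDF boxes are given in points (1/72 inch) with the origin at the lower-left corner of the page, not the upper-left. The sketch below only computes the two corners for a centered crop. It assumes the page's media box starts at (0, 0), which is common but not guaranteed; `centered_crop_box` is a name invented here.

```python
def centered_crop_box(page_width, page_height, crop_width, crop_height):
    """Return ((x0, y0), (x1, y1)) for a crop rectangle centered on the page.

    All values are in PDF points; (x0, y0) is the lower-left corner and
    (x1, y1) the upper-right corner.
    """
    if crop_width > page_width or crop_height > page_height:
        raise ValueError("crop rectangle is larger than the page")
    x0 = (page_width - crop_width) / 2
    y0 = (page_height - crop_height) / 2
    return (x0, y0), (x0 + crop_width, y0 + crop_height)
```

For a US Letter page (612 x 792 points) and a 400 x 600 crop this gives corners at (106.0, 96.0) and (506.0, 696.0). In recent pypdf releases those two tuples could then be assigned to `page.cropbox.lower_left` and `page.cropbox.upper_right`; the legacy pyPdf spelling was `cropBox.lowerLeft` and `cropBox.upperRight`.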

pypdf Merging multiple pdf files into one pdf

房东的猫 submitted on 2019-12-17 10:26:58
Question: If I have 1000+ PDF files that need to be merged into one PDF:

    input = PdfFileReader()
    output = PdfFileWriter()
    filename0000 ----- filename 1000
    input = PdfFileReader(file(filename, "rb"))
    pageCount = input.getNumPages()
    for iPage in range(0, pageCount):
        output.addPage(input.getPage(iPage))
    outputStream = file("document-output.pdf", "wb")
    output.write(outputStream)
    outputStream.close()

When I execute the above code and reach input = PdfFileReader(file(filename500+, "rb")), I get an error message: IOError: [Errno
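The error message is cut off, so its cause cannot be confirmed from the excerpt. If it stems from keeping hundreds of input files open at once (an assumption), one way around it is to read each file fully into memory and close it before moving on. The sketch below uses the current `pypdf` package on Python 3 rather than the legacy API in the question, and trades file handles for memory; `merge_pdfs` is an illustrative name.

```python
import io


def merge_pdfs(paths, output_path):
    """Append every page of each input PDF to a single output file."""
    # Imported lazily (an assumption of this sketch) so that a missing
    # pypdf installation surfaces here with a clear ImportError.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in paths:
        # Read the whole file, then close it, so that only one input
        # file handle is open at any time.
        with open(path, "rb") as handle:
            data = io.BytesIO(handle.read())
        for page in PdfReader(data).pages:
            writer.add_page(page)
    with open(output_path, "wb") as out:
        writer.write(out)
```

With very large inputs the in-memory copies may become the bottleneck instead, in which case merging in smaller batches to intermediate files would be the next thing to try.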