pypdf | 易学教程

Parsing a PDF with no /Root object using PDFMiner

Question: I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them.
ipython stack trace:
    /usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
        331                 break
        332         else:
    --> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        334         if self.catalog.get('Type') is not LITERAL_CATALOG:
        335             if STRICT:
    PDFSyntaxError: No /Root object! - Is
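
The error message itself suggests one thing worth ruling out first: that the file is not a PDF at all, for example an HTML error page saved with a .pdf extension. The following Python 3 sketch is not part of PDFMiner; `looks_like_pdf` is a hypothetical helper, and the 1024-byte window is an assumed value. It only checks for a `%PDF-` header near the start and an `%%EOF` marker near the end, so it is a heuristic and not proof that a file is valid.

```python
import io


def looks_like_pdf(stream, window=1024):
    """Heuristic: '%PDF-' appears near the start and '%%EOF' near the end.

    `stream` must be a seekable binary file object. `window` (an assumed
    default) is how many bytes to inspect at each end of the file.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()

    # Read the first `window` bytes.
    stream.seek(0)
    head = stream.read(window)

    # Read the last `window` bytes (or the whole file if it is smaller).
    stream.seek(max(0, size - window))
    tail = stream.read(window)

    return b"%PDF-" in head and b"%%EOF" in tail
```

Files that fail this check can be logged and skipped before they reach the parser. Files that pass may still be damaged in other ways.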

How to get bookmark's page number

Question:
    from pyPdf import PdfFileReader
    f = open('document.pdf', 'rb')
    p = PdfFileReader(f)
    o = p.getOutlines()
The list o consists of pyPdf.pdf.Destination dictionary objects (bookmarks), which have many properties, but I can't find any that refers to the bookmark's page number. How can I get the page number of, say, the o[1] bookmark? For example, o[1].page.idnum returns a number that is approximately 3 times bigger than the referenced page number in the PDF document, which I assume references some object

Cropping pages of a .pdf file

I was wondering if anyone had any experience working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size. After a quick Google search I found the pyPdf library for Python, but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object, the results were not what I had expected and appeared to be quite random. Has anyone had any experience with this? Code examples would be much appreciated, preferably in Python.
danio: pypdf does what I expect in this area. Using the following script: #!/usr/bin/python #

How to extract text from a PDF file in Python?

Question: How can I extract text from a PDF file in Python? I tried the following:
    import sys
    import pyPdf
    def convertPdf2String(path):
        content = ""
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + " \n"
        content = " ".join(content.replace(u"\xa0", u" ").strip().split())
        return content
    f = open('a.txt','w+')
    f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
    f.close()
But the result is as follows, rather

PDF - Remove White Margins

I would like to know a way to remove white margins from a PDF file, just like Adobe Acrobat X Pro does. I understand it will not work with every PDF file. I would guess that the way to do it is to find the text margins and then crop to those margins. PyPdf is preferred. iText finds text margins based on this code:
    public void addMarginRectangle(String src, String dest) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT))

pyPdf for IndirectObject extraction

Question: Following this example, I can list all the elements in a PDF file:
    import pyPdf
    pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
    list(pdf.pages) # Process all the objects.
    print pdf.resolvedObjects
Now I need to extract a non-standard object from the PDF file. My object is named MYOBJECT and it is a string. The part of the Python script's output that concerns me is:
    {'/MYOBJECT': IndirectObject(584, 0)}
The PDF file is this:
    558 0 obj <</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox

How to read line by line in pdf file using PyPdf?

Question: I have some code to read from a PDF file. Is there a way to read line by line from the PDF file (not page by page) using pyPdf, Python 2.6, on Windows? Here is the code for reading the PDF pages:
    import pyPdf
    def getPDFContent(path):
        content = ""
        num_pages = 10
        p = file(path, "rb")
        pdf = pyPdf.PdfFileReader(p)
        for i in range(0, num_pages):
            content += pdf.getPage(i).extractText() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
Update: The call code is this: f=
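
Note that the `" ".join(... .split())` step in the code above collapses every run of whitespace, newlines included, so any line structure in the extracted text is lost at that point. Below is a minimal Python 3 sketch of a line-preserving alternative that works on text that has already been extracted. `iter_text_lines` is a hypothetical helper, not a pyPdf function. Whether the extracted text contains meaningful line breaks at all depends on the PDF and on the extraction library.

```python
def iter_text_lines(page_texts):
    """Yield cleaned, non-empty lines from an iterable of per-page strings."""
    for text in page_texts:
        # Replace non-breaking spaces, then split on line boundaries
        # instead of collapsing them.
        for line in text.replace("\xa0", " ").splitlines():
            line = " ".join(line.split())  # collapse whitespace within one line only
            if line:
                yield line
```

With pyPdf, the input could be the per-page results of `extractText()`, collected in a list rather than concatenated into a single string.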

Opening pdf urls with pyPdf

Question: How would I open a PDF from a URL instead of from the disk? Something like:
    input1 = PdfFileReader(file("http://example.com/a.pdf", "rb"))
I want to open several files from the web and download a merge of all the files.
Answer 1: I think urllib2 will get you what you want.
    from urllib2 import Request, urlopen
    from pyPdf import PdfFileWriter, PdfFileReader
    from StringIO import StringIO
    url = "http://www.silicontao.com/ProgrammingGuide/other/beejnet.pdf"
    writer = PdfFileWriter()
    remoteFile = urlopen(Request
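
The answer's code is cut off in this excerpt. As an independent Python 3 sketch of the same idea, using only the standard library: the remote bytes are read fully into memory and wrapped in `io.BytesIO`, which gives a seekable file-like object of the kind PDF readers generally need. `fetch_pdf_stream` is a hypothetical helper and the 30-second timeout is an assumed default. Passing the result to a reader such as `PdfFileReader` is left out here.

```python
import io
from urllib.request import urlopen


def fetch_pdf_stream(url, timeout=30):
    """Download `url` completely and return its bytes as a seekable stream."""
    with urlopen(url, timeout=timeout) as response:
        return io.BytesIO(response.read())
```

Reading the whole response into memory keeps the sketch simple. For very large files, downloading to a temporary file may be a better fit.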

Whitespace gone from PDF extraction, and strange word interpretation

Question: Using the snippet below, I've attempted to extract the text data from this PDF file.
    import pyPdf
    def get_text(path):
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        content = ""
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
The output I obtain, however, is devoid of whitespace

Extract images from PDF without resampling, in python?

How might one extract all images from a PDF document, at native resolution and format? (Meaning extract tiff as tiff, jpeg as jpeg, etc., and without resampling.) Layout is unimportant; I don't care where the source image is located on the page. I'm using Python 2.7 but can use 3.x if required. Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code:
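
The author's sample code is not included in this excerpt. As an independent sketch of the byte-range idea, assuming Python 3 and JPEG data only: `extract_jpegs` is a hypothetical helper that scans for the JPEG start-of-image marker (`FF D8`) and the next end-of-image marker (`FF D9`). It is a heuristic that ignores the PDF object structure, so it can miss images stored with additional filters and can cut an image short if the same byte pair occurs earlier inside it (for instance in an embedded thumbnail).

```python
JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return a list of byte strings that look like embedded JPEG files."""
    images = []
    pos = 0
    while True:
        start = pdf_bytes.find(JPEG_START, pos)
        if start < 0:
            break
        end = pdf_bytes.find(JPEG_END, start + 2)
        if end < 0:
            break
        # Include the two bytes of the end marker in the extracted range.
        images.append(pdf_bytes[start:end + 2])
        pos = end + 2
    return images
```

Each returned byte string can be written to a `.jpg` file in binary mode without re-encoding, so nothing is resampled.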