pypdf

Parsing a PDF with no /Root object using PDFMiner

北城余情 submitted on 2019-11-27 18:10:44
Question: I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them. IPython stack trace:
    /usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
        331                 break
        332         else:
    --> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        334         if self.catalog.get('Type') is not LITERAL_CATALOG:
        335             if STRICT:
    PDFSyntaxError: No /Root object! - Is
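The error message itself asks whether the file is really a PDF, so one cheap first step is to screen out files that lack the `%PDF-` header before handing them to the parser. The following is a minimal sketch that does not touch PDFMiner at all; `looks_like_pdf` and the 1024-byte `window` are hypothetical choices for illustration, and passing this check does not guarantee that the file has a usable /Root object.

```python
def looks_like_pdf(path, window=1024):
    """Return True if the '%PDF-' marker appears in the first `window` bytes.

    Assumption: the header may be preceded by a little junk, so a small
    window is searched rather than only the very first bytes.
    """
    with open(path, "rb") as handle:
        head = handle.read(window)
    return b"%PDF-" in head
```

Files that fail the check (for example HTML error pages saved with a .pdf extension) can be logged and skipped instead of raising inside the extraction loop.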

How to get bookmark's page number

蹲街弑〆低调 submitted on 2019-11-27 15:12:57
Question:
    from pyPdf import PdfFileReader
    f = open('document.pdf', 'rb')
    p = PdfFileReader(f)
    o = p.getOutlines()
The list o consists of pyPdf.pdf.Destination dictionary objects (bookmarks), which have many properties, but I can't find any that refers to the bookmark's page number. How can I return the page number of, let's say, the o[1] bookmark? For example, o[1].page.idnum returns a number that is approximately 3 times bigger than the referenced page number in the PDF document, which I assume references some object
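If the asker's guess is right and `idnum` is an object number rather than a page index, then a lookup table from object number to page number would bridge the gap. The sketch below assumes you can already collect the object number of every page in document order; how to do that depends on the library version and is not shown. `build_page_lookup` and the sample numbers are hypothetical.

```python
def build_page_lookup(page_object_ids):
    """Map each page's object number to its 1-based page number.

    `page_object_ids` is assumed to list the object numbers of the pages
    in the order the pages appear in the document.
    """
    return {idnum: number
            for number, idnum in enumerate(page_object_ids, start=1)}


# Hypothetical object numbers for a three-page document, in page order.
lookup = build_page_lookup([3, 6, 9])
print(lookup[6])  # prints 2: object 6 is the second page
```

With such a table, the bookmark's object number becomes a dictionary key instead of something to be divided by a guessed factor.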

Cropping pages of a .pdf file

随声附和 submitted on 2019-11-27 11:43:52
I was wondering if anyone had any experience in working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size. After a quick Google search I found the pyPdf library for Python, but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object, the results were not what I had expected and appeared to be quite random. Has anyone had any experience with this? Code examples would be much appreciated, preferably in Python.
danio: pypdf does what I expect in this area. Using the following script:
    #!/usr/bin/python
    #
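One possible source of "random" results, though not necessarily the asker's, is the coordinate convention: PDF boxes are given as a lower-left and an upper-right corner in points (1/72 inch), with y increasing upward by default. The sketch below only does the arithmetic for a new box and leaves applying it to a page object to the library in use; `crop_rect` and its parameters are hypothetical names.

```python
def crop_rect(box, left=0, bottom=0, right=0, top=0):
    """Shrink a box (x0, y0, x1, y1) by the given margins, in PDF points.

    (x0, y0) is the lower-left corner and (x1, y1) the upper-right corner.
    """
    x0, y0, x1, y1 = box
    new_box = (x0 + left, y0 + bottom, x1 - right, y1 - top)
    if new_box[0] >= new_box[2] or new_box[1] >= new_box[3]:
        raise ValueError("margins leave no visible area")
    return new_box


# US Letter page: trim half an inch on the left and one inch at the top.
print(crop_rect((0, 0, 612, 792), left=36, top=72))  # (36, 0, 612, 720)
```

Note that trimming the top lowers the upper-right y value; treating the origin as top-left would crop the bottom of the page instead.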

How to extract text from a PDF file in Python?

馋奶兔 submitted on 2019-11-27 10:22:44
Question: How can I extract text from a PDF file in Python? I tried the following:
    import sys
    import pyPdf
    def convertPdf2String(path):
        content = ""
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + " \n"
        content = " ".join(content.replace(u"\xa0", u" ").strip().split())
        return content
    f = open('a.txt','w+')
    f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
    f.close()
But the result is as follows, rather

PDF - Remove White Margins

廉价感情. submitted on 2019-11-27 08:06:51
I would like to know a way to remove white margins from a PDF file, just like Adobe Acrobat X Pro does. I understand it will not work with every PDF file. I would guess that the way to do it is to get the text margins and then crop to those margins. PyPdf is preferred. iText finds text margins based on this code:
    public void addMarginRectangle(String src, String dest) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT))

pyPdf for IndirectObject extraction

半城伤御伤魂 submitted on 2019-11-27 01:40:52
Question: Following this example, I can list all the elements in a PDF file:
    import pyPdf
    pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
    list(pdf.pages)  # Process all the objects.
    print pdf.resolvedObjects
Now I need to extract a non-standard object from the PDF file. My object is the one named MYOBJECT, and it is a string. The piece printed by the Python script that concerns me is:
    {'/MYOBJECT': IndirectObject(584, 0)}
The PDF file is this:
    558 0 obj <</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox

How to read line by line in pdf file using PyPdf?

℡╲_俬逩灬. submitted on 2019-11-26 22:45:36
Question: I have some code to read from a PDF file. Is there a way to read line by line from the PDF file (not pages) using pyPdf, Python 2.6, on Windows? Here is the code for reading the PDF pages:
    import pyPdf
    def getPDFContent(path):
        content = ""
        num_pages = 10
        p = file(path, "rb")
        pdf = pyPdf.PdfFileReader(p)
        for i in range(0, num_pages):
            content += pdf.getPage(i).extractText() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
Update: The call code is this: f=
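One thing visible in the posted function is that the final `" ".join(... .split())` replaces every run of whitespace, newlines included, with a single space, so any line structure is gone before the caller sees it. The sketch below normalises whitespace per line instead. It assumes the extracted text contains line breaks in the first place, which depends on the PDF and the library and is not guaranteed; `split_clean_lines` is a hypothetical name.

```python
def split_clean_lines(text):
    """Return the non-empty lines of `text` with whitespace collapsed.

    Non-breaking spaces are treated as ordinary spaces, as in the
    original snippet, but line breaks are kept as separators.
    """
    lines = []
    for raw in text.replace(u"\xa0", u" ").splitlines():
        cleaned = " ".join(raw.split())
        if cleaned:
            lines.append(cleaned)
    return lines


sample = u"First \xa0 line\n\n   second\tline  \n"
for line in split_clean_lines(sample):
    print(line)
```

The function works on a plain string, so it can be applied to the text of each page separately before the pages are concatenated.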

Opening pdf urls with pyPdf

偶尔善良 submitted on 2019-11-26 16:37:48
Question: How would I open a PDF from a URL instead of from the disk? Something like
    input1 = PdfFileReader(file("http://example.com/a.pdf", "rb"))
I want to open several files from the web and download a merge of all the files.
Answer 1: I think urllib2 will get you what you want.
    from urllib2 import Request, urlopen
    from pyPdf import PdfFileWriter, PdfFileReader
    from StringIO import StringIO
    url = "http://www.silicontao.com/ProgrammingGuide/other/beejnet.pdf"
    writer = PdfFileWriter()
    remoteFile = urlopen(Request
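The answer's imports are Python 2 only; in Python 3 the same idea uses `urllib.request` and `io.BytesIO`. The sketch below covers just the download step: it reads the whole response into memory so that the result is seekable, which a raw HTTP response is not. `fetch_to_memory` and the `timeout` default are hypothetical, and the merge step is left out because the original answer is cut off before it.

```python
from io import BytesIO
from urllib.request import urlopen


def fetch_to_memory(url, timeout=30):
    """Download `url` and return its bytes wrapped in a seekable BytesIO."""
    with urlopen(url, timeout=timeout) as response:
        return BytesIO(response.read())
```

The returned buffer behaves like a file opened in "rb" mode, so it can presumably be passed wherever the question's code passes a file object, for example to `PdfFileReader`. Very large files may be better saved to a temporary file than held in memory.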

Whitespace gone from PDF extraction, and strange word interpretation

余生颓废 submitted on 2019-11-26 16:31:57
Question: Using the snippet below, I've attempted to extract the text data from this PDF file.
    import pyPdf
    def get_text(path):
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        content = ""
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
The output I obtain, however, is devoid of whitespace

Extract images from PDF without resampling, in python?

心不动则不痛 submitted on 2019-11-26 15:53:53
How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., without resampling.) Layout is unimportant; I don't care where the source image is located on the page. I'm using Python 2.7 but can use 3.x if required.
Often in a PDF, the image is simply stored as-is. For example, a PDF with a JPG inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid JPG file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code:
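The answer's own sample code is not included in this excerpt, so what follows is not that code but a minimal sketch of the byte-range idea it describes. It scans raw bytes for the JPEG start-of-image marker (`FF D8 FF`) and the next end-of-image marker (`FF D9`). `find_jpeg_ranges` is a hypothetical name, and the approach is deliberately naive: it covers only JPEGs stored as-is and can cut a file short if the end marker also occurs inside the image data (for example in an embedded thumbnail).

```python
def find_jpeg_ranges(data):
    """Yield (start, end) offsets of byte runs that look like JPEG files.

    A run starts at the SOI marker FF D8 FF and ends just after the
    first EOI marker FF D9 that follows it.
    """
    start_marker = b"\xff\xd8\xff"
    end_marker = b"\xff\xd9"
    pos = 0
    while True:
        start = data.find(start_marker, pos)
        if start < 0:
            return
        end = data.find(end_marker, start + len(start_marker))
        if end < 0:
            return
        end += len(end_marker)
        yield start, end
        pos = end
```

Given the bytes of a PDF read with `open(path, "rb").read()`, each `data[start:end]` slice can be written to its own `.jpg` file without any decoding or resampling.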