pypdf | 易学教程

Change metadata of pdf file with pypdf

I'd like to create or modify the title of a PDF document using pypdf. The title seems to be read-only. Is there a way to get read/write access to this metadata? If so, a piece of code would be appreciated. Thanks.

You can manipulate the title with pyPdf (sort of). I came across this post on the reportlab-users list: http://two.pairlist.net/pipermail/reportlab-users/2009-November/009033.html

You can also use pyPdf: http://pybrary.net/pyPdf/ This won't let you edit the metadata per se, but it will let you read one or more PDF files and write them back out, possibly with new metadata. Here's the
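
A minimal Python 3 sketch of the read-and-write-back approach described in the answer. It assumes the modern `pypdf` package (not the legacy pyPdf discussed in the question) is installed. The helper names `info_entries` and `copy_with_title` are hypothetical and not part of any library.

```python
def info_entries(**fields):
    """Build PDF document-information entries, e.g. title="X" -> {"/Title": "X"}."""
    # Info-dictionary keys are PDF names: a leading slash plus a capitalised word.
    return {"/" + name[:1].upper() + name[1:]: value for name, value in fields.items()}


def copy_with_title(src_path, dst_path, title):
    """Copy every page of src_path into dst_path and set a new /Title."""
    # Assumption: pypdf >= 3 is installed. Imported lazily so that
    # info_entries() can be used without the dependency.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata(info_entries(title=title))
    with open(dst_path, "wb") as out:
        writer.write(out)
```

In this sketch only the pages are copied, so any other metadata in the source file is not carried over unless it is passed to `add_metadata` as well.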

How to get bookmark's page number

    from pyPdf import PdfFileReader
    f = open('document.pdf', 'rb')
    p = PdfFileReader(f)
    o = p.getOutlines()

The list o consists of pyPdf.pdf.Destination dictionary objects (bookmarks). They have many properties, but I can't find one that refers to the page number of the bookmark. How can I get the page number of, say, the o[1] bookmark?

For example, o[1].page.idnum returns a number that is roughly three times larger than the referenced page number in the PDF document. I assume it refers to some object smaller than a page, since running .page.idnum over the whole PDF outline returns an array of numbers which is
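
The `idnum` value is the PDF indirect-object number of the page object, not a page index, which is why it looks too large. A rough Python 3 sketch of turning such object numbers into page numbers follows. The function names and the sample object numbers are hypothetical. Collecting the real object numbers of the pages, in document order, depends on the library in use and is not shown.

```python
def build_page_lookup(page_object_ids):
    """Map each page's PDF object number to its zero-based page index."""
    return {obj_id: index for index, obj_id in enumerate(page_object_ids)}


def bookmark_page_number(bookmark_object_id, lookup):
    """Return the one-based page number for a bookmark, or None if unknown."""
    index = lookup.get(bookmark_object_id)
    return None if index is None else index + 1


# Hypothetical object numbers of pages 1..4, in document order.
lookup = build_page_lookup([3, 6, 9, 12])
print(bookmark_page_number(6, lookup))  # 2
```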

How to extract text from a PDF file in Python?

How can I extract text from a PDF file in Python? I tried the following:

    import sys
    import pyPdf

    def convertPdf2String(path):
        content = ""
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + " \n"
        content = " ".join(content.replace(u"\xa0", u" ").strip().split())
        return content

    f = open('a.txt','w+')
    f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
    f.close()

But the result is as follows, rather than readable text:

    728;ˇˆ˜ ˚ˇˇ!""˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ"ˆ˘"ˆˆˆ˜#$˙ˆ˚ˆ %&ˆ ˘˛ˆ˜'˙˙%˝˛ˆˇ˙ ˜ˆˆ˜'ˆ ˇˆ#$%&(

Parsing a PDF with no /Root object using PDFMiner

I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them.

IPython stack trace:

    /usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
        331 break
        332 else:
    --> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        334 if self.catalog.get('Type') is not LITERAL_CATALOG:
        335 if STRICT:

    PDFSyntaxError: No /Root object! - Is this really a PDF?

Of course, I immediately checked to see whether or not these PDFs were corrupted, but
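
As a rough first check for damaged or mislabeled files, the following Python 3 sketch looks for the PDF header near the start of the data and the end-of-file marker near the end. The function name `looks_like_pdf` and the 1024-byte windows are assumptions made for illustration. Passing this check does not guarantee that a parser will find a /Root object.

```python
def looks_like_pdf(data):
    """Heuristic check on raw bytes: '%PDF-' near the start, '%%EOF' near the end."""
    has_header = b"%PDF-" in data[:1024]
    has_eof = b"%%EOF" in data[-1024:]
    return has_header and has_eof


print(looks_like_pdf(b"%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n"))  # True
print(looks_like_pdf(b"<html><body>Not found</body></html>"))  # False
```

A check like this can separate truncated downloads or HTML error pages saved with a .pdf extension from files that are structurally plausible but still fail to parse.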

pyPdf for IndirectObject extraction

Following this example, I can list all the elements in a PDF file:

    import pyPdf
    pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
    list(pdf.pages)  # Process all the objects.
    print pdf.resolvedObjects

Now I need to extract a non-standard object from the PDF file. My object is named MYOBJECT and it is a string. The part of the script's output that concerns me is:

    {'/MYOBJECT': IndirectObject(584, 0)}

The PDF file is this:

    558 0 obj <</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources <</ColorSpace <</CS0 563 0 R>> /ExtGState <</GS0 568 0 R>>

PyPDF 2 Decrypt Not Working

Currently I am using PyPDF2 as a dependency. I have encountered some encrypted files and handled them as you normally would (in the following code):

    PDF = PdfFileReader(file(pdf_filepath, 'rb'))
    if PDF.isEncrypted:
        PDF.decrypt("")
        print PDF.getNumPages()

My file path looks something like "~/blah/FDJKL492019 21490 ,LFS.pdf". PDF.decrypt("") returns 1, which means it was successful. But when it hits print PDF.getNumPages(), it still raises the error "PyPDF2.utils.PdfReadError: File has not

Opening pdf urls with pyPdf

How would I open a PDF from a URL instead of from disk? Something like:

    input1 = PdfFileReader(file("http://example.com/a.pdf", "rb"))

I want to open several files from the web and download a merge of all the files.

I think urllib2 will get you what you want.

    from urllib2 import Request, urlopen
    from pyPdf import PdfFileWriter, PdfFileReader
    from StringIO import StringIO

    url = "http://www.silicontao.com/ProgrammingGuide/other/beejnet.pdf"
    writer = PdfFileWriter()

    remoteFile = urlopen(Request(url)).read()
    memoryFile = StringIO(remoteFile)
    pdfFile = PdfFileReader(memoryFile)

    for pageNum in xrange
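
For readers on Python 3, where urllib2 and StringIO no longer exist, the download-into-memory step can be sketched as follows. The function name `fetch_to_memory` and the 30-second default timeout are assumptions. `io.BytesIO` takes the place of StringIO because PDF data is bytes.

```python
import io
from urllib.request import urlopen


def fetch_to_memory(url, timeout=30):
    """Download url and return its bytes wrapped in a seekable in-memory file."""
    with urlopen(url, timeout=timeout) as response:
        return io.BytesIO(response.read())
```

The returned object is seekable, so it can be handed to a PDF reader that accepts file-like objects in the same way the answer passes `memoryFile` to `PdfFileReader`.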

Whitespace gone from PDF extraction, and strange word interpretation

Using the snippet below, I've attempted to extract the text data from this PDF file.

    import pyPdf

    def get_text(path):
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        content = ""
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text
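
To make clear what the "collapse whitespace" step in the snippet does, here is the same expression as a standalone Python 3 sketch (the function name `collapse_whitespace` is hypothetical). It only merges runs of whitespace that are already present, so it cannot restore spaces that the extraction step never produced.

```python
def collapse_whitespace(text):
    """Replace non-breaking spaces and squeeze all whitespace runs to single spaces."""
    # The replace() is kept for parity with the original snippet.
    return " ".join(text.replace("\xa0", " ").split())


print(collapse_whitespace("Hello\xa0 world \n\n next   page "))  # Hello world next page
print(collapse_whitespace("Helloworld"))  # Helloworld (missing space is not recovered)
```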

How to read line by line in pdf file using PyPdf?

I have some code to read from a PDF file. Is there a way to read line by line from the PDF file (not page by page) using pyPdf, Python 2.6, on Windows? Here is the code for reading the PDF pages:

    import pyPdf

    def getPDFContent(path):
        content = ""
        num_pages = 10
        p = file(path, "rb")
        pdf = pyPdf.PdfFileReader(p)
        for i in range(0, num_pages):
            content += pdf.getPage(i).extractText() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

Update: The calling code is this:

    f = open('test.txt','w')
    pdfl = getPDFContent("test.pdf").encode("ascii", "ignore")
    f.write(pdfl)
    f.close()
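
Note that the `" ".join(... .split())` step in the code above also removes every newline, so any line structure is lost before the text is returned. A minimal Python 3 sketch that normalises whitespace per line instead is shown below. The function name `iter_lines` and the sample strings are hypothetical. Whether the extracted page text contains newlines at real line boundaries depends on the extractor and the PDF, which is an assumption here.

```python
def iter_lines(page_texts):
    """Yield non-empty, whitespace-normalised lines from per-page text strings."""
    for text in page_texts:
        for raw_line in text.splitlines():
            line = " ".join(raw_line.split())
            if line:
                yield line


# Hypothetical extracted text for two pages.
pages = ["First line\nSecond\xa0line\n\n", "Third   line"]
print(list(iter_lines(pages)))  # ['First line', 'Second line', 'Third line']
```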

pypdf Merging multiple pdf files into one pdf

If I have 1000+ PDF files that need to be merged into one PDF:

    input = PdfFileReader()
    output = PdfFileWriter()
    filename0000 ----- filename 1000
    input = PdfFileReader(file(filename, "rb"))
    pageCount = input.getNumPages()
    for iPage in range(0, pageCount):
        output.addPage(input.getPage(iPage))
    outputStream = file("document-output.pdf", "wb")
    output.write(outputStream)
    outputStream.close()

When I execute the above code and reach input = PdfFileReader(file(filename500+, "rb")), I get an error message:

    IOError: [Errno 24] Too many open files:

I think this is a bug. If not, what should I do?

I recently came across this
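
One possible workaround, sketched in Python 3: read each input file fully into memory so that its file handle is closed immediately, instead of keeping every input open until the final write. This is a sketch under assumptions, not the library's prescribed method. It assumes the modern `pypdf` package is installed and that all inputs fit in memory. The helper names `load_into_memory` and `merge_pdfs` are hypothetical.

```python
import io


def load_into_memory(path):
    """Read a file completely and return it as an in-memory binary stream."""
    # The with-block closes the OS-level handle before the function returns.
    with open(path, "rb") as handle:
        return io.BytesIO(handle.read())


def merge_pdfs(paths, out_path):
    """Append all pages of every file in paths to a single output PDF."""
    # Assumption: pypdf >= 3 is installed. Imported lazily so that
    # load_into_memory() can be used without the dependency.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(load_into_memory(path))
        for page in reader.pages:
            writer.add_page(page)
    with open(out_path, "wb") as out:
        writer.write(out)
```

This trades open file handles for memory. With very large inputs, merging in smaller batches into intermediate files may be a better fit.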