pypdf

Change metadata of pdf file with pypdf

拥有回忆 submitted on 2019-11-29 02:03:30
I'd like to create or modify the title of a PDF document using pypdf. It seems that the title is read-only. Is there a way to get read/write access to this metadata? If so, a piece of code would be appreciated. Thanks.

You can manipulate the title with pyPDF (sort of). I came across this post on the reportlab-users list: http://two.pairlist.net/pipermail/reportlab-users/2009-November/009033.html

You can also use pypdf: http://pybrary.net/pyPdf/ This won't let you edit the metadata per se, but it will let you read one or more PDF files and write them back out, possibly with new metadata. Here's the
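The read-and-write-back approach described in the last answer could look roughly like the sketch below. It assumes the current `pypdf` package (`PdfReader`, `PdfWriter`, `add_page`, `add_metadata`) rather than the legacy pyPdf the thread links to. The helper names `with_pdf_keys` and `rewrite_with_metadata` are hypothetical, and the paths are whatever the caller supplies.

```python
def with_pdf_keys(fields):
    """Prefix plain field names with '/', as PDF info-dictionary keys require."""
    return {
        (key if key.startswith("/") else "/" + key): value
        for key, value in fields.items()
    }


def rewrite_with_metadata(src_path, dst_path, fields):
    """Copy every page of src_path to dst_path and attach new metadata."""
    # Assumption: the modern pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    # Carry over the existing info dictionary, then override selected fields.
    if reader.metadata is not None:
        writer.add_metadata(reader.metadata)
    writer.add_metadata(with_pdf_keys(fields))

    with open(dst_path, "wb") as out:
        writer.write(out)
```

A call such as `rewrite_with_metadata("in.pdf", "out.pdf", {"Title": "New title"})` would leave the source file untouched and write a copy carrying the new title.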

How to get bookmark's page number

霸气de小男生 submitted on 2019-11-29 00:11:15
    from pyPdf import PdfFileReader
    f = open('document.pdf', 'rb')
    p = PdfFileReader(f)
    o = p.getOutlines()

The list o consists of pyPdf.pdf.Destination dictionary objects (bookmarks). Each has many properties, but I can't find one that gives the page number the bookmark refers to.

How can I get the page number of, let's say, the bookmark o[1]?

For example, o[1].page.idnum returns a number which is approximately 3 times bigger than the referenced page number in the PDF document. I assume it references some object smaller than a page, as running .page.idnum on the whole PDF document outline returns an array of numbers which is
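For context, `idnum` is the number of the PDF object the bookmark points at, not a page number, so it can easily grow faster than the page count. One possible way to recover the page number is to build a lookup table from each page's object number to its position in the document. The sketch below assumes the current `pypdf` package, where pages expose `indirect_reference` and the reader exposes `outline`. The function names are hypothetical, and bookmarks whose target is not an object reference are skipped.

```python
def index_by_object_id(page_object_ids):
    """Map each page's PDF object number to its zero-based page index."""
    return {obj_id: index for index, obj_id in enumerate(page_object_ids)}


def flatten(outline):
    """Yield bookmarks from a nested outline list, depth first."""
    for item in outline:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def bookmark_pages(path):
    """Return (title, zero-based page index) pairs for the bookmarks in path."""
    # Assumption: modern pypdf; attribute names differ in legacy pyPdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    lookup = index_by_object_id(
        page.indirect_reference.idnum for page in reader.pages
    )
    result = []
    for bookmark in flatten(reader.outline):
        target = getattr(bookmark.page, "idnum", None)  # None if not a reference
        if target in lookup:
            result.append((bookmark.title, lookup[target]))
    return result
```

Recent pypdf versions also provide `reader.get_destination_page_number(bookmark)`, which may be simpler when it is available.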

How to extract text from a PDF file in Python?

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-28 17:36:54
How can I extract text from a PDF file in Python? I tried the following:

    import sys
    import pyPdf

    def convertPdf2String(path):
        content = ""
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + " \n"
        content = " ".join(content.replace(u"\xa0", u" ").strip().split())
        return content

    f = open('a.txt','w+')
    f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
    f.close()

But the result is as follows, rather than readable text:

    728;ˇˆ˜ ˚ˇˇ!""˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ"ˆ˘"ˆˆˆ˜#$˙ˆ˚ˆ %&ˆ ˘˛ˆ˜'˙˙%˝˛ˆˇ˙ ˜ˆˆ˜'ˆ ˇˆ#$%&(

Parsing a PDF with no /Root object using PDFMiner

﹥>﹥吖頭↗ submitted on 2019-11-28 10:01:04
I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them.

ipython stack trace:

    /usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
        331                 break
        332         else:
    --> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        334         if self.catalog.get('Type') is not LITERAL_CATALOG:
        335             if STRICT:

    PDFSyntaxError: No /Root object! - Is this really a PDF?

Of course, I immediately checked to see whether or not these PDFs were corrupted, but

pyPdf for IndirectObject extraction

血红的双手。 submitted on 2019-11-28 07:01:54
Following this example, I can list all the elements in a PDF file:

    import pyPdf
    pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
    list(pdf.pages) # Process all the objects.
    print pdf.resolvedObjects

Now I need to extract a non-standard object from the PDF file. My object is named MYOBJECT and it is a string. The part of the Python script's output that concerns me is:

    {'/MYOBJECT': IndirectObject(584, 0)}

The PDF file is this:

    558 0 obj <</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources <</ColorSpace <</CS0 563 0 R>> /ExtGState <</GS0 568 0 R>>

PyPDF 2 Decrypt Not Working

烈酒焚心 submitted on 2019-11-27 23:46:49
Question: Currently I am using PyPDF2 as a dependency. I have encountered some encrypted files and handled them as you normally would, with the following code:

    PDF = PdfFileReader(file(pdf_filepath, 'rb'))
    if PDF.isEncrypted:
        PDF.decrypt("")
        print PDF.getNumPages()

My filepath looks something like "~/blah/FDJKL492019 21490 ,LFS.pdf". PDF.decrypt("") returns 1, which means it was successful. But when it hits print PDF.getNumPages(), it still raises the error, "PyPDF2.utils.PdfReadError: File has not

Opening pdf urls with pyPdf

☆樱花仙子☆ submitted on 2019-11-27 21:41:27
How would I open a PDF from a URL instead of from disk? Something like:

    input1 = PdfFileReader(file("http://example.com/a.pdf", "rb"))

I want to open several files from the web and download a merge of all the files.

I think urllib2 will get you what you want.

    from urllib2 import Request, urlopen
    from pyPdf import PdfFileWriter, PdfFileReader
    from StringIO import StringIO

    url = "http://www.silicontao.com/ProgrammingGuide/other/beejnet.pdf"
    writer = PdfFileWriter()

    remoteFile = urlopen(Request(url)).read()
    memoryFile = StringIO(remoteFile)
    pdfFile = PdfFileReader(memoryFile)

    for pageNum in xrange
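A Python 3 take on the same idea (download the bytes, then wrap them in an in-memory file the PDF reader can seek in) might look like the sketch below. `urllib.request.urlopen` and `io.BytesIO` are standard library. The use of the current `pypdf` package and the helper names `fetch_to_memory` and `merge_urls` are assumptions, not part of the original answer.

```python
import io
from urllib.request import urlopen


def fetch_to_memory(url):
    """Download url and return its bytes wrapped in a seekable in-memory file."""
    with urlopen(url) as response:
        return io.BytesIO(response.read())


def merge_urls(urls, dst_path):
    """Append the pages of every PDF in urls to one output file."""
    # Assumption: modern pypdf is installed; the original answer uses pyPdf.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for url in urls:
        reader = PdfReader(fetch_to_memory(url))
        for page in reader.pages:
            writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)
```

Holding each download in memory keeps the code short. For very large files, writing to a temporary file first may be the better trade-off.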

Whitespace gone from PDF extraction, and strange word interpretation

梦想的初衷 submitted on 2019-11-27 21:34:01
Using the snippet below, I've attempted to extract the text data from this PDF file.

    import pyPdf

    def get_text(path):
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        content = ""
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text

How to read line by line in pdf file using PyPdf?

拟墨画扇 submitted on 2019-11-27 19:05:01
I have some code to read from a PDF file. Is there a way to read line by line from the PDF file (not page by page) using pyPdf, Python 2.6, on Windows? Here is the code for reading the PDF pages:

    import pyPdf

    def getPDFContent(path):
        content = ""
        num_pages = 10
        p = file(path, "rb")
        pdf = pyPdf.PdfFileReader(p)
        for i in range(0, num_pages):
            content += pdf.getPage(i).extractText() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

Update: The calling code is this:

    f= open('test.txt','w')
    pdfl = getPDFContent("test.pdf").encode("ascii", "ignore")
    f.write(pdfl)
    f.close()
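One thing worth noting about the function above: the final `" ".join(... .split())` collapses every run of whitespace, newlines included, so any line boundaries in the extracted text are gone before the caller sees them. A minimal sketch that cleans each line separately is shown below. It assumes the current `pypdf` package and its `extract_text()` method, and the names `iter_clean_lines` and `pdf_lines` are hypothetical. Whether the extractor actually emits a newline at the end of each visual line depends on the PDF.

```python
def iter_clean_lines(text):
    """Yield non-empty lines of text with inner whitespace collapsed."""
    for raw in text.replace("\xa0", " ").splitlines():
        line = " ".join(raw.split())
        if line:
            yield line


def pdf_lines(path):
    """Yield cleaned lines from every page of the PDF at path."""
    # Assumption: modern pypdf, where page text comes from extract_text().
    from pypdf import PdfReader

    for page in PdfReader(path).pages:
        yield from iter_clean_lines(page.extract_text() or "")
```

With this shape, `for line in pdf_lines("test.pdf"): ...` processes one line at a time instead of one large string.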

pypdf Merging multiple pdf files into one pdf

做~自己de王妃 submitted on 2019-11-27 18:47:05
I have 1000+ PDF files that need to be merged into one PDF:

    input = PdfFileReader()
    output = PdfFileWriter()
    filename0000 ----- filename 1000
        input = PdfFileReader(file(filename, "rb"))
        pageCount = input.getNumPages()
        for iPage in range(0, pageCount):
            output.addPage(input.getPage(iPage))
    outputStream = file("document-output.pdf", "wb")
    output.write(outputStream)
    outputStream.close()

When I run the above code and it reaches input = PdfFileReader(file(filename500+, "rb")), I get this error message: IOError: [Errno 24] Too many open files. I think this is a bug. If not, what should I do?

I recently came across this
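Errno 24 means the process has hit the operating system's limit on open file descriptors. The loop in the question opens every input with `file(...)` and never closes it, and the inputs presumably have to stay readable until the final `write`, so the handles pile up. One possible workaround, sketched below, is to read each file fully into memory inside a `with` block so its descriptor is released at once. It assumes the current `pypdf` package; `load_closed` and `merge_files` are hypothetical names. The cost is that all input bytes stay in memory until the output is written.

```python
import io


def load_closed(path):
    """Read path fully into memory; the file descriptor is closed on return."""
    with open(path, "rb") as handle:
        return io.BytesIO(handle.read())


def merge_files(paths, dst_path):
    """Merge the PDFs in paths into dst_path without keeping inputs open."""
    # Assumption: modern pypdf is installed (the thread uses legacy pyPdf).
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(load_closed(path))
        for page in reader.pages:
            writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)
```

If memory is the tighter constraint, merging in batches of a few hundred files and then merging the intermediate results is another option.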