pypdf2

PyPDF2 split pdf by pages

女生的网名这么多〃 submitted on 2019-12-22 06:33:02
Question: I want to split a PDF file using PyPDF2. Every example I find online is too complicated, does not work, or fails with "AttributeError: 'PdfFileWriter' object has no attribute 'stream'". Can someone help? I need to separate one PDF with 3 pages into three different files. I'm starting from this:
    pdfFileObj = open(r"D:\BPO\act.pdf", 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pdfWriter = PyPDF2.PdfFileWriter()
    pdfWriter.addPage(pdfReader.getPage(0))
But I don't know what to do next :( EDIT#1 Was
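A minimal sketch of one way to continue the snippet above, assuming the legacy PyPDF2 1.x API that the question uses (PdfFileReader, PdfFileWriter, getPage, addPage). The idea is to create a fresh writer for each page and write it to its own file. The function names and the output naming scheme are hypothetical, not taken from the question.

```python
import os


def page_output_path(source_path, page_index, out_dir=None):
    """Build an output name such as 'act_page_1.pdf' (hypothetical naming scheme)."""
    base = os.path.splitext(os.path.basename(source_path))[0]
    folder = out_dir if out_dir is not None else os.path.dirname(source_path)
    return os.path.join(folder, "%s_page_%d.pdf" % (base, page_index + 1))


def split_pdf(source_path, out_dir=None):
    """Write every page of source_path to its own single-page PDF."""
    # Imported lazily so the naming helper can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    written = []
    with open(source_path, "rb") as src:
        reader = PdfFileReader(src)
        for index in range(reader.getNumPages()):
            writer = PdfFileWriter()  # one new writer per output file
            writer.addPage(reader.getPage(index))
            target = page_output_path(source_path, index, out_dir)
            with open(target, "wb") as dst:
                writer.write(dst)  # written while the source file is still open
            written.append(target)
    return written
```

Calling split_pdf(r"D:\BPO\act.pdf") on the three-page file from the question would be expected to produce three files next to the original, one per page.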

Change metadata of pdf file with pypdf2

谁都会走 submitted on 2019-12-21 20:58:55
Question: I want to add a metadata key-value pair to the metadata of a PDF file. I found an answer that is several years old, but I think it is way too complicated. I guess there is an easier way today: https://stackoverflow.com/a/3257340/633961 I am not wedded to PyPDF2; if there is an easier way, I will use that instead.
Answer 1: You can do that using pdfrw:
    pip install pdfrw
Then run:
    from pdfrw import PdfReader, PdfWriter
    trailer = PdfReader("myfile.pdf")
    trailer.Info.WhoAmI = "Tarun Lalwani"
    PdfWriter("edited.pdf
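For comparison, here is a hedged sketch of the same task done with PyPDF2 itself, assuming the legacy 1.x API, where PdfFileWriter provides addMetadata. The helper names and file paths are hypothetical. Keys in the PDF info dictionary start with a slash, so the first helper adds one when it is missing.

```python
def as_info_keys(pairs):
    """Return a copy of pairs whose keys all start with '/', as PDF info keys do."""
    return {
        (key if key.startswith("/") else "/" + key): value
        for key, value in pairs.items()
    }


def add_metadata(source_path, target_path, pairs):
    """Copy all pages to a new file and attach the given metadata pairs."""
    # Imported lazily so as_info_keys can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(source_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))
        writer.addMetadata(as_info_keys(pairs))  # values must be strings
        with open(target_path, "wb") as dst:
            writer.write(dst)
```

A call such as add_metadata("myfile.pdf", "edited.pdf", {"WhoAmI": "Tarun Lalwani"}) mirrors the pdfrw answer. Note that this sketch does not carry over the info entries of the source file; they would have to be passed in as well.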

PyPDF2 complete clone of file

本秂侑毒 submitted on 2019-12-13 18:57:08
Question: I am trying to copy a PDF in its entirety using PyPDF2. The following code copies the content but not the outline of the PDF. Here is a sample PDF; run the code as follows:
    python test.py <input pdf> <output dest>
Here is the code that I have so far:
    from PyPDF2 import PdfFileWriter, PdfFileReader
    import sys
    import os.path
    def main(argv):
        if not os.path.isfile(argv[0]) and \
                not os.path.isfile(argv[1]):
            print("Invalid path")
            sys.exit()
        input_pdf = PdfFileReader(open(argv[0], "rb"))
        output
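One possible way to carry the outline across, sketched under the assumption of the legacy PyPDF2 1.x API: getOutlines returns a nested list in which a sub-list holds the children of the preceding entry, and addBookmark returns a reference that can serve as the parent of later bookmarks. The function names are hypothetical, and the sketch copies only titles and target pages, not colours or zoom settings.

```python
def copy_outline(reader, writer, outline=None, parent=None):
    """Recreate the reader's bookmarks on the writer; return how many were copied."""
    if outline is None:
        outline = reader.getOutlines()
    copied = 0
    last = parent
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the preceding bookmark.
            copied += copy_outline(reader, writer, item, last)
            continue
        page_number = reader.getDestinationPageNumber(item)
        if page_number < 0:
            continue  # destination does not resolve to a page; skip it
        last = writer.addBookmark(item.title, page_number, parent)
        copied += 1
    return copied


def clone_with_outline(source_path, target_path):
    """Copy pages first, then the outline, so bookmark page numbers are valid."""
    # Imported lazily so copy_outline can be exercised with stand-in objects.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(source_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))
        copy_outline(reader, writer)
        with open(target_path, "wb") as dst:
            writer.write(dst)
```

The pages are added before the bookmarks because each bookmark refers to a page by its index in the writer.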

How to comma separate words when using Pypdf2 library

不打扰是莪最后的温柔 submitted on 2019-12-13 04:34:40
Question: I'm converting a PDF to text using PyPDF2, and in the output some words get mixed together. The code is shown below:
    filename = 'CS1.pdf'
    pdfFileObj = open(filename,'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        print(pageObj)
        text += pageObj.extractText()
    if text != "":
        text = text
    else:
        text = textract.process('/home/ayush/Ayush/1june/pdf_to_text/CS1.pdf', method
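A small sketch of how the extraction loop and the comma separation asked for in the title could be written, assuming the legacy PyPDF2 1.x reader methods used in the question. The helper names are hypothetical. Joining page texts with an explicit separator keeps the last word of one page from running into the first word of the next. It cannot restore spaces that extractText did not produce within a page.

```python
def extract_pages(reader):
    """Return the text of each page as a list of strings."""
    return [reader.getPage(i).extractText() for i in range(reader.getNumPages())]


def join_pages(page_texts, separator="\n"):
    """Join non-empty page texts with an explicit separator."""
    return separator.join(t.strip() for t in page_texts if t and t.strip())


def comma_separate(text):
    """Split on any whitespace and rejoin the words with ', '."""
    return ", ".join(text.split())
```

With a reader opened as in the question, comma_separate(join_pages(extract_pages(pdfReader))) would give one comma-separated string, provided the extracted text contains whitespace between words.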

How to extract specific text from a PDF file - Python

折月煮酒 submitted on 2019-12-12 19:19:46
Question: I am trying to extract this text: DLA LAND AND MARITIME ACTIVE DEVICES DIVISION PO BOX 3990 COLUMBUS OH 43218-3990 USA Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930 Email: Desmond.Forshey@dla.mil from this PDF file. I was able to extract some text between two references using the code below:
    import PyPDF2
    pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj1 = pdfReader.getPage(0)
    pagecontent
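The "text between two references" idea mentioned above can be sketched as a plain string helper that is independent of the PDF library. The function name and the choice of markers are assumptions for illustration; the question's own code is truncated, so this is not necessarily how the author did it.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()
```

Applied to the page text from the question, text_between(page_text, "Name:", "Buyer Code:") would return the buyer's name, assuming both labels survive text extraction unchanged.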

PyPDF2 returning blank PDF after copy

允我心安 submitted on 2019-12-12 12:27:03
Question:
    def EncryptPDFFiles(password, directory):
        pdfFiles = []
        success = 0
        # Get all PDF files from a directory
        for folderName, subFolders, fileNames in os.walk(directory):
            for fileName in fileNames:
                if (fileName.endswith(".pdf")):
                    pdfFiles.append(os.path.join(folderName, fileName))
        print("%s PDF documents found." % str(len(pdfFiles)))
        # Create an encrypted version for each document
        for pdf in pdfFiles:
            # Copy old PDF into a new PDF object
            pdfFile = open(pdf,"rb")
            pdfReader = PyPDF2.PdfFileReader
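A hedged sketch of the same find, copy, and encrypt flow, assuming the legacy PyPDF2 1.x API (addPage, encrypt, write). The function names are hypothetical. It copies pages one at a time and keeps the source file open until the output has been written. Since the question's code is truncated, this is not a diagnosis of the blank output. Unlike the original, the file search here ignores the case of the extension.

```python
import os


def find_pdfs(directory):
    """Return the paths of all PDF files below directory, sorted."""
    found = []
    for folder, _subfolders, names in os.walk(directory):
        for name in names:
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(folder, name))
    return sorted(found)


def encrypt_pdf(source_path, target_path, password):
    """Write a password-protected copy of source_path to target_path."""
    # Imported lazily so find_pdfs can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(source_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))
        writer.encrypt(password)
        with open(target_path, "wb") as dst:
            writer.write(dst)  # source is still open at this point
```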

how to break looping in tkinter?

守給你的承諾、 submitted on 2019-12-11 17:47:20
Question: This is my code.
    from PyPDF2 import PdfFileReader
    import tkinter as tk
    from tkinter import ttk
    from tkinter import filedialog
    root = tk.Tk()
    label_list = []
    def get_info(path):
        with open(path, 'rb') as f:
            pdf = PdfFileReader(f)
            info = pdf.getDocumentInfo()
            page = pdf.getPage(4)
            label_list[0].config(text = "Title")
            label_list[1].config(text = info.title)
            label_list[2].config(text = "Author")
            label_list[3].config(text = info.author)
            label_list[4].config(text = "Subject")
            label_list[5].config

Python: Numbering Pages in a PDF using PyPDF2 and io

帅比萌擦擦* submitted on 2019-12-11 11:26:58
Question: I am trying to add page numbers to an existing PDF file after the fact. I don't understand how this works. I pieced the code together from here and here. I keep running into a problem I can't seem to fix on my own, probably because I don't understand what is happening even after reading the PyPDF2 documentation.
    from PyPDF2 import PdfFileWriter, PdfFileReader
    import io
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import A4
    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=A4)
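The overlay approach that this snippet starts on can be sketched as follows, assuming the legacy PyPDF2 1.x API (mergePage) and reportlab's Canvas. The idea is to draw one overlay page per page number into an in-memory PDF, then merge overlay page n onto source page n. The function names, the 20-point bottom margin, and the assumption that every page is A4 are illustrative choices, not taken from the question.

```python
import io


def footer_position(page_width, margin=20):
    """Centre the number horizontally, `margin` points above the bottom edge."""
    return (page_width / 2.0, margin)


def build_number_overlay(num_pages, pagesize):
    """Return an in-memory PDF with one page per number, 1..num_pages."""
    from reportlab.pdfgen import canvas  # lazy import

    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=pagesize)
    x, y = footer_position(pagesize[0])
    for number in range(1, num_pages + 1):
        can.drawCentredString(x, y, str(number))
        can.showPage()  # finish this overlay page and start the next
    can.save()
    packet.seek(0)
    return packet


def number_pages(source_path, target_path):
    """Stamp a page number on every page (assumes all pages are A4)."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # lazy imports
    from reportlab.lib.pagesizes import A4

    with open(source_path, "rb") as src:
        reader = PdfFileReader(src)
        total = reader.getNumPages()
        overlay = PdfFileReader(build_number_overlay(total, A4))
        writer = PdfFileWriter()
        for index in range(total):
            page = reader.getPage(index)
            page.mergePage(overlay.getPage(index))  # draw the number on top
            writer.addPage(page)
        with open(target_path, "wb") as dst:
            writer.write(dst)
```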

PyPDF2 ignores content, gets watermark only

时间秒杀一切 submitted on 2019-12-11 07:29:33
Question: I have thousands of PDF files like this one. I'm trying to use PyPDF2 to convert them to plain text (code is below). But PyPDF2 apparently only "sees" the watermarks, not the content itself. What could I do here?
    import os
    import PyPDF2
    path_to_pdfs = '/path/to/pdf/files/'
    for filename in os.listdir(path_to_pdfs):
        if '.pdf' in filename.lower():
            with open(path_to_pdfs + filename, mode = 'rb') as f:
                txt = ''
                pdf_reader = PyPDF2.PdfFileReader(f)
                num_pages = pdf_reader.numPages
                for page in range

PyPDF2: Why does PdfFileWriter forget changes I made to a document?

末鹿安然 submitted on 2019-12-10 20:08:37
Question: I am trying to modify text in a PDF file. The text can be in an object of type Tj or BDC. I find the correct objects, and if I read them directly after changing them they show the updated values. But if I pass the complete page to PdfFileWriter, the change is lost. I might be updating a copy and not the real object: I checked the id() and it was different. Does anyone have an idea how to fix this?
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from PyPDF2.pdf import ContentStream
    from PyPDF2
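If the suspicion above is right and the edits land on a parsed copy, one commonly used pattern is to assign the modified ContentStream back to the page under /Contents before handing the page to the writer. The sketch below assumes the legacy PyPDF2 1.x API, in which ContentStream.operations is a list of (operands, operator) pairs with operators stored as bytes. The function names are hypothetical. It handles only plain Tj string operands and makes no attempt at font encodings, TJ arrays, or BDC operators.

```python
def replace_in_text_operands(operations, old, new, wrap=str):
    """Replace `old` with `new` in the string operand of every Tj operation.

    `wrap` converts the new string to the type the PDF library expects.
    Returns the number of operations changed.
    """
    changed = 0
    for operands, operator in operations:
        if operator == b"Tj" and operands and isinstance(operands[0], str):
            if old in operands[0]:
                operands[0] = wrap(operands[0].replace(old, new))
                changed += 1
    return changed


def rewrite_page_text(reader, page, old, new):
    """Edit a page's text operands and store the result back on the page."""
    # Imported lazily so the helper above can be used without PyPDF2 installed.
    from PyPDF2.generic import NameObject, TextStringObject
    from PyPDF2.pdf import ContentStream

    content = ContentStream(page["/Contents"].getObject(), reader)
    changed = replace_in_text_operands(
        content.operations, old, new, TextStringObject
    )
    # Without this assignment the page keeps pointing at the original stream.
    page[NameObject("/Contents")] = content
    return changed
```

After rewrite_page_text returns, passing the same page object to PdfFileWriter.addPage should write the edited stream rather than the original one.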