How to erase text from PDF using Python

守給你的承諾、 submitted on 2020-03-22 04:27:51

Question


I'm creating a Python script to edit text in PDFs.

I have this Python code which allows me to add text into specific positions of a PDF file.

import io

import PyPDF2
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

packet = io.BytesIO()
# Create a new PDF with ReportLab
can = canvas.Canvas(packet, pagesize=letter)
# Draw the text at a specific position (x, y in points from the bottom-left corner)
can.drawString(300, 115, "Hello world")
can.save()
# Move to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PyPDF2.PdfFileReader(packet)

# Read the existing PDF; the source file must stay open until the output is written
with open("original.pdf", "rb") as source, open("output.pdf", "wb") as output_stream:
    existing_pdf = PyPDF2.PdfFileReader(source)
    num_pages = existing_pdf.getNumPages()
    output = PyPDF2.PdfFileWriter()
    # Merge the "watermark" (the new PDF) onto the last page of the original PDF
    page = existing_pdf.getPage(num_pages - 1)
    page.mergePage(new_pdf.getPage(0))
    # Add all pages from the original PDF to the output PDF
    for n in range(num_pages):
        output.addPage(existing_pdf.getPage(n))
    # Finally, write "output" to a real file
    output.write(output_stream)

My problem: I want to replace the text at a specific position in my original PDF with my custom text. A way of writing blank characters over it would do the trick, but I couldn't find anything that does this.

P.S.: It must be Python code because I will need to deploy this as an .exe file later, and I only know how to do that with Python.


Answer 1:


A general-purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can't ever be done, because I've demonstrated doing so with the Adobe PDF Library, albeit with a very simple input file with no complications, but I'm not sure that PyPDF2 has the facilities required to do so. In part, just finding the text can be a challenge.

You (or, more realistically, your PDF library) have to parse the page contents and keep track of changes to the graphics state: specifically, changes to the current transformation matrix (in case the text is in a Form XObject), to the text matrix, and to the font. You have to use the font resource to get character widths in order to work out where the text cursor will be positioned after a string is shown. You may also need to handle the standard-14 fonts, which don't carry that information in their font resources (the application, that is, your program, is expected to know their metrics).
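As a rough illustration of the width bookkeeping described above, the sketch below computes how far the text cursor moves after a string is shown. It follows the horizontal displacement rule from the PDF specification: glyph width in thousandths of a text-space unit, scaled by the font size, plus character and word spacing, all multiplied by the horizontal scaling. The function name `text_advance` and the `sample_widths` table are hypothetical; a real implementation would read the widths from the font resource or from built-in standard-14 metrics, and this sketch assumes a simple single-byte font.

```python
def text_advance(text, widths, font_size, char_spacing=0.0,
                 word_spacing=0.0, horizontal_scaling=100.0):
    """Return the horizontal displacement, in text-space units, of showing `text`.

    `widths` maps each character to its glyph width in thousandths of a unit.
    It is a hypothetical lookup table; real values come from the font.
    """
    th = horizontal_scaling / 100.0
    total = 0.0
    for ch in text:
        w0 = widths[ch] / 1000.0
        # Word spacing applies only to the single-byte space character.
        tw = word_spacing if ch == " " else 0.0
        total += (w0 * font_size + char_spacing + tw) * th
    return total


# Illustrative widths only; do not treat them as metrics of a specific font.
sample_widths = {"H": 722, "i": 222, " ": 278}
advance = text_advance("Hi", sample_widths, font_size=12)
```

The same number is what a library would need in order to know where the text following "Hi" begins on the page.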

After all that, removing the text is easy if you don't need to break up a Tj or TJ (show text) instruction into different parts. Keeping the text that follows from shifting, if that's what's desired, may require inserting a new Tm instruction to reposition that text where it would have been.
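One way to keep the following text in place, offered here as an alternative to the new Tm instruction mentioned above and not as the answerer's method, is to replace the removed string with a TJ array that contains only a position adjustment. A number inside a TJ array moves the cursor by that many thousandths of a text-space unit, scaled by the font size, with negative values moving it to the right. The sketch below assumes a hypothetical in-memory representation of the content stream as a list of `(operands, operator)` tuples, a simple single-byte font, and a caller that already knows the font size and spacing in effect.

```python
def blank_out_show_text(operations, index, widths, font_size,
                        char_spacing=0.0, word_spacing=0.0):
    """Return a copy of `operations` with the Tj at `index` replaced by a
    TJ that shows nothing but advances the cursor by the same amount.

    `operations` is a hypothetical list of (operands, operator) tuples and
    `widths` maps characters to glyph widths in thousandths of a unit.
    """
    operands, operator = operations[index]
    if operator != "Tj":
        raise ValueError("expected a Tj operation at the given index")
    adjustment = 0.0
    for ch in operands[0]:
        tw = word_spacing if ch == " " else 0.0
        # Negative TJ numbers move the cursor right, like the glyph would have.
        adjustment -= widths[ch] + 1000.0 * (char_spacing + tw) / font_size
    patched = list(operations)
    patched[index] = ([[adjustment]], "TJ")
    return patched


# Illustrative widths and operations only.
widths = {"H": 722, "i": 222, " ": 278}
ops = [
    (["/F1", 12], "Tf"),
    ([72, 700], "Td"),
    (["Hi "], "Tj"),     # the text to remove
    (["there"], "Tj"),   # should stay where it was
]
patched_ops = blank_out_show_text(ops, 2, widths, font_size=12)
```

This only covers the easy case the answer describes, where a whole Tj is removed; splitting a string or handling composite fonts needs considerably more work.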

Inserting new text can be challenging. If you want to stay consistent with the font being used and it is embedded and subset, it may not necessarily contain the glyphs you need for your text insertion. And after insertion, you then have to decide whether you need to reflow the text that comes after the text you inserted.
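The subset-font problem can be checked before attempting an insertion. The sketch below is a minimal illustration under the assumption that you have already obtained the set of characters the embedded subset can render; how to derive that set from a real font is outside its scope, and `missing_glyphs` and `subset_chars` are hypothetical names.

```python
def missing_glyphs(text, available_chars):
    """Return the characters of `text` that are not in `available_chars`,
    in first-seen order and without duplicates."""
    missing = []
    for ch in text:
        if ch not in available_chars and ch not in missing:
            missing.append(ch)
    return missing


# Suppose the original page only ever showed "Hello world" in this font,
# so the embedded subset contains just these characters (an assumption).
subset_chars = set("Hello world")

print(missing_glyphs("Hello world", subset_chars))  # []
print(missing_glyphs("Goodbye", subset_chars))      # ['G', 'b', 'y']
```

If the list is not empty, the replacement text would have to use a different font, or the full font would have to be embedded.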

And lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's redaction features would likely be more cost-effective than trying to program this from scratch.




Answer 2:


If you want to do a poor man's redaction with ReportLab and PyPDF2, you would create your replacement content with ReportLab. Given a Canvas, a rectangle indicating an area, a text string and a point where the text string would be inserted you would then:

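# x, y, width, height, text_x, text_y and text are placeholders for your own values
# Set the fill color to white
c.setFillColorRGB(1, 1, 1)
# Draw a filled rectangle over the area to hide; stroke=0 omits the black outline
c.rect(x, y, width, height, fill=1, stroke=0)
# Switch the fill color back to black for the new text
c.setFillColorRGB(0, 0, 0)
c.drawString(text_x, text_y, text)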

Save the PDF document you've created to a temporary file. Open it, and the document you want to modify (srcDoc), using PyPDF2's PdfFileReader. Create a PdfFileWriter object and call it modifiedDoc. Get page 0 of the temporary PDF and call it updatePage. Get page n of the other document and call it toModifyPage.

toModifyPage.mergePage(updatePage)

After you are done updating pages:

modifiedDoc.cloneDocumentFromReader(srcDoc)
modifiedDoc.write(outStream)

If you go this route, keep in mind that a user might still see the original text before it gets covered up by the new content, and that text extraction would likely pull out both the original and the new text for that area, possibly intermingled into something unintelligible.
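The extraction caveat is easy to see on a toy content stream. The sketch below is a deliberately simplified illustration: the two stream fragments are made up, and the regular expression only handles plain `(string) Tj` operations with no escaped parentheses, hex strings, or TJ arrays. It shows that covering text with a white rectangle leaves the original show-text instruction in the page.

```python
import re

# Matches simple "(string) Tj" operations only (an intentional simplification).
SHOW_TEXT = re.compile(r"\((.*?)\)\s*Tj")


def extract_shown_strings(content_stream):
    """Return the strings shown by simple Tj operations, in stream order."""
    return SHOW_TEXT.findall(content_stream)


# Made-up fragments: the original page content and the overlay drawn on top.
original = "BT /F1 12 Tf 300 115 Td (Old value) Tj ET"
overlay = (
    "1 1 1 rg 295 110 80 16 re f "  # white filled rectangle
    "0 0 0 rg BT /F1 12 Tf 300 115 Td (New value) Tj ET"
)
merged = original + "\n" + overlay

print(extract_shown_strings(merged))  # ['Old value', 'New value']
```

Anything that reads the page content, from copy and paste to a search index, can therefore still recover the covered text.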



Source: https://stackoverflow.com/questions/45144293/how-to-erase-text-from-pdf-using-python
