Search and replace placeholder text in PDF with Python

情歌与酒 2021-02-14 04:52

I need to generate a customized PDF copy of a template document. The easiest way - I thought - was to create a source PDF that has some placeholder text where customization need

3 Answers
  • 2021-02-14 05:25

    There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
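
    To make the character-by-character point concrete, here is a minimal sketch. The content-stream fragment is hand-written for illustration (not taken from a real file), and `naive_contains` and `tj_text` are hypothetical helper names. It shows a placeholder split across kerning chunks inside a `TJ` array, so a plain byte search misses it even though the rendered text contains it.

```python
import re

# Hand-written fragment of a PDF content stream (an assumption for illustration).
# The TJ operator takes an array of string chunks and kerning adjustments,
# so the visible text "Dear {{NAME}}," is stored in three separate pieces.
content_stream = b"BT /F1 12 Tf 72 720 Td [(Dear {{NA) -15 (ME}) 20 (},)] TJ ET"


def naive_contains(stream: bytes, needle: bytes) -> bool:
    """Plain byte search, the way a naive search-and-replace would look."""
    return needle in stream


def tj_text(stream: bytes) -> str:
    """Join the literal strings inside every TJ array, dropping kerning numbers.

    Only handles simple, uncompressed streams without escaped parentheses.
    """
    pieces = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", stream, flags=re.S):
        pieces.extend(re.findall(rb"\((.*?)\)", array))
    return b"".join(pieces).decode("latin-1")


print(naive_contains(content_stream, b"{{NAME}}"))  # the raw bytes never contain the placeholder
print(tj_text(content_stream))
```

    Real files usually compress their streams and may use font-specific encodings, so this is not a general extractor. It only illustrates why in-place replacement is fragile.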

    If that's not an option, you can create a PDF form in something like Acrobat, then fill it in with a PDF manipulation library like iText (AGPL) or PDFBox, which has a nice Clojure wrapper called pdfboxing that can handle some of that.

    In my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for an iText license if you're using this for commercial purposes. I've had pretty good results writing Python wrappers around PDF-manipulation CLI tools like pdfboxing and Ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
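
    A thin wrapper of that kind might look like the sketch below. It assumes the Ghostscript binary is installed and on the PATH as `gs`. The function names `build_gs_command` and `rewrite_pdf` are made up for this example. It only re-writes a PDF through Ghostscript's `pdfwrite` device and does not replace any text itself. The point is the wrapping pattern.

```python
import subprocess
from pathlib import Path


def build_gs_command(src, dst, gs_binary="gs"):
    """Build the argument list for re-writing a PDF with Ghostscript's pdfwrite device."""
    return [
        gs_binary,
        "-q",                 # suppress startup messages
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when processing is finished
        "-sDEVICE=pdfwrite",  # produce a PDF as output
        f"-sOutputFile={Path(dst)}",
        str(Path(src)),
    ]


def rewrite_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_command(src, dst, gs_binary), check=True)


# Only print the command here; call rewrite_pdf() when Ghostscript is available.
print(build_gs_command("template.pdf", "copy.pdf"))
```

    Keeping the command construction separate from the `subprocess.run` call makes the wrapper easy to test without having the CLI tool installed.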

  • 2021-02-14 05:33

    As another solution you may try Aspose.PDF Cloud SDK for Python, it provides the feature to replace text in a PDF document.

    First, install the Aspose.PDF Cloud SDK for Python:

    pip install asposepdfcloud
    

    The sample code below uploads a PDF file to your cloud storage and replaces multiple strings in the document:

    import os 
    import asposepdfcloud 
    from asposepdfcloud.apis.pdf_api import PdfApi 
     
    # Get App key and App SID from https://aspose.cloud 
    pdf_api_client = asposepdfcloud.api_client.ApiClient( 
        app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 
        app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxx') 
     
    pdf_api = PdfApi(pdf_api_client) 
    filename = '02_pages.pdf' 
    remote_name = '02_pages.pdf' 
     
    # Upload the PDF file to cloud storage
    pdf_api.upload_file(remote_name, filename)

    # Replace text
    text_replace1 = asposepdfcloud.models.TextReplace(old_value='origami',new_value='aspose',regex='true') 
    text_replace2 = asposepdfcloud.models.TextReplace(old_value='candy',new_value='biscuit',regex='true') 
    text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace1,text_replace2]) 
     
    response = pdf_api.post_document_text_replace(remote_name, text_replace_list) 
    print(response)
    

    I'm a developer evangelist at Aspose.

  • 2021-02-14 05:44

    There is no definitive solution, but I found two approaches that work most of the time.

    In Python, https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:

    import re
    import pdf_redactor

    options = pdf_redactor.RedactorOptions()

    # Each content filter is a (compiled regex, replacement function) pair.
    # The filters are applied to the text of the PDF in order.
    options.content_filters = [
            # First replace a literal name with X's.
            (
                    re.compile(u"Tom Xavier"),
                    lambda m : "XXXXXXX"
            ),

            # Then redact things that look like social security numbers (SSNs).
            # See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
            (
                    re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
                    lambda m : "XXX-XX-XXXX"
            ),
    ]

    # Perform the redaction using PDF on standard input and writing to standard output.
    pdf_redactor.redactor(options)
    

    The full example can be found in the pdf-redactor repository linked above.
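
    For the placeholder use case in the question, the same filter shape (a compiled regex paired with a replacement function) can look up values instead of writing X's. The sketch below applies such filters to a plain string with the standard library only. It mimics the shape of the filters, not pdf-redactor's internals, and the `{{NAME}}` placeholder syntax, the `values` dictionary, and the `apply_filters` helper are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder values for one customized copy.
values = {"NAME": "Tom Xavier", "DATE": "2021-02-14"}

# Same shape as the filters above: (compiled regex, replacement function).
content_filters = [
    (
        re.compile(r"\{\{(\w+)\}\}"),
        # Unknown placeholders are left untouched rather than blanked out.
        lambda m: values.get(m.group(1), m.group(0)),
    ),
]


def apply_filters(text, filters):
    """Apply each (pattern, replacement) pair to the text in order."""
    for pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return text


result = apply_filters("Dear {{NAME}}, issued {{DATE}}. Ref {{UNKNOWN}}.", content_filters)
print(result)
```

    Replacement text that is longer or shorter than the placeholder will not re-flow the surrounding paragraph in an actual PDF, so short placeholders on their own line are the safest bet.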

    In Ruby, https://github.com/gettalong/hexapdf works for blacking out text. Example code:

    require 'hexapdf'
    
    class ShowTextProcessor < HexaPDF::Content::Processor
    
      def initialize(page, to_hide_arr)
        super()
        @canvas = page.canvas(type: :overlay)
        @to_hide_arr = to_hide_arr
      end
    
      def show_text(str)
        boxes = decode_text_with_positioning(str)
        return if boxes.string.empty?
        if @to_hide_arr.include? boxes.string
            # The rectangles are filled, so set the fill color (black).
            @canvas.fill_color(0, 0, 0)
    
            boxes.each do |box|
              x, y = *box.lower_left
              tx, ty = *box.upper_right
              @canvas.rectangle(x, y, tx - x, ty - y).fill
            end
        end
    
      end
      alias :show_text_with_positioning :show_text
    
    end
    
    file_name = ARGV[0]
    strings_to_black = ARGV[1].split("|")
    
    doc = HexaPDF::Document.open(file_name)
    puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
    doc.pages.each.with_index do |page, index|
      processor = ShowTextProcessor.new(page, strings_to_black)
      page.process_contents(processor)
    end
    
    new_file_name = "#{file_name.split('.').first}_updated.pdf"
    doc.write(new_file_name, optimize: true)
    
    puts "Writing updated file [#{new_file_name}]."
    

    This blacks out the matched text visually, but the underlying text is still in the file and becomes visible when selected.
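
    The rectangle arithmetic in that loop (`x, y, tx - x, ty - y`) can also be applied once per matched string instead of once per character. Here is a small Python sketch of that geometry. It assumes axis-aligned boxes given as `(lower_left, upper_right)` corner pairs, and the name `covering_rectangle` is hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, Point]  # (lower_left, upper_right)


def covering_rectangle(boxes: Iterable[Box]) -> Tuple[float, float, float, float]:
    """Return (x, y, width, height) of the smallest rectangle covering all boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    x = min(lower_left[0] for lower_left, _ in boxes)
    y = min(lower_left[1] for lower_left, _ in boxes)
    tx = max(upper_right[0] for _, upper_right in boxes)
    ty = max(upper_right[1] for _, upper_right in boxes)
    return (x, y, tx - x, ty - y)


# Two adjacent character boxes on the same baseline.
rect = covering_rectangle([((10, 100), (16, 112)), ((16, 100), (23, 112))])
print(rect)
```

    Rotated or skewed text would need the full corner coordinates rather than two opposite corners, so treat this as a sketch for ordinary horizontal text.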
