Read, highlight, and save a PDF programmatically

挽巷 2021-01-04 11:37

I'd like to write a small script (which will run on a headless Linux server) that reads a PDF, highlights text that matches anything in an array of strings that I pass, the

3 Answers
  • 2021-01-04 11:52

    PDFlib has Python bindings and supports these operations. You will want PDFlib+PDI to open an existing PDF (http://www.pdflib.com/products/pdflib-family/pdflib-pdi/), along with TET.

    Unfortunately, it is a commercial product. I have used this library in production in the past and it works great. The bindings are very function-oriented and not very Pythonic. I have seen some attempts to make them more Pythonic: https://github.com/alexhayes/pythonic-pdflib. The call you will want is open_pdi_document().

    It sounds like you will want to do search highlighting of some sort:

    http://www.pdflib.com/tet-cookbook/tet-and-pdflib/highlight-search-terms/

  • 2021-01-04 11:53

    Have you tried looking at PDFMiner? It sounds like it does what you want.

  • 2021-01-04 12:01

    Yes, it is possible with a combination of pdfminer (pip install pdfminer.six) and PyPDF2.

    First, find the coordinates of the text to highlight (pdfminer reports a bounding box for each piece of text it extracts). Then highlight it with PyPDF2:

    #!/usr/bin/env python
    
    """Create sample highlight in a PDF file."""
    
    # PdfFileWriter/PdfFileReader are the legacy PyPDF2 class names (removed in PyPDF2 3.0).
    from PyPDF2 import PdfFileWriter, PdfFileReader
    
    from PyPDF2.generic import (
        DictionaryObject,
        NumberObject,
        FloatObject,
        NameObject,
        TextStringObject,
        ArrayObject
    )
    
    
    def create_highlight(x1, y1, x2, y2, meta, color=(0, 1, 0)):
        """
        Create a highlight for a PDF.
    
        Parameters
        ----------
        x1, y1 : float
            bottom left corner
        x2, y2 : float
            top right corner
        meta : dict
            keys are "author" and "contents"
        color : iterable
            Three elements, (r,g,b)
        """
        new_highlight = DictionaryObject()
    
        new_highlight.update({
            NameObject("/F"): NumberObject(4),
            NameObject("/Type"): NameObject("/Annot"),
            NameObject("/Subtype"): NameObject("/Highlight"),
    
            NameObject("/T"): TextStringObject(meta["author"]),
            NameObject("/Contents"): TextStringObject(meta["contents"]),
    
            NameObject("/C"): ArrayObject([FloatObject(c) for c in color]),
            NameObject("/Rect"): ArrayObject([
                FloatObject(x1),
                FloatObject(y1),
                FloatObject(x2),
                FloatObject(y2)
            ]),
            NameObject("/QuadPoints"): ArrayObject([
                FloatObject(x1),
                FloatObject(y2),
                FloatObject(x2),
                FloatObject(y2),
                FloatObject(x1),
                FloatObject(y1),
                FloatObject(x2),
                FloatObject(y1)
            ]),
        })
    
        return new_highlight
    
    
    def add_highlight_to_page(highlight, page, output):
        """
        Add a highlight to a PDF page.
    
        Parameters
        ----------
        highlight : Highlight object
        page : PDF page object
        output : PdfFileWriter object
        """
        highlight_ref = output._addObject(highlight)
    
        if "/Annots" in page:
            page[NameObject("/Annots")].append(highlight_ref)
        else:
            page[NameObject("/Annots")] = ArrayObject([highlight_ref])
    
    
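    def main():
        # Keep the input file open until the output is written: PyPDF2 reads
        # page content lazily from the input stream.
        with open("samples/test3.pdf", "rb") as input_stream:
            pdf_input = PdfFileReader(input_stream)
            pdf_output = PdfFileWriter()
    
            page1 = pdf_input.getPage(0)
    
            # Coordinates are in PDF user space (points, origin at bottom left).
            highlight = create_highlight(89.9206, 573.1283, 376.849, 591.3563, {
                "author": "John Doe",
                "contents": "Lorem ipsum"
            })
    
            add_highlight_to_page(highlight, page1, pdf_output)
    
            pdf_output.addPage(page1)
    
            with open("output.pdf", "wb") as output_stream:
                pdf_output.write(output_stream)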
    
    
    if __name__ == '__main__':
        main()
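
The script above hard-codes one rectangle. The "find the coordinates" step could be sketched roughly as follows, assuming pdfminer.six is installed. The helper names `match_boxes` and `iter_text_lines` are hypothetical, the matching is case-insensitive by assumption, and the result is the bounding box of each whole text line that contains a search string, not of the matched substring itself.

```python
from typing import Iterable, Iterator, List, Tuple

# (x1, y1, x2, y2) in PDF points; pdfminer reports boxes with the origin
# at the bottom left of the page.
BBox = Tuple[float, float, float, float]


def match_boxes(lines: Iterable[Tuple[str, BBox]], needles: List[str]) -> List[BBox]:
    """Return the bbox of every line containing any needle (case-insensitive).

    Hypothetical helper: it works on plain (text, bbox) pairs, so it does
    not depend on pdfminer itself.
    """
    # Drop empty strings, which would otherwise match every line.
    lowered = [n.lower() for n in needles if n]
    return [bbox for text, bbox in lines
            if any(n in text.lower() for n in lowered)]


def iter_text_lines(pdf_path: str, page_index: int = 0) -> Iterator[Tuple[str, BBox]]:
    """Yield (text, bbox) for each text line on one page (hypothetical helper)."""
    # Imported lazily so match_boxes() stays usable without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox, LTTextLine

    for number, layout in enumerate(extract_pages(pdf_path)):
        if number != page_index:
            continue
        for element in layout:
            if isinstance(element, LTTextBox):
                for line in element:
                    if isinstance(line, LTTextLine):
                        yield line.get_text(), line.bbox
```

Each box returned by `match_boxes(iter_text_lines("samples/test3.pdf"), ["lorem"])` could then be passed on as `create_highlight(*bbox, meta)`. Highlighting only the matched words would need a per-character pass over the `LTChar` objects inside each line, which this sketch leaves out.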
    