Parse annotations from a PDF

攒了一身酷 2020-11-28 21:15

I want a Python function that takes a PDF and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net

8 answers
  • 2020-11-28 22:07

    The pdfannots script can extract annotations from PDFs. It is built on PDFMiner.six and produces Markdown output covering both the highlighted text and any annotations made on it, such as comments on highlighted areas or popup notes. The output looks similar to this:

     * Page 2 Highlight:
     > Underlying text that was highlighted
    
     Comment made on highlighted text.
    
     * Page 3 Highlight: "Short highlighted text" -- Short comment.
    
     * Page 4 Text: A note on the page.
    

    The full command options can be seen below.

    usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
                        [--print-filename] [-w COLS]
                        INFILE [INFILE ...]
    
    Extracts annotations from a PDF file in markdown format for use in reviewing.
    
    positional arguments:
      INFILE                PDF files to process
    
    optional arguments:
      -h, --help            show this help message and exit
    
    Basic options:
      -p, --progress        emit progress information
      -o OUTFILE            output file (default is stdout)
      -n COLS, --cols COLS  number of columns per page in the document (default: 2)
    
    Options controlling output format:
      -s [SEC [SEC ...]], --sections [SEC [SEC ...]]
                            sections to emit (default: highlights, comments, nits)
      --no-group            emit annotations in order, don't group into sections
      --print-filename      print the filename when it has annotations
      -w COLS, --wrap COLS  wrap text at this many output columns
    

    I haven't tried this out extensively, but it has been working well so far!
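
To get from that Markdown output to the Python list the question asks for, one option is to run the script and keep only the `Page N Text:` entries. The following is a minimal sketch, not part of pdfannots itself: the helper names (`build_command`, `parse_notes`, `note_annotations`) are made up for illustration, the script is assumed to be reachable as `pdfannots.py`, and the parser assumes each note is printed on a single line exactly as in the sample output above, so multi-line notes would need extra handling.

```python
import re
import subprocess
import sys

# Matches single-line entries such as ' * Page 4 Text: A note on the page.'
# (format assumed from the sample output shown in this answer).
NOTE_LINE = re.compile(r"^\s*\*\s+Page\s+(\d+)\s+Text:\s*(.*)$")


def build_command(pdf_path, script="pdfannots.py"):
    """Build the argv list for one PDF; the script path is an assumption."""
    return [sys.executable, script, pdf_path]


def parse_notes(markdown):
    """Return (page, text) pairs for every 'Page N Text:' line."""
    notes = []
    for line in markdown.splitlines():
        match = NOTE_LINE.match(line)
        if match:
            notes.append((int(match.group(1)), match.group(2).strip()))
    return notes


def note_annotations(pdf_path, script="pdfannots.py"):
    """Run the script on one PDF and return the text of its note annotations."""
    result = subprocess.run(
        build_command(pdf_path, script),
        capture_output=True,  # the script writes to stdout by default
        text=True,
        check=True,           # raise if the script exits with an error
    )
    return [text for _page, text in parse_notes(result.stdout)]
```

Calling `note_annotations("paper.pdf")` would then return a plain list of strings. Highlights and their comments are skipped on purpose, since the question only asks for note annotations.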

  • 2020-11-28 22:08

    It turns out the poppler-python bindings were incomplete. This is now fixed: https://bugs.launchpad.net/poppler-python/+bug/397850
