I want a Python function that takes a PDF and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net
The pdfannots script can extract annotations from PDFs. It is built on PDFMiner.six and produces Markdown output for both the highlighted text and any annotations made on it, such as comments on highlighted areas or popup boxes. The output looks similar to this:
* Page 2 Highlight:
> Underlying text that was highlighted
Comment made on highlighted text.
* Page 3 Highlight: "Short highlighted text" -- Short comment.
* Page 4 Text: A note on the page.
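Since the question asks for a Python list of the note text, here is a minimal sketch that pulls the `Text:` entries out of output shaped like the sample above. It assumes each note sits on a single line of the form `* Page N Text: ...`; that shape is inferred from the sample only, and `note_texts` is a name made up for illustration, not part of pdfannots.

```python
import re

# Assumed line shape, taken from the sample output only:
#   "* Page 4 Text: A note on the page."
_NOTE_LINE = re.compile(r"^\s*\*\s+Page\s+(\d+)\s+Text:\s*(.*\S)\s*$")


def note_texts(markdown):
    """Return the text of every single-line 'Text' (note) entry."""
    notes = []
    for line in markdown.splitlines():
        match = _NOTE_LINE.match(line)
        if match:
            # group(1) is the page number, group(2) is the note text
            notes.append(match.group(2))
    return notes


sample = """\
* Page 2 Highlight:
> Underlying text that was highlighted
Comment made on highlighted text.
* Page 3 Highlight: "Short highlighted text" -- Short comment.
* Page 4 Text: A note on the page.
"""

print(note_texts(sample))  # → ['A note on the page.']
```

Highlights and their comments are skipped on purpose; notes that wrap over several lines would need extra handling that this sketch does not attempt.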
The full command options can be seen below.
usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
                    [--print-filename] [-w COLS]
                    INFILE [INFILE ...]

Extracts annotations from a PDF file in markdown format for use in reviewing.

positional arguments:
  INFILE                PDF files to process

optional arguments:
  -h, --help            show this help message and exit

Basic options:
  -p, --progress        emit progress information
  -o OUTFILE            output file (default is stdout)
  -n COLS, --cols COLS  number of columns per page in the document (default: 2)

Options controlling output format:
  -s [SEC [SEC ...]], --sections [SEC [SEC ...]]
                        sections to emit (default: highlights, comments, nits)
  --no-group            emit annotations in order, don't group into sections
  --print-filename      print the filename when it has annotations
  -w COLS, --wrap COLS  wrap text at this many output columns
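If you want to call the script from Python rather than a shell, something like the following might work. It is only a sketch: the flags are the ones listed in the help text above, but `pdfannots_command` and `run_pdfannots` are hypothetical helper names, and it assumes `pdfannots.py` is executable and on your PATH.

```python
import subprocess


def pdfannots_command(infiles, outfile=None, cols=None, wrap=None,
                      group=True, script="pdfannots.py"):
    """Build an argv list using only flags shown in the help text."""
    argv = [script]  # assumption: the script is reachable under this name
    if outfile is not None:
        argv += ["-o", str(outfile)]
    if cols is not None:
        argv += ["-n", str(cols)]
    if wrap is not None:
        argv += ["-w", str(wrap)]
    if not group:
        argv.append("--no-group")
    # Positional INFILE arguments go last
    argv += [str(path) for path in infiles]
    return argv


def run_pdfannots(infiles, **options):
    """Run the script and return whatever it writes to stdout."""
    result = subprocess.run(pdfannots_command(infiles, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout


print(pdfannots_command(["paper.pdf"], cols=1, group=False))
# → ['pdfannots.py', '-n', '1', '--no-group', 'paper.pdf']
```

`-s` is left out deliberately: it accepts a variable number of values, so placing it directly before the input files could swallow them.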
I haven't tried this out extensively, but it has been working well so far!
Turns out the bindings were incomplete. It is now fixed. https://bugs.launchpad.net/poppler-python/+bug/397850