Parse annotations from a PDF

前端未结


I want a Python function that takes a PDF and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net

8 Answers

说谎

2020-11-28 22:07

The pdfannots script (pdfannots.py) can extract annotations from PDFs. It is built on PDFMiner.six and produces Markdown output for both the highlighted text and any annotations made on it, such as comments on highlighted areas or popup boxes. The output looks similar to this:

 * Page 2 Highlight:
 > Underlying text that was highlighted

 Comment made on highlighted text.

 * Page 3 Highlight: "Short highlighted text" -- Short comment.

 * Page 4 Text: A note on the page.
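Since the question asks for a list of note texts, here is a rough sketch of how the `Text:` entries could be pulled out of output shaped like the sample above. The regular expression and the `note_texts` helper are my own assumptions based only on that sample, not part of the script, and a note that wraps onto several lines would need extra handling.

```python
import re
from typing import List

# Assumed line shape, copied from the sample output above:
#   " * Page 4 Text: A note on the page."
NOTE_LINE = re.compile(r"^\s*\*\s+Page\s+(\d+)\s+Text:\s*(.*)$")


def note_texts(markdown: str) -> List[str]:
    """Return the text of every single-line 'Text' (note) entry."""
    notes = []
    for line in markdown.splitlines():
        match = NOTE_LINE.match(line)
        if match:
            # Group 2 is everything after "Text:"; highlights are skipped.
            notes.append(match.group(2).strip())
    return notes


sample = """\
 * Page 2 Highlight:
 > Underlying text that was highlighted

 Comment made on highlighted text.

 * Page 3 Highlight: "Short highlighted text" -- Short comment.

 * Page 4 Text: A note on the page.
"""

print(note_texts(sample))  # ['A note on the page.']
```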

The full command options can be seen below.

usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
                    [--print-filename] [-w COLS]
                    INFILE [INFILE ...]

Extracts annotations from a PDF file in markdown format for use in reviewing.

positional arguments:
  INFILE                PDF files to process

optional arguments:
  -h, --help            show this help message and exit

Basic options:
  -p, --progress        emit progress information
  -o OUTFILE            output file (default is stdout)
  -n COLS, --cols COLS  number of columns per page in the document (default: 2)

Options controlling output format:
  -s [SEC [SEC ...]], --sections [SEC [SEC ...]]
                        sections to emit (default: highlights, comments, nits)
  --no-group            emit annotations in order, don't group into sections
  --print-filename      print the filename when it has annotations
  -w COLS, --wrap COLS  wrap text at this many output columns

I haven't tried this out extensively, but it has been working well so far!
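To call the script from Python rather than the shell, something along these lines might work. It is only a sketch: the `build_command` and `extract_markdown` helpers are hypothetical names of mine, the script is assumed to sit at `pdfannots.py` in the working directory, and only flags listed in the usage text above are used.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(pdf_path: str, script: str = "pdfannots.py",
                  cols: Optional[int] = None,
                  no_group: bool = False) -> List[str]:
    """Assemble an argv list using only flags from the usage text."""
    cmd = [sys.executable, script]  # script location is an assumption
    if cols is not None:
        cmd += ["-n", str(cols)]    # -n COLS: columns per page
    if no_group:
        cmd.append("--no-group")    # emit annotations in document order
    cmd.append(pdf_path)            # INFILE goes last, as in the usage line
    return cmd


def extract_markdown(pdf_path: str, **options) -> str:
    """Run the script and return its stdout (the Markdown report)."""
    result = subprocess.run(build_command(pdf_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Building the command has no side effects, so it is safe to inspect.
print(build_command("paper.pdf", cols=1, no_group=True))
```

`extract_markdown("paper.pdf", cols=1)` would then return the report as a string, raising `subprocess.CalledProcessError` if the script exits with an error.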


广开言路

2020-11-28 22:08

It turns out the python-poppler bindings were incomplete. This has now been fixed: https://bugs.launchpad.net/poppler-python/+bug/397850
