Parse annotations from a pdf

攒了一身酷 2020-11-28 21:15

I want a python function that takes a pdf and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net

8 answers
  • 2020-11-28 21:44

    Somebody asked a similar question. I tried the code sample there and it did not work for me until I made a few functional and cosmetic changes.

    #!/usr/bin/ruby
    
    require 'pdf-reader'
    
    ARGV.each do |filename|
      PDF::Reader.open(filename) do |reader|
        puts "file: #{filename}"
        puts "page\tcomment"
        reader.pages.each do |page|
          annots_ref = page.attributes[:Annots]
          if annots_ref
            actual_annots = annots_ref.map { |a| reader.objects[a] }
            actual_annots.each do |actual_annot|
              unless actual_annot[:Contents].nil?
                puts "#{page.number}\t#{actual_annot[:Contents]}"
              end
            end
          end
        end       
      end
    end
    

    If saved as pdfannot.rb, chmod +x'ed and placed into your favourite PATH directory, usage is:

    ./pdfannot.rb <path>
    

    First time writing/editing/remixing Ruby code, so very open for suggestions. HTH.

    On a side note, finding this question earlier could have saved me from double work. Hopefully this question gets more attention in the future such that it is easier to find.

  • 2020-11-28 21:46

    Just in case somebody is looking for some working code. Here is a script I use.

    # Python 2 script (print statements, urllib.pathname2url)
    import poppler
    import sys
    import urllib
    import os
    
    def main():
        input_filename = sys.argv[1]
        # the document is opened from a file:// URI, see http://blog.hartwork.org/?p=612
        document = poppler.document_new_from_file('file://%s' % \
            urllib.pathname2url(os.path.abspath(input_filename)), None)
        n_pages = document.get_n_pages()
        all_annots = 0
    
        for i in range(n_pages):
            page = document.get_page(i)
            annot_mappings = page.get_annot_mapping()
            for annot_mapping in annot_mappings:
                annot = annot_mapping.annot
                # skip link annotations, only report comments and markup
                if annot.get_annot_type().value_name != 'POPPLER_ANNOT_LINK':
                    all_annots += 1
                    print 'page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(
                        i + 1, annot.get_modified(),
                        annot.get_annot_type().value_nick, annot.get_contents())
    
        if all_annots > 0:
            print str(all_annots) + " annotation(s) found"
        else:
            print "no annotations found"
    
    if __name__ == "__main__":
        main()
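
    The script above builds the file:// URI by hand with Python 2's urllib.pathname2url. As a minimal sketch of the same step in Python 3, the standard library's pathlib can produce an absolute, percent-encoded URI. The helper name file_uri is hypothetical, and whether a particular poppler binding accepts the result is an assumption that this sketch does not check.

```python
from pathlib import Path


def file_uri(filename):
    """Return an absolute file:// URI for a local path, percent-encoding as needed."""
    # resolve() makes the path absolute; as_uri() adds the scheme and escapes spaces etc.
    return Path(filename).resolve().as_uri()


uri = file_uri("my notes.pdf")
print(uri)
```

    The file does not have to exist for the URI to be built, so a missing input still has to be checked separately.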
    
  • 2020-11-28 21:52

    Here is a working example (ported from the previous answer) that extracts annotations with the Python module popplerqt5. Run it as: python3 extract.py sample.pdf

    import popplerqt5
    import argparse
    
    
    def extract(fn):
        doc = popplerqt5.Poppler.Document.load(fn)
        annotations = []
        for i in range(doc.numPages()):
            page = doc.page(i)
            for annot in page.annotations():
                contents = annot.contents()
                if contents:
                    annotations.append(contents)
                    print(f'page={i + 1} {contents}')
    
        print(f'{len(annotations)} annotation(s) found')
        return annotations
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('fn')
        args = parser.parse_args()
        extract(args.fn)
    
  • 2020-11-28 21:58

    You should DEFINITELY have a look at PyPDF2. This amazing library has incredible potential: you can extract almost anything from a PDF, including images and comments. Start by examining what Acrobat Reader DC (Reader) can give you about a PDF’s comments. Take a simple PDF, annotate it (add some comments) with Reader, then in the comments tab in the upper right corner click the three horizontal dots, click Export All To Data File... and select the format with the extension xfdf. This creates an XML file which you can parse. The format is transparent and largely self-explanatory.
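
    As a rough sketch of what parsing such an export could look like, the snippet below uses only the standard library. The embedded sample is a trimmed-down, hypothetical file modelled on the export shown further down in this answer, and the function name xfdf_comments is made up for illustration. It only looks at contents-richtext elements, which is an assumption based on that one export.

```python
import xml.etree.ElementTree as ET

XFDF_NS = {"x": "http://ns.adobe.com/xfdf/"}

# Trimmed-down sample modelled on an Acrobat Reader XFDF export (hypothetical data).
SAMPLE_XFDF = """<?xml version="1.0" ?>
<xfdf xml:space="preserve" xmlns="http://ns.adobe.com/xfdf/">
  <annots>
    <highlight page="0" subject="Highlight" title="Admin">
      <contents-richtext>
        <body xmlns="http://www.w3.org/1999/xhtml">
          <p dir="ltr"><span dir="ltr">comment2</span></p>
        </body>
      </contents-richtext>
    </highlight>
    <caret page="0" subject="Inserted Text" title="Admin"/>
  </annots>
</xfdf>"""


def xfdf_comments(xfdf_text):
    """Return (page, annotation type, text) for each annotation with rich-text contents."""
    root = ET.fromstring(xfdf_text)
    results = []
    for annot in root.findall("x:annots/*", XFDF_NS):
        rich = annot.find("x:contents-richtext", XFDF_NS)
        if rich is None:
            continue  # e.g. a caret or strikeout without a note
        text = "".join(rich.itertext()).strip()
        kind = annot.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        results.append((int(annot.get("page")), kind, text))
    return results


print(xfdf_comments(SAMPLE_XFDF))
```

    The page attribute is zero-based in the export, so add 1 if you want human-readable page numbers.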

    If, however, you cannot rely on a user clicking this and instead need to extract the same data from a PDF programmatically using python, do not despair, there is a solution. (Inspired by Extract images from PDF without resampling, in python?)

    Prerequisites:

    PyPDF2 (pip install PyPDF2)

    What Reader gives you in the above-mentioned xfdf file looks like this:

    <?xml version="1.0" ?>
    <xfdf xml:space="preserve" xmlns="http://ns.adobe.com/xfdf/">
        <annots>
            <caret IT="Replace" color="#0000FF" creationdate="D:20190221151519+01'00'" date="D:20190221151526+01'00'" flags="print" fringe="1.069520,1.069520,1.069520,1.069520" name="72f8d1b7-d878-4281-bd33-3a6fb4578673" page="0" rect="636.942000,476.891000,652.693000,489.725000" subject="Inserted Text" title="Admin">
                <contents-richtext>
                    <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                        <p dir="ltr">
                            <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"> comment1</span>
                        </p>
                    </body>
                </contents-richtext>
                <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,374.656000,941.008000,488.656000"/>
            </caret>
            <highlight color="#FFD100" coords="183.867000,402.332000,220.968000,402.332000,183.867000,387.587000,220.968000,387.587000" creationdate="D:20190221151441+01'00'" date="D:20190221151448+01'00'" flags="print" name="a18c7fb0-0af3-435e-8c32-1af2af3c46ea" opacity="0.399994" page="0" rect="179.930000,387.126000,224.904000,402.793000" subject="Highlight" title="Admin">
                <contents-richtext>
                    <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                        <p dir="ltr">
                            <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span>
                        </p>
                    </body>
                </contents-richtext>
                <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,288.332000,941.008000,402.332000"/>
            </highlight>
            <caret color="#0000FF" creationdate="D:20190221151452+01'00'" date="D:20190221151452+01'00'" flags="print" fringe="0.828156,0.828156,0.828156,0.828156" name="6bf0226e-a3fb-49bf-bc89-05bb671e1627" page="0" rect="285.877000,372.978000,298.073000,382.916000" subject="Inserted Text" title="Admin">
                <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,268.088000,941.008000,382.088000"/>
            </caret>
            <strikeout IT="StrikeOutTextEdit" color="#0000FF" coords="588.088000,497.406000,644.818000,497.406000,588.088000,477.960000,644.818000,477.960000" creationdate="D:20190221151519+01'00'" date="D:20190221151519+01'00'" flags="print" inreplyto="72f8d1b7-d878-4281-bd33-3a6fb4578673" name="6686b852-3924-4252-af21-c1b10390841f" page="0" rect="582.290000,476.745000,650.616000,498.621000" replyType="group" subject="Cross-Out" title="Admin">
                <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,383.406000,941.008000,497.406000"/>
            </strikeout>
        </annots>
        <f href="p1.pdf"/>
        <ids modified="ABB10FA107DAAA47822FB5D311112349" original="474F087D87E7E544F6DEB9E0A93ADFB2"/>
    </xfdf>
    

    Various types of comments are presented here as tags within an <annots> block. Python can give you almost the same data. To see it, look at the output of the following script:

    import sys
    import PyPDF2
    
    # Uses the legacy PyPDF2 API (PdfFileReader, getNumPages, getPage, getObject).
    try:
        src = sys.argv[1]
    except IndexError:
        src = r'/path/to/my/file.pdf'
    
    
    input1 = PyPDF2.PdfFileReader(open(src, "rb"))
    nPages = input1.getNumPages()
    
    for i in range(nPages):
        page0 = input1.getPage(i)
        try:
            annots = page0['/Annots']
        except KeyError:
            # there are no annotations on this page
            continue
        for annot in annots:
            print(annot.getObject())       # (1)
            print('')
    

    The output for the same file as in the xfdf file above will look like this:

    {'/Popup': IndirectObject(192, 0), '/M': u"D:20190221151448+01'00'", '/CreationDate': u"D:20190221151441+01'00'", '/NM': u'a18c7fb0-0af3-435e-8c32-1af2af3c46ea', '/F': 4, '/C': [1, 0.81961, 0], '/Rect': [179.93, 387.126, 224.904, 402.793], '/Type': '/Annot', '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u'otrasneho', '/QuadPoints': [183.867, 402.332, 220.968, 402.332, 183.867, 387.587, 220.968, 387.587], '/Subj': u'Highlight', '/CA': 0.39999, '/AP': {'/N': IndirectObject(202, 0)}, '/Subtype': '/Highlight'}
    
    {'/Parent': IndirectObject(191, 0), '/Rect': [737.008, 288.332, 941.008, 402.332], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A425D0>, '/Subtype': '/Popup'}
    
    {'/Popup': IndirectObject(194, 0), '/M': u"D:20190221151452+01'00'", '/CreationDate': u"D:20190221151452+01'00'", '/NM': u'6bf0226e-a3fb-49bf-bc89-05bb671e1627', '/F': 4, '/C': [0, 0, 1], '/Subj': u'Inserted Text', '/Rect': [285.877, 372.978, 298.073, 382.916], '/Type': '/Annot', '/P': IndirectObject(5, 0), '/AP': {'/N': IndirectObject(201, 0)}, '/RD': [0.82816, 0.82816, 0.82816, 0.82816], '/T': u'Admin', '/Subtype': '/Caret'}
    
    {'/Parent': IndirectObject(193, 0), '/Rect': [737.008, 268.088, 941.008, 382.088], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42830>, '/Subtype': '/Popup'}
    
    {'/Popup': IndirectObject(196, 0), '/M': u"D:20190221151519+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'6686b852-3924-4252-af21-c1b10390841f', '/F': 4, '/IRT': IndirectObject(197, 0), '/C': [0, 0, 1], '/Rect': [582.29, 476.745, 650.616, 498.621], '/Type': '/Annot', '/T': u'Admin', '/P': IndirectObject(5, 0), '/QuadPoints': [588.088, 497.406, 644.818, 497.406, 588.088, 477.96, 644.818, 477.96], '/Subj': u'Cross-Out', '/IT': '/StrikeOutTextEdit', '/AP': {'/N': IndirectObject(200, 0)}, '/RT': '/Group', '/Subtype': '/StrikeOut'}
    
    {'/Parent': IndirectObject(195, 0), '/Rect': [737.008, 383.406, 941.008, 497.406], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AF0>, '/Subtype': '/Popup'}
    
    {'/Popup': IndirectObject(198, 0), '/M': u"D:20190221151526+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'72f8d1b7-d878-4281-bd33-3a6fb4578673', '/F': 4, '/C': [0, 0, 1], '/Rect': [636.942, 476.891, 652.693, 489.725], '/Type': '/Annot', '/RD': [1.06952, 1.06952, 1.06952, 1.06952], '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment1</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u' pica', '/Subj': u'Inserted Text', '/IT': '/Replace', '/AP': {'/N': IndirectObject(212, 0)}, '/Subtype': '/Caret'}
    
    {'/Parent': IndirectObject(197, 0), '/Rect': [737.008, 374.656, 941.008, 488.656], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AB0>, '/Subtype': '/Popup'}
    

    If you examine the output, you will see that it carries more or less the same data as the xfdf file. Every comment in the xfdf file has two counterparts in PyPDF2’s output in Python: the annotation itself and its /Popup. The /C attribute is the color of the annotation, in RGB, as floats in the range [0, 1]. /Rect defines the bounding box of the comment on the page/spread, in points (1/72 of an inch) relative to the lower-left corner of the page, with values increasing to the right and upwards. /M and /CreationDate are the modification and creation times. /QuadPoints is a flat array [x1, y1, x2, y2, ..., xn, yn] holding the corner coordinates of the quadrilaterals that enclose the marked-up text, eight numbers per quadrilateral. /Subj, /Type, /Subtype and /IT identify the type of the comment, /T is the title shown on the comment’s popup (by convention the author’s name), and /RC is an XHTML representation of the comment’s text if there is one. An ink-drawn comment has an /InkList attribute with data in the form [[L1x1, L1y1, L1x2, L1y2, ..., L1xn, L1yn], [L2x1, L2y1, ..., L2xn, L2yn], ..., [Lmx1, Lmy1, ..., Lmxn, Lmyn]] for line 1, line 2, ..., line m.
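
    To make the field descriptions concrete, here is a small sketch that decodes three of them with the standard library only. The helper names color_to_hex, parse_pdf_date and quads are hypothetical, and the date parser only handles the full form with an explicit UTC offset that appears in the output above.

```python
import re
from datetime import datetime, timedelta, timezone


def color_to_hex(rgb):
    """Convert a /C array of floats in [0, 1] to an #RRGGBB string."""
    return "#" + "".join("{:02X}".format(round(c * 255)) for c in rgb)


def parse_pdf_date(value):
    """Parse a date such as D:20190221151448+01'00' into an aware datetime.

    Only the full form with an explicit +/- offset is handled in this sketch.
    """
    m = re.match(r"D:(\d{14})([+-])(\d{2})'(\d{2})'?$", value)
    if m is None:
        raise ValueError("unsupported date string: %r" % value)
    stamp, sign, hours, minutes = m.groups()
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        offset = -offset
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return naive.replace(tzinfo=timezone(offset))


def quads(quad_points):
    """Split a flat /QuadPoints array into quadrilaterals of four (x, y) points."""
    pts = list(zip(quad_points[0::2], quad_points[1::2]))
    return [pts[i:i + 4] for i in range(0, len(pts), 4)]


print(color_to_hex([1, 0.81961, 0]))
print(parse_pdf_date("D:20190221151448+01'00'").isoformat())
print(quads([183.867, 402.332, 220.968, 402.332,
             183.867, 387.587, 220.968, 387.587]))
```

    The first call reproduces the color="#FFD100" attribute that the xfdf export shows for the same highlight.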

    For a more thorough explanation of the various fields you get from getObject() on the line labeled (1) in the Python code above, please consult https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, especially section 12.5 Annotations on pages 381–413.
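
    Putting this together, a function that returns just the comment texts could look roughly like the sketch below. It works on any dict-like annotation objects; plain dicts with made-up values stand in for what getObject() returns, so that the example runs without PyPDF2. The names annotation_texts and rich_text_to_plain are hypothetical, and preferring /Contents over /RC is a design choice of this sketch, not something the library prescribes.

```python
import xml.etree.ElementTree as ET


def rich_text_to_plain(rc):
    """Strip the XHTML markup from an /RC value, keeping only the text."""
    return "".join(ET.fromstring(rc).itertext()).strip()


def annotation_texts(annots):
    """Return the text of every annotation that has one, skipping /Popup entries."""
    texts = []
    for annot in annots:
        if annot.get("/Subtype") == "/Popup":
            continue  # the popup only describes the window, not the comment
        if "/Contents" in annot:
            texts.append(annot["/Contents"].strip())
        elif "/RC" in annot:
            texts.append(rich_text_to_plain(annot["/RC"]))
    return texts


# Hypothetical stand-ins for the dictionaries printed by the script above.
sample = [
    {"/Subtype": "/Highlight", "/Contents": "first note"},
    {"/Subtype": "/Popup"},
    {"/Subtype": "/Caret",
     "/RC": '<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml">'
            '<p><span>second note</span></p></body>'},
    {"/Subtype": "/StrikeOut"},
]
print(annotation_texts(sample))
```

    To restrict the result to sticky notes only, you could additionally require the /Subtype to be /Text, which is the subtype the PDF specification uses for text (note) annotations.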

  • 2020-11-28 21:58

    I have never used it, nor have I needed this kind of feature, but I found PDFMiner. This link has information about basic usage; maybe this is what you are looking for?

  • 2020-11-28 21:58

    The author of PyMuPDF, @JorjMcKie, wrote a snippet for me and I modified it a bit:

    import fitz  # to import the PyMuPDF library
    # from pprint import pprint
    
    
    def _parse_highlight(annot: fitz.Annot, wordlist: list) -> str:
        points = annot.vertices
        quad_count = int(len(points) / 4)
        sentences = ['' for i in range(quad_count)]
        for i in range(quad_count):
            r = fitz.Quad(points[i * 4: i * 4 + 4]).rect
            words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
            sentences[i] = ' '.join(w[4] for w in words)
        sentence = ' '.join(sentences)
        return sentence
    
    
    def main() -> dict:
        doc = fitz.open('path/to/your/file')
        page = doc[0]
    
        wordlist = page.getText("words")  # list of words on page (get_text in newer PyMuPDF)
        wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    
        highlights = {}
        annot = page.firstAnnot  # first_annot in newer PyMuPDF
        i = 0
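        while annot:
            if annot.type[0] == 8:  # 8 is the highlight annotation type
                highlights[i] = _parse_highlight(annot, wordlist)
                # print before incrementing i, otherwise highlights[i] raises KeyError
                print('> ' + highlights[i] + '\n')
                i += 1
            annot = annot.next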
    
        # pprint(highlights)
        return highlights
    
    
    if __name__ == "__main__":
        main()
    
    • https://github.com/pymupdf/PyMuPDF
    • Is it posible to extract highlighted text? #318 https://github.com/pymupdf/PyMuPDF/issues/318#issuecomment-657102559

    Though there are still some small typos in the results:

    > system upsets,
    
    > expansion of smart grid monitoring devices that generally provide nodal voltages and power injections at fine spatial resolution,
    
    > hurricanes to indi- vidual lightning strikes),
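
    One of these artifacts, the "indi- vidual" split left over from a hyphenated line break, can be patched up after extraction. This is only a sketch with a hypothetical helper name, join_hyphenated. It will also merge intentional constructs such as "pre- and post-", so treat it as a heuristic.

```python
import re


def join_hyphenated(text):
    """Rejoin words split as 'indi- vidual' where a line break followed a hyphen."""
    # a word character, a hyphen, one space, a word character -> drop "- "
    return re.sub(r"(\w)- (\w)", r"\1\2", text)


print(join_hyphenated("hurricanes to indi- vidual lightning strikes),"))
```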
    