Extracting text from highlighted annotations in a PDF file

夕颜 2021-02-04 19:40

Since yesterday I've been trying to extract the text from some highlighted annotations in a PDF, using python-poppler-qt4.

According to this documentation, looks like I hav

1 answer
  • 2021-02-04 20:19

    According to the documentation for Annotation, the boundary property "returns this annotation's boundary rectangle in normalized coordinates". This seems a strange design decision, but we can simply scale the coordinates by the page.pageSize().width() and .height() values.

    import popplerqt4
    import sys
    import PyQt4.QtCore  # import the submodule explicitly so PyQt4.QtCore.QRectF resolves
    
    
    def main():
    
        doc = popplerqt4.Poppler.Document.load(sys.argv[1])
        total_annotations = 0
        for i in range(doc.numPages()):
            #print("========= PAGE {} =========".format(i+1))
            page = doc.page(i)
            annotations = page.annotations()
            (pwidth, pheight) = (page.pageSize().width(), page.pageSize().height())
            if len(annotations) > 0:
                for annotation in annotations:
                    if isinstance(annotation, popplerqt4.Poppler.Annotation):
                        total_annotations += 1
                        if isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation):
                            quads = annotation.highlightQuads()
                            txt = ""
                            for quad in quads:
                                rect = (quad.points[0].x() * pwidth,
                                        quad.points[0].y() * pheight,
                                        quad.points[2].x() * pwidth,
                                        quad.points[2].y() * pheight)
                                bdy = PyQt4.QtCore.QRectF()
                                bdy.setCoords(*rect)
                                txt = txt + unicode(page.text(bdy)) + ' '
    
                            #print("========= ANNOTATION =========")
                            print(unicode(txt))
    
        if total_annotations > 0:
            print(str(total_annotations) + " annotation(s) found")
        else:
            print("no annotations found")
    
    if __name__ == "__main__":
        main()
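
The coordinate scaling in the script above can be tried out without Poppler installed. The following is a minimal pure-Python sketch, not part of the original answer: `quad_to_page_rect` is a hypothetical helper name, and the quad is assumed to be given as four normalized (x, y) pairs. Unlike the script, which uses points[0] and points[2] as opposite corners, this variant takes the min/max over all four corners, so it does not depend on the order in which the points are stored.

```python
def quad_to_page_rect(points, page_width, page_height):
    """Scale normalized quad corners to a page-space bounding box.

    points: iterable of (x, y) pairs in the 0..1 range (assumed layout).
    Returns (left, top, right, bottom) in page units.
    """
    # Multiply each normalized coordinate by the matching page dimension.
    xs = [x * page_width for x, _ in points]
    ys = [y * page_height for _, y in points]
    # The bounding box is independent of the corner ordering.
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    quad = [(0.25, 0.5), (0.75, 0.5), (0.75, 0.625), (0.25, 0.625)]
    print(quad_to_page_rect(quad, 600, 800))
    # prints (150.0, 400.0, 450.0, 500.0)
```

The four returned values could then be passed to QRectF.setCoords() in the same way the script passes its rect tuple.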
    

    Additionally, I decided to concatenate the text from each of the .highlightQuads() to get a better representation of what was actually highlighted.

    Please be aware of the explicit space I have appended after each quad's region of text.
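
If the trailing space is unwanted, one possible alternative is to collect the per-quad strings in a list and join them at the end. This is only a sketch under that assumption: `join_quad_text` is a hypothetical helper, and it takes plain Python strings rather than QString values.

```python
def join_quad_text(pieces):
    """Join per-quad text fragments with single spaces, skipping empty ones."""
    # Trim whitespace around each fragment first.
    cleaned = [piece.strip() for piece in pieces]
    # Drop fragments that are empty after trimming, then join.
    return " ".join(piece for piece in cleaned if piece)


if __name__ == "__main__":
    print(join_quad_text(["first line", " second line ", "", "third"]))
    # prints first line second line third
```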

    In the example document, the returned QString could not be passed directly to print() or str(); the solution was to use unicode() instead.

    I hope this helps someone as it helped me.

    Note: Page rotation may affect the scaling values; I have not been able to test this.
