Return text string from physical coordinates in a PDF with Python

生来不讨喜 2021-02-03 15:14

I have been battling with Google and the limited documentation of PDFMiner for the last several hours, and although I feel close, I'm just not getting what I need. I've worke

2 answers
  •  爱一瞬间的悲伤
    2021-02-03 16:07

    I've been writing a library to try to simplify this process, pdfquery. To extract text from a particular place in a particular page, you would do:

    import pdfquery

    # `file` is the PDF to read; `filename` below is where the XML dump is written
    pdf = pdfquery.PDFQuery(file)
    # load first, third, fourth pages (page indexes are zero-based)
    pdf.load(0, 2, 3)
    # find text inside the box spanning 100 to 300 points from the bottom-left corner of the first page
    text = pdf.pq('LTPage[page_index=0] :in_bbox("100,100,300,300")').text()
    # save tree as XML to try to figure out why the last line didn't work the way you expected :)
    pdf.tree.write(filename, pretty_print=True)
    

    If you want to find individual characters within that box, instead of text lines entirely within that box, pass merge_tags=None to PDFQuery (by default it merges consecutive characters into a single element to make the tree less ridiculous, so the whole line would have to be inside the box). If you want to find anything that partially overlaps the box, use :overlaps_bbox instead of :in_bbox.
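To make the difference between the two selectors concrete, here is a rough sketch in plain Python of a containment test versus an overlap test. It is not pdfquery's own code: the helper names `in_bbox` and `overlaps_bbox` and the sample boxes are hypothetical, and it assumes boxes are `(x0, y0, x1, y1)` tuples in points measured from the bottom-left corner of the page, as in the selector above.

```python
from typing import Tuple

# (x0, y0, x1, y1) in points, origin at the bottom-left corner (assumed layout)
BBox = Tuple[float, float, float, float]


def in_bbox(element: BBox, box: BBox) -> bool:
    """True only when the element lies entirely inside the box."""
    ex0, ey0, ex1, ey1 = element
    bx0, by0, bx1, by1 = box
    return bx0 <= ex0 and by0 <= ey0 and ex1 <= bx1 and ey1 <= by1


def overlaps_bbox(element: BBox, box: BBox) -> bool:
    """True when the element and the box share any area or edge."""
    ex0, ey0, ex1, ey1 = element
    bx0, by0, bx1, by1 = box
    return ex0 <= bx1 and bx0 <= ex1 and ey0 <= by1 and by0 <= ey1


# Hypothetical search box and text-line boxes, for illustration only
box = (100, 100, 300, 300)
line_inside = (120, 150, 280, 162)      # fully within the box
line_straddling = (250, 150, 400, 162)  # starts inside, runs past the right edge
line_outside = (320, 150, 400, 162)     # entirely to the right of the box
```

With merged text lines, a line like `line_straddling` fails the containment test but passes the overlap test, which is why a long line can go missing from an `:in_bbox` query.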

    This is basically using PyQuery selector syntax to grab text from a PDFMiner layout, so if your document is too messy for PDFMiner, it may be too messy for this as well, but at least it will be faster to play with.
