Extract hyperlinks from PDF in Python

一笑奈何 提交于 2019-11-30 14:23:34

问题


I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.

For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.


回答1:


I think using PyPDF you could do that. If you want to extract the links from PDF. I am not sure where I got this from but it resides in my code as a part of something else. Hope this helps:

PDFFile = open('File Location','rb')

PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):

    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()

    if pageObject.has_key(key):
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if u[ank].has_key(uri):
            print u[ank][uri]

This I hope should give the links in your PDF. P.S: I haven't extensively tried this.




回答2:


This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on-the-fly.

It is possible to get the hyperlinks using PDFMiner. The complication is (like with so much about PDFs), there is really no relationship between the link annotations and the text of the link, except that they are both located at the same region of the page.

Here is the code I used to get links on a PDFPage

annotationList = []
if page.annots:
    for annotation in page.annots.resolve():
        annotationDict = annotation.resolve()
        if str(annotationDict["Subtype"]) != "/Link":
            # Skip over any annotations that are not links
            continue
        position = annotationDict["Rect"]
        uriDict = annotationDict["A"].resolve()
        # This has always been true so far.
        assert str(uriDict["S"]) == "/URI"
        # Some of my URI's have spaces.
        uri = uriDict["URI"].replace(" ", "%20")
        annotationList.append((position, uri))

Then I defined a function like:

def getOverlappingLink(annotationList, element):
    for (x0, y0, x1, y1), url in annotationList:
        if x0 > element.x1 or element.x0 > x1:
            continue
        if y0 > element.y1 or element.y0 > y1:
            continue
        return url
    else:
        return None

which I used to search the annotationList I previously found on the page to see if any hyperlink occupies the same region as a LTTextBoxHorizontal that I was inspecting on the page.

In my case, since PDFMiner was consolidating too much text together in the text box, I walked through the _objs attribute of each text box and looked though all of the LTTextLineHorizontal instances to see if they overlapped any of the annotation positions.




回答3:


slightly modified version of Ashwin's Answer:

import PyPDF2
PDFFile = open("file.pdf",'rb')

PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if uri in u[ank].keys():
                print(u[ank][uri])



回答4:


The hyperlink will actually be an annotation, so you need to process the annotation rather than 'extract the text'. I suspect that you are going to need to use a library such as itextsharp, or MuPDF, or Ghostscript if you are really desperate (and comfortable programming in PostScript).

I'd have thought it relatvely easy to process the annotations looking for type LNK though.




回答5:


Here's a version that creates a list of URLs in the simplest way I could find:

import PyPDF2

pdf = PyPDF2.PdfFileReader('filename.pdf')

urls = []
for page in range(pdf.numPages):
    pdfPage = pdf.getPage(page)
    try:
        for item in (pdfPage['/Annots']):
            urls.append(item['/A']['/URI'])
    except KeyError:
        pass


来源:https://stackoverflow.com/questions/27744210/extract-hyperlinks-from-pdf-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!