Read all bookmarks from a PDF document and create a dictionary with PageNumber and Title of the bookmark

Question

I am trying to read a PDF document using Python with the PyPDF2 package. The objective is to read all the bookmarks in the PDF and construct a dictionary with the page numbers of the bookmarks as keys and the titles of the bookmarks as values.

There is not much guidance on the internet on how to achieve this, except for this article. The code posted in it doesn't work, and I am not enough of a Python expert to correct it. PyPDF2's reader object has a property named outlines which gives you a list of all bookmark objects, but the bookmarks carry no page numbers, and traversing the list is a little difficult because there appear to be no parent/child relationships between bookmarks.

Below is my code to read a PDF document and inspect the outlines property.

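import PyPDF2

# Open the PDF in binary mode; the file must stay open while the reader is used
with open('SomeDocument.pdf', 'rb') as pdfObj:
    readerObj = PyPDF2.PdfFileReader(pdfObj)

    print(readerObj.numPages)        # total number of pages
    print(readerObj.outlines[1][1])  # one entry from a nested bookmark list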

Answer 1:

The parent/child relationships are preserved by having the lists nested in each other. This sample code will display bookmarks recursively as an indented table of contents:

import PyPDF2


def show_tree(bookmark_list, indent=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call with increased indentation
            show_tree(item, indent + 4)
        else:
            print(" " * indent + item.title)


reader = PyPDF2.PdfFileReader("[your filename]")

show_tree(reader.getOutlines())

I don't know how to retrieve the page numbers. I tried with a few files, and the page attribute of a Destination object is always an instance of IndirectObject, which doesn't seem to contain any information about the page number.

UPDATE:

The reader has a getDestinationPageNumber method that returns the page number (a zero-based index) of a Destination object. Here is the code modified to create your desired dictionary:

import PyPDF2


def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            # relies on the module-level reader created below
            result[reader.getDestinationPageNumber(item)] = item.title
    return result


reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))

However, note that you will overwrite and lose some values if there are multiple bookmarks on the same page (dictionary keys must be unique).
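
One way around the overwriting problem is to store a list of titles per page instead of a single title. The following is only a minimal sketch, not something I have run against a real PDF: bookmarks_by_page, page_number_of and fake_outline are names made up for illustration, and the stand-in objects merely mimic the title attribute of real bookmark entries.

```python
from collections import defaultdict
from types import SimpleNamespace


def bookmarks_by_page(bookmark_list, page_number_of, result=None):
    """Group bookmark titles by page number, keeping every title."""
    if result is None:
        result = defaultdict(list)
    for item in bookmark_list:
        if isinstance(item, list):
            # a nested list holds child bookmarks; collect them into the same mapping
            bookmarks_by_page(item, page_number_of, result)
        else:
            result[page_number_of(item)].append(item.title)
    return result


# Stand-in outline (hypothetical data) so the sketch runs without a PDF file.
fake_outline = [
    SimpleNamespace(title="Introduction", page=0),
    [
        SimpleNamespace(title="Background", page=0),
        SimpleNamespace(title="Scope", page=1),
    ],
    SimpleNamespace(title="Methods", page=2),
]

pages = bookmarks_by_page(fake_outline, lambda bookmark: bookmark.page)
print(dict(pages))
# {0: ['Introduction', 'Background'], 1: ['Scope'], 2: ['Methods']}
```

With a real document, the call would presumably be bookmarks_by_page(reader.getOutlines(), reader.getDestinationPageNumber), so that the reader's own method resolves each page number. Passing the lookup in as an argument also avoids depending on a global reader variable.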

Source: https://stackoverflow.com/questions/54303318/read-all-bookmarks-from-a-pdf-document-and-create-a-dictionary-with-pagenumber-a

Tags

python-3.x

pypdf2