问题
I a trying to read a PDF document using Python with PyPDF2 package. The objective is to read all the bookmarks in the pdf and construct a dictionary with page numbers of the bookmark as keys and titles of bookmarks as values.
There is not much support on the internet on how to achieve it except for this article. The code posted in it doesn't work and i am not an expert in python to correct it. PyPDF2's reader object has a property named outlines which gives you a list of all bookmark objects but there are no page numbers for bookmarks and traversing the list is little difficult as there are no parent/child relationships between bookmarks.
I am sharing below my code to read a pdf document and inspect outlines property.
import PyPDF2
pdfObj = open('SomeDocument.pdf', 'rb')
readerObj = PyPDF2.PdfFileReader(pdfObj)
print(readerObj.numPages)
print(readerObj.outlines[1][1])
回答1:
The parent/child relationships are preserved by having the lists nested in each other. This sample code will display bookmarks recursively as an indented table of contents:
import PyPDF2
def show_tree(bookmark_list, indent=0):
for item in bookmark_list:
if isinstance(item, list):
# recursive call with increased indentation
show_tree(item, indent + 4)
else:
print(" " * indent + item.title)
reader = PyPDF2.PdfFileReader("[your filename]")
show_tree(reader.getOutlines())
I don't know how to retrieve the page numbers. I tried with a few files, and the page
attribute of a Destination
object is always an instance of IndirectObject
, which doesn't seem to contain any information about page number.
UPDATE:
There is a getDestinationPageNumber method to get page numbers from Destination
objects. Modified code to create your desired dictionary:
import PyPDF2
def bookmark_dict(bookmark_list):
result = {}
for item in bookmark_list:
if isinstance(item, list):
# recursive call
result.update(bookmark_dict(item))
else:
result[reader.getDestinationPageNumber(item)] = item.title
return result
reader = PyPDF2.PdfFileReader("[your filename]")
print(bookmark_dict(reader.getOutlines()))
However, note that you will overwrite and lose some values if there are multiple bookmarks on the same page (dictionary keys must be unique).
来源:https://stackoverflow.com/questions/54303318/read-all-bookmarks-from-a-pdf-document-and-create-a-dictionary-with-pagenumber-a