Retrieve page numbers from document with pyPDF

逝去的感伤 2020-12-28 15:51

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number.

5 answers
  • 2020-12-28 15:56

    The other answers use PyPDF/PyPDF2, which seem to read the entire file. This takes a long time for large files.

    In the meantime I wrote something quick and dirty that doesn't take nearly as long to run. It makes a shell call, but I wasn't aware of any other way to do it. It gets the page count for PDFs of ~5000 pages very quickly.

    It works by calling the "pdfinfo" shell command, so it needs that command to be installed and probably only works on Linux. I've only tested it on Ubuntu so far.

    One thing to watch for: when pdfinfo fails, check_output raises subprocess.CalledProcessError, so that is the exception to catch in a try/except block.

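    from subprocess import check_output

    def get_num_pages(pdf_path):
        """Return the page count reported by the pdfinfo command-line tool."""
        output = check_output(["pdfinfo", pdf_path]).decode()
        # pdfinfo prints one "Key: value" pair per line; pick the "Pages:" entry.
        pages_line = [line for line in output.splitlines() if line.startswith("Pages:")][0]
        return int(pages_line.split(":")[1])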
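
The error handling mentioned above could look roughly like the sketch below. The names `parse_pdfinfo_pages` and `get_num_pages_or_none` are hypothetical, and returning `None` on failure is just one possible design choice. Parsing is kept separate from the shell call so it can be checked without pdfinfo installed.

```python
import subprocess

def parse_pdfinfo_pages(output):
    """Extract the integer after "Pages:" from pdfinfo's text output."""
    for line in output.splitlines():
        if line.startswith("Pages:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no 'Pages:' line in pdfinfo output")

def get_num_pages_or_none(pdf_path):
    """Return the page count, or None if pdfinfo is missing or fails."""
    try:
        raw = subprocess.check_output(
            ["pdfinfo", pdf_path], stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return None  # the pdfinfo executable is not on PATH
    except subprocess.CalledProcessError:
        return None  # pdfinfo exited with a non-zero status
    try:
        return parse_pdfinfo_pages(raw.decode(errors="replace"))
    except ValueError:
        return None  # output did not contain a usable "Pages:" line
```

A caller can then treat `None` as "count unknown" and fall back to one of the slower library-based approaches.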
    
  • 2020-12-28 16:02

    The following worked for me:

    from PyPDF2 import PdfFileReader
    pdf = PdfFileReader(open('path/to/file.pdf','rb'))
    pdf.getNumPages()
    
  • 2020-12-28 16:12

    For full documentation, see Adobe's 978-page PDF Reference. :-)

    More specifically, the PDF file contains metadata that indicates how the PDF's physical pages are mapped to logical page numbers and how page numbers should be formatted. This is where you go for canonical results. Example 2 of this page shows how this looks in the PDF markup. You'll have to fish that out, parse it, and perform a mapping yourself.

    In PyPDF, to get at this information, try, as a starting point:

    pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
    

    By the way, when you see an IndirectObject instance, you can call its getObject() method to retrieve the actual object being pointed to.
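
As a rough illustration of the mapping step, here is a sketch that turns an already-resolved /Nums array into one label per physical page. It uses plain Python lists and dicts in place of PyPDF objects (any IndirectObject is assumed to have been resolved with getObject() first), and the function names are hypothetical. The style codes (/D, /r, /R, /a, /A), the /P prefix and the /St start value follow the page-label description in the PDF Reference.

```python
def to_roman(n):
    """Lowercase roman numeral for n >= 1."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out

def to_letters(n):
    """a..z for 1..26, then aa..zz for 27..52, and so on."""
    return "abcdefghijklmnopqrstuvwxyz"[(n - 1) % 26] * ((n - 1) // 26 + 1)

def format_label(style, number, prefix=""):
    """Render one label from its style code, numeric value and prefix."""
    if style == "/D":
        body = str(number)
    elif style == "/r":
        body = to_roman(number)
    elif style == "/R":
        body = to_roman(number).upper()
    elif style == "/a":
        body = to_letters(number)
    elif style == "/A":
        body = to_letters(number).upper()
    else:
        body = ""  # no /S entry: the label consists of the prefix only
    return prefix + body

def page_labels(nums, page_count):
    """nums is a flat list [start_index, label_dict, start_index, label_dict, ...]."""
    ranges = sorted(zip(nums[0::2], nums[1::2]), key=lambda pair: pair[0])
    # Fallback for pages not covered by any range: plain 1-based numbers.
    labels = [str(i + 1) for i in range(page_count)]
    for i, (start, spec) in enumerate(ranges):
        stop = ranges[i + 1][0] if i + 1 < len(ranges) else page_count
        first = spec.get("/St", 1)
        for index in range(start, min(stop, page_count)):
            labels[index] = format_label(
                spec.get("/S"), first + (index - start), spec.get("/P", "")
            )
    return labels
```

For example, `page_labels([0, {"/S": "/r"}, 4, {"/S": "/D"}], 6)` labels the first four pages in lowercase roman numerals and restarts decimal numbering at 1 on the fifth page.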

    Your alternative is, as you say, to check the text objects and try to figure out which is the page number. You could use extractText() of the page object for this, but you'll get one string back and have to try to fish out the page number from that. (And of course the page number might be Roman or alphabetic instead of numeric, and some pages may not be numbered.) Instead, have a look at how extractText() actually does its job—PyPDF is written in Python, after all—and use it as a basis of a routine that checks each text object on the page individually to see if it's like a page number. Be wary of TOC/index pages that have lots of page numbers on them!
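
The per-fragment check could be sketched as follows. This is only a heuristic under stated assumptions: the page text has already been split into a list of fragments (how to get them out of PyPDF is not shown), the helper names are hypothetical, the four-digit limit and `max_candidates` threshold are arbitrary, and picking the last candidate assumes the page number sits in a footer.

```python
import re

_DECIMAL = re.compile(r"^\d{1,4}$")
# Well-formed roman numerals only; the lookahead rejects the empty string.
_ROMAN = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)

def _clean(fragment):
    # Drop whitespace and surrounding dashes, as in "- 12 -".
    return fragment.strip(" \t\r\n-")

def looks_like_page_number(fragment):
    """True if the fragment is a short decimal or a well-formed roman numeral."""
    text = _clean(fragment)
    return bool(_DECIMAL.match(text) or _ROMAN.match(text))

def guess_page_number(fragments, max_candidates=2):
    """Return the most likely page number on a page, or None if unclear."""
    candidates = [_clean(f) for f in fragments if looks_like_page_number(f)]
    # Many candidates suggest a TOC or index page, so give up rather than guess.
    if not candidates or len(candidates) > max_candidates:
        return None
    return candidates[-1]
```

Short words that happen to be valid roman numerals ("mix", "I", "dim") will produce false positives, so the result is best cross-checked against neighbouring pages.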

  • 2020-12-28 16:17

    Another option is PyMuPDF: https://pymupdf.readthedocs.io/en/latest/tutorial.html

    import fitz  # PyMuPDF's import name
    
    doc = fitz.open('Path To File')
    doc.pageCount  # spelled doc.page_count in newer PyMuPDF releases; len(doc) also works
    

    pip install pymupdf

    For large documents I was getting a recursion error when using PyPDF2, so this was another quick and simple way.

  • 2020-12-28 16:18

    The answer by kindall is very good. However, since a working code sample was requested later (by dreamer) and since I had the same problem today, I would like to add some notes.

    1. PDF structure is not uniform, and there is very little you can rely on, so any working code sample is unlikely to work for every file. A very good explanation can be found in this answer.

    2. As explained by kindall, you will most likely need to explore the PDF you are dealing with.

    Like so:

    import sys
    import PyPDF2 as pyPdf
    
    # Open your PDF.
    pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
    
    # Explore /PageLabels (if it exists).
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]
        print(page_label_type)
    except (KeyError, IndexError, TypeError):
        print("No /PageLabels object")
    
    # Select the item that is most likely to contain the information you want, e.g.
    #     {'/Nums': [0, IndirectObject(42, 0)]}
    # Here we only have "/Nums".
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
        print(page_label_type)
    except (KeyError, IndexError, TypeError):
        print("No /PageLabels object")
    
    # If you see a list, like
    #     [0, IndirectObject(42, 0)]
    # get the correct item from it.
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1]
        print(page_label_type)
    except (KeyError, IndexError, TypeError):
        print("No /PageLabels object")
    
    # If you then have an indirect object, like
    #     IndirectObject(42, 0)
    # use getObject().
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()
        print(page_label_type)
    except (KeyError, IndexError, TypeError):
        print("No /PageLabels object")
    
    # Now we have e.g.
    #     {'/S': '/r', '/St': 21}
    # meaning roman numerals, starting with page 21, i.e. xxi.
    # We can now also obtain the two values directly.
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
        print(page_label_type)
        start_page = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
        print(start_page)
    except (KeyError, IndexError, TypeError):
        print("No /PageLabels object")
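
The repeated try/except blocks in the exploration above can be folded into one small helper. The following is a sketch, not part of PyPDF2: `resolve` and `dig` are hypothetical names, and the helper assumes that indirect references expose either `getObject()` (as used above) or `get_object()`.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers a way to do so."""
    for name in ("getObject", "get_object"):
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    return obj

def dig(obj, *keys, default=None):
    """Walk nested dicts/lists key by key, resolving references on the way.

    Returns `default` as soon as a key or index is missing.
    """
    current = resolve(obj)
    for key in keys:
        try:
            current = resolve(current[key])
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Intended use with the reader from the exploration code (not executed here):
# style = dig(pdf.trailer, "/Root", "/PageLabels", "/Nums", 1, "/S", default="/D")
# start = dig(pdf.trailer, "/Root", "/PageLabels", "/Nums", 1, "/St", default=1)
```

Because a missing entry yields the default instead of an exception, the same call works for files with and without /PageLabels.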
    
    3. As the ISO PDF 1.7 specification shows (relevant section here), there are many ways to label pages. As a simple working example, consider this script, which handles at least decimal (Arabic) and lowercase Roman numerals:

    Script:

    import sys
    import PyPDF2 as pyPdf
    
    def arabic_to_roman(arabic):
        """Convert a positive integer to lowercase roman numerals."""
        roman = ''
        while arabic >= 1000:
            roman += 'm'
            arabic -= 1000
        diffs = [900, 500, 400, 300, 200, 100, 90, 50, 40, 30, 20, 10, 9, 5, 4, 3, 2, 1]
        digits = ['cm', 'd', 'cd', 'ccc', 'cc', 'c', 'xc', 'l', 'xl', 'xxx', 'xx', 'x', 'ix', 'v', 'iv', 'iii', 'ii', 'i']
        for diff, digit in zip(diffs, digits):
            if arabic >= diff:
                roman += digit
                arabic -= diff
        return roman
    
    def get_page_labels(pdf):
        # Only the first label range (/Nums[1]) is inspected, so the whole
        # document is assumed to use a single numbering style.
        try:
            page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
        except (KeyError, IndexError, TypeError):
            page_label_type = "/D"
        try:
            page_start = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
        except (KeyError, IndexError, TypeError):
            page_start = 1
        page_count = pdf.getNumPages()
        # Alternatively, read the count straight from the page tree:
        # page_count = pdf.trailer["/Root"]["/Pages"]["/Count"]
        page_stop = page_start + page_count
    
        if page_label_type == "/D":
            page_numbers = [str(n) for n in range(page_start, page_stop)]
        elif page_label_type == '/r':
            page_numbers = [arabic_to_roman(n) for n in range(page_start, page_stop)]
        else:
            # Other styles (e.g. /R, /A, /a) are not handled by this script.
            page_numbers = []
    
        print(page_label_type)
        print(page_start)
        print(page_count)
        print(page_numbers)
        return page_numbers
    
    pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
    get_page_labels(pdf)
    