Export PDF page labels on command line

-上瘾入骨i 2021-02-08 14:07

I'd like to export the page labels stored in some PDF documents for easy parsing. I know I could dig into the PDF document after having it converted with qpdf, but

1 answer
  •  清歌不尽
    2021-02-08 14:47

    Short answer:
    I am not aware of any (free) tool that can 'simply print' the page label for each page.

    Also, you will not be able to avoid expanding compressed objects and object streams with a tool like qpdf or one with equivalent capabilities.

    Long answer:
    There is no such tool because there are only a few things you can safely rely on when it comes to page labels:

    1. Each PDF document must contain a root object.
    2. That root object must be of /Type /Catalog.
    3. The document's trailer will show where to find the object using the key /Root followed by the indirect object number reference.
    4. IF a PDF document uses non-standard page labels, then the document root object must have an entry named /PageLabels.
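
    The first three points can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the file has already been rewritten into an uncompressed, text-friendly form (for example with qpdf's QDF mode), it has a classic `trailer` keyword rather than a cross-reference stream, and the names `find_root_ref`, `has_page_labels` and `sample_trailer` are hypothetical helpers invented here, as is the `/Size` value in the sample.

```python
import re


def find_root_ref(pdf_text: str) -> tuple[int, int]:
    """Return (object number, generation number) of the trailer's /Root entry."""
    # With incremental updates, the last trailer in the file is the current one.
    pos = pdf_text.rfind("trailer")
    if pos < 0:
        raise ValueError("no classic trailer found")
    match = re.search(r"/Root\s+(\d+)\s+(\d+)\s+R", pdf_text[pos:])
    if match is None:
        raise ValueError("trailer has no /Root entry")
    return int(match.group(1)), int(match.group(2))


def has_page_labels(catalog_text: str) -> bool:
    """Check whether the text of a catalog object contains a /PageLabels key."""
    return re.search(r"/PageLabels\b", catalog_text) is not None


# Hypothetical trailer pointing at the root object number used in this answer.
sample_trailer = "trailer\n<< /Size 300 /Root 213 0 R >>\nstartxref\n12345\n%%EOF\n"
print(find_root_ref(sample_trailer))
```

    A regular expression is not a PDF parser, so this only works on input that is already plain ASCII; anything still inside a compressed stream stays invisible to it.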

    This is where it stops being relatively easy, because the object that the /PageLabels key refers to may be contained in a compressed object stream. In that case you have to expand the object stream first.

    Once you have the description of the page labels as ASCII, you will discover that it is not an easily parseable flat list (as a dictionary is): it is a number tree.
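
    To make "number tree" a little more concrete: a node either carries its entries directly in a /Nums array (alternating key and value) or delegates to child nodes through /Kids. The sketch below flattens such a tree into a sorted list. It assumes the tree has already been parsed into nested Python dicts and lists with all indirect references resolved; that representation, the function name `flatten_number_tree` and the sample `tree` are assumptions made for illustration, not the output of any particular tool.

```python
def flatten_number_tree(node: dict) -> list:
    """Collect all (key, value) pairs of a number tree, sorted by key."""
    entries = []
    # Leaf part: /Nums is a flat array [key1, value1, key2, value2, ...].
    nums = node.get("/Nums", [])
    for i in range(0, len(nums) - 1, 2):
        entries.append((nums[i], nums[i + 1]))
    # Intermediate part: /Kids lists child nodes, each a number tree itself.
    for kid in node.get("/Kids", []):
        entries.extend(flatten_number_tree(kid))
    # Sort by key only, so the dict values are never compared.
    entries.sort(key=lambda entry: entry[0])
    return entries


# Hypothetical two-level tree holding three page label ranges.
tree = {
    "/Kids": [
        {"/Limits": [0, 7], "/Nums": [0, {"/S": "/r"}, 7, {"/S": "/D"}]},
        {"/Limits": [11, 11], "/Nums": [11, {"/S": "/D", "/P": "ABCD-", "/St": 3}]},
    ]
}
for key, value in flatten_number_tree(tree):
    print(key, value)
```

    A real file may nest /Kids several levels deep, which is why the function recurses instead of looking only at the top-level node.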

    I won't go into the details of these complexities, because it would take a very long article to describe all possible variations. You had better read up on them directly in the official PDF 1.7 specification (ISO 32000-1).

    Instead, I'll give you an example in ASCII PDF code:

    213 0 obj
      << /Type /Catalog
         /PageLabels 
            << 
               /Nums 
                     [ 
                       0 <<           % start labeling from page no. 1
                           /S /r      % label with lowercase roman numbers
                         >> 
                       7 <<           % start new labeling from page no. 8
                           /S /D      % label with standard decimal numbers
                         >> 
                       11 <<          % start new labeling from page no. 12
                           /S /D      % label with decimal numbers...
                           /P (ABCD-) %   ...but using label prefix 'ABCD-'...
                           /St 3      %   ...followed by '3' as the start decimal.
                         >>
                      ]
            >>
         %%...........................
         %%...more root object keys...
         %%........................... 
      >>
    endobj
    

    The above example labels pages 1, 2, 3, ... (last) like this:

    i
    ii
    iii
    iv
    v
    vi
    vii
    1
    2
    3
    4
    ABCD-3
    ABCD-4
    ABCD-5
    ABCD-6
    ...and so on until last page...
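
    The mapping from label ranges to per-page labels can be sketched in Python as follows. It assumes the number tree has already been flattened into a sorted list of (page index, label dictionary) pairs starting at index 0, with /S, /P and /St stored as plain Python values; that representation and the names `page_labels`, `format_number`, `to_roman` and `to_letters` are hypothetical, chosen only for this illustration.

```python
def to_roman(n: int) -> str:
    """Lowercase roman numeral for a positive integer."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
                (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
                (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n: int) -> str:
    """PDF letter style: a..z, then aa..zz, then aaa..zzz, and so on."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(style, n: int) -> str:
    """Render n in one of the /S styles; without /S only the prefix is used."""
    if style == "/D":
        return str(n)
    if style == "/r":
        return to_roman(n)
    if style == "/R":
        return to_roman(n).upper()
    if style == "/a":
        return to_letters(n)
    if style == "/A":
        return to_letters(n).upper()
    return ""


def page_labels(ranges: list, page_count: int) -> list:
    """Expand sorted (start index, label dict) ranges into one label per page."""
    labels = []
    for i, (start, spec) in enumerate(ranges):
        # A range ends where the next one starts, or at the last page.
        end = ranges[i + 1][0] if i + 1 < len(ranges) else page_count
        first = spec.get("/St", 1)  # /St defaults to 1
        for offset in range(min(end, page_count) - start):
            number = format_number(spec.get("/S"), first + offset)
            labels.append(spec.get("/P", "") + number)
    return labels


# The three ranges from the example catalog object, for a 15-page document.
example_ranges = [
    (0, {"/S": "/r"}),
    (7, {"/S": "/D"}),
    (11, {"/S": "/D", "/P": "ABCD-", "/St": 3}),
]
print("\n".join(page_labels(example_ranges, 15)))
```

    Because the keys in /Nums are zero-based page indices, the first range covers indices 0 to 6 (seven roman-numbered pages) and the decimal range begins on the eighth page.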
    

    As you can see, the PDF method of labeling pages (mapping page numbers to page names) is completely non-intuitive. You can only understand it by studying the PDF specification.
