Python PyPDF2 join pages

若如初见. Submitted on 2019-12-11 16:38:22

Question


I have a PDF with a big table split across pages, so I need to join the per-page tables into one big table on a single large page.

Is this possible with PyPDF2 or another library?

Cheers


Answer 1:


I'm just working on something similar: it takes an input PDF, and a config file sets the final arrangement of the single pages. It is implemented with PyPDF2 but still has issues with some PDF files (I have to dig deeper). https://github.com/Lageos/pdf-stitcher

In principle, adding a page to the right of another one works like this (using the PyPDF2 1.x API names):

import PyPDF2

# zero-based indices of the two pages to join (example values)
page_number = 0
second_page_number = 1

with open('input.pdf', 'rb') as input_file:
    # load input pdf
    input_pdf = PyPDF2.PdfFileReader(input_file)

    # first page: the PageObject the second page is merged into
    output_pdf = input_pdf.getPage(page_number)

    # get second page PyPDF2 PageObject
    second_pdf = input_pdf.getPage(second_page_number)

    # x coordinate of the first page's right edge, so the second page lands to its right
    offset_x = output_pdf.mediaBox[2]
    offset_y = 0

    # add second page to first one; expand=True grows the page to fit both
    output_pdf.mergeTranslatedPage(second_pdf, offset_x, offset_y, expand=True)

    # write finished pdf while the input file is still open
    with open('output.pdf', 'wb') as out_file:
        write_pdf = PyPDF2.PdfFileWriter()
        write_pdf.addPage(output_pdf)
        write_pdf.write(out_file)

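Stacking pages vertically needs an offset_y instead of an offset_x. You can get the amount from offset_y = output_pdf.mediaBox[3]. Note that PDF coordinates start at the bottom-left corner, so a positive offset_y puts the second page above the first. To put it below, start from the second page and merge the first page into it, shifted up by the second page's height.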
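For more than two pages, the offsets for a whole grid can be worked out up front. The following is only a sketch in plain Python, not code from pdf-stitcher: the function name grid_offsets and its parameters are made up for illustration, and it assumes each page's lower-left corner sits at (0, 0). Each resulting (x, y) pair could then be passed as the translation for the corresponding page.

```python
def grid_offsets(page_sizes, columns):
    """Return (offsets, canvas_size) for laying pages out in a grid.

    page_sizes: list of (width, height) tuples in PDF points, in reading order.
    columns: number of pages per row.
    PDF coordinates start at the bottom-left corner, so the first row
    receives the largest y offset.
    """
    # split the pages into rows of at most `columns` pages
    rows = [page_sizes[i:i + columns] for i in range(0, len(page_sizes), columns)]
    row_heights = [max(h for _, h in row) for row in rows]
    total_height = sum(row_heights)
    total_width = max(sum(w for w, _ in row) for row in rows)

    offsets = []
    top = total_height  # y coordinate of the current row's top edge
    for row, row_height in zip(rows, row_heights):
        x = 0
        for width, height in row:
            # align the page's top edge with the row's top edge
            offsets.append((x, top - height))
            x += width
        top -= row_height
    return offsets, (total_width, total_height)


if __name__ == "__main__":
    a4 = (595, 842)  # approximate A4 size in points
    offsets, canvas = grid_offsets([a4, a4, a4, a4], columns=2)
    print(offsets)
    print(canvas)
```

With four A4-sized pages in two columns, the first row gets y = 842 and the second row y = 0, which matches the bottom-left origin described above.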




回答2:


My understanding is that this is quite hard. See here and here.

The problem seems to be that tables aren't well represented in PDFs; they are simply made from absolutely positioned lines (see first link above).

Here are two possible workarounds (not sure if they will do it for you):

  • you could print multiple pages on one page and scale the page to keep it readable
  • open the PDF with Inkscape or something similar. Once ungrouped, you should have access to the individual elements that make up the tables and be able to combine them the way that suits you
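For the first workaround, it helps to know how much each page has to shrink. The sketch below is only an illustration: the helper name n_up_scale is hypothetical, and it assumes all pages share one size and are arranged in a simple grid without margins. It tries every column count and keeps the grid that allows the largest uniform scale.

```python
import math


def n_up_scale(page_size, sheet_size, count):
    """Return (scale, cols, rows) for fitting `count` equal pages on one sheet.

    page_size and sheet_size are (width, height) tuples in the same unit.
    The grid with the largest uniform scale factor wins; ties keep the
    grid with fewer columns.
    """
    page_w, page_h = page_size
    sheet_w, sheet_h = sheet_size
    best = None
    for cols in range(1, count + 1):
        rows = math.ceil(count / cols)
        # the scale is limited by whichever direction runs out of room first
        scale = min(sheet_w / (cols * page_w), sheet_h / (rows * page_h))
        if best is None or scale > best[0]:
            best = (scale, cols, rows)
    return best


if __name__ == "__main__":
    # two portrait A4-sized pages on one landscape A4-sized sheet
    print(n_up_scale((595, 842), (842, 595), 2))
```

A scale well below 0.5 is a hint that the joined table will be hard to read at the chosen sheet size, in which case a larger output page may be the better option.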

EDIT

Have a look at LibreOffice Draw, another vector package. I just opened a PDF in it, and it seems to preserve some of the PDF structure and allow editing of individual elements.

EDIT 2

Have a look at pdftables, which might help.

PDFTables helps with extracting tables from PDF files.

I haven't tried it though... might have some time a bit later to see if I can get it to work.
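Once the per-page tables are extracted, joining them is the easy part. The sketch below does not use pdftables or any other extraction library. It assumes some tool has already produced one list of rows per page, and the function name join_page_tables is made up for illustration. It concatenates the rows and drops a header row that repeats on later pages.

```python
def join_page_tables(page_tables):
    """Concatenate per-page tables, keeping a repeated header row only once.

    page_tables: list of tables, one per page. Each table is a list of rows
    and each row is a list of cell strings. The first row of the first
    non-empty table is treated as the header (an assumption).
    """
    joined = []
    header = None
    for table in page_tables:
        if not table:
            continue  # skip pages without any rows
        if header is None:
            header = table[0]
            joined.extend(table)
        elif table[0] == header:
            # the header repeats on this page: keep only the data rows
            joined.extend(table[1:])
        else:
            joined.extend(table)
    return joined


if __name__ == "__main__":
    pages = [
        [["id", "name"], ["1", "a"]],
        [["id", "name"], ["2", "b"]],
        [["3", "c"]],
    ]
    for row in join_page_tables(pages):
        print(row)
```

The joined rows can then be written to CSV or rendered onto one large page with whatever tool suits the rest of the workflow.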



Source: https://stackoverflow.com/questions/24580454/python-pypdf2-join-pages
