How to identify page breaks using python-docx from docx

情话喂你 2021-01-06 06:49

I have several .docx files that contain a number of similar blocks of text: docx files that contain 300+ press releases that are 1-2 pages each, that need to be separated in

1 answer
  •  -上瘾入骨i
    2021-01-06 06:57

    A hard page break appears as a <w:br> element with w:type="page" inside a run element (<w:r>), something like this:

    <w:p>
      <w:r>
        <w:t>some text</w:t>
        <w:br w:type="page"/>
      </w:r>
    </w:p>

    So one approach would be to replace all those occurrences with a distinctive string of text, like maybe "{{foobar}}".
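
To make the marker idea concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a hand-written XML fragment, not python-docx on a real file. The SAMPLE fragment, the MARKER constant and the mark_page_breaks helper are illustrative names invented for this sketch; the namespace URI is the standard WordprocessingML one.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
MARKER = "{{foobar}}"  # assumption: this string never occurs in the real text

# Hypothetical two-paragraph body; the first run ends with a hard page break.
SAMPLE = (
    f'<w:body xmlns:w="{W}">'
    '<w:p><w:r><w:t>first release</w:t><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:t>second release</w:t></w:r></w:p>'
    '</w:body>'
)


def mark_page_breaks(body):
    """Swap each hard page break under `body` for a <w:t> holding MARKER.

    Returns the number of breaks replaced.
    """
    count = 0
    # findall() returns a list, so the tree can be modified safely in the loop
    for run in body.findall(".//w:p/w:r", NS):
        for i, child in enumerate(list(run)):
            is_br = child.tag == f"{{{W}}}br"
            if is_br and child.get(f"{{{W}}}type") == "page":
                t = ET.Element(f"{{{W}}}t")
                t.text = MARKER
                run[i] = t  # one-for-one replacement keeps indexes valid
                count += 1
    return count


body = ET.fromstring(SAMPLE)
replaced = mark_page_breaks(body)

# Join the text of each paragraph, then cut the whole text at the marker.
text = "\n".join(
    "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
    for p in body.iterfind("w:p", NS)
)
chunks = [chunk.strip() for chunk in text.split(MARKER)]
```

With the sample above, chunks holds one string per page-separated block. The marker only works if it cannot collide with real document text, so something unusual is a safer choice than a common word.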

    An implementation of that would be something like this:

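    from docx import Document
    from docx.oxml import OxmlElement

    # "input.docx" is a placeholder path; `document` must be a python-docx Document
    document = Document("input.docx")

    # python-docx elements pre-register the "w" prefix, so xpath() takes no namespaces argument
    page_br_elements = document.element.xpath("//w:p/w:r/w:br[@w:type='page']")
    for br in page_br_elements:
        # lxml rejects etree.Element('w:t'); OxmlElement resolves the prefix to the full namespace
        t = OxmlElement('w:t')
        t.text = '{{foobar}}'
        # put the marker where the break was, then drop the break itself
        br.addprevious(t)
        br.getparent().remove(br)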
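
If the goal is only to pull the text apart at each hard page break, the marker step can be skipped. The sketch below is an alternative that does not use python-docx at all: it reads word/document.xml straight out of the .docx zip with the standard library. The function name split_on_page_breaks is made up for this example, and it assumes every press release ends with a hard page break (a <w:br w:type="page"/>), not a break that Word inserted only when laying out the page.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def split_on_page_breaks(docx_file):
    """Return the document text as a list of chunks, cut at each hard page break.

    `docx_file` may be a path or a binary file-like object.
    Caveat: paragraphs nested inside other paragraphs (e.g. text boxes)
    would be visited twice; this sketch does not handle that case.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    chunks, current = [], []
    for p in root.iter(f"{{{W}}}p"):
        for node in p.iter():
            if node.tag == f"{{{W}}}t":
                current.append(node.text or "")
            elif (node.tag == f"{{{W}}}br"
                  and node.get(f"{{{W}}}type") == "page"):
                # close the current chunk at the break and start a new one
                chunks.append("".join(current).strip())
                current = []
        current.append("\n")  # keep paragraph boundaries as newlines
    chunks.append("".join(current).strip())
    return chunks
```

Each returned string could then be written to its own file. This only recovers plain text; keeping formatting would mean copying the paragraph elements themselves into new documents.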
    

    I don't have time to test this, so you might run into some missing imports or whatever, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.

    Let me know how you go and I can tweak this if needed.
