How to identify page breaks in a docx file using python-docx

Submitted by 大憨熊 on 2019-11-30 22:57:53

A hard page break will appear as a <w:br> element within a run element (<w:r>), something like this:

<w:p>
  <w:r>
    <w:t>some text</w:t>
    <w:br w:type="page"/>
  </w:r>
</w:p>
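To make the structure above concrete, here is a minimal sketch that finds such elements using only the standard library. The sample XML, the `SAMPLE` name, and the `paragraphs_with_page_break` helper are made up for illustration; in a real file the same elements live in word/document.xml inside the .docx archive. Note that a `<w:br>` without `w:type="page"` is an ordinary line break, so the check looks at the attribute and not just the tag.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment: one paragraph with a hard page break,
# one paragraph with a plain line break.
SAMPLE = f"""
<w:body xmlns:w="{W_NS}">
  <w:p>
    <w:r>
      <w:t>some text</w:t>
      <w:br w:type="page"/>
    </w:r>
  </w:p>
  <w:p>
    <w:r>
      <w:t>line one</w:t>
      <w:br/>
      <w:t>line two</w:t>
    </w:r>
  </w:p>
</w:body>
""".strip()


def paragraphs_with_page_break(body):
    """Return the indexes of <w:p> children of body containing a hard page break."""
    type_attr = f"{{{W_NS}}}type"
    hits = []
    for index, p in enumerate(body.findall(f"{{{W_NS}}}p")):
        for br in p.iter(f"{{{W_NS}}}br"):
            if br.get(type_attr) == "page":
                hits.append(index)
                break  # one hit per paragraph is enough
    return hits


body = ET.fromstring(SAMPLE)
print(paragraphs_with_page_break(body))  # [0]
```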

So one approach would be to replace each of those elements with a distinctive string of text, say "{{foobar}}", which you can then search for in the extracted text.

An implementation of that would be something like this:

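from docx import Document
from docx.oxml import OxmlElement

# Assumes a current python-docx release; "example.docx" is a placeholder path.
document = Document("example.docx")

# The xpath() method on python-docx elements already maps the 'w' prefix.
page_br_elements = document.element.xpath("//w:p/w:r/w:br[@w:type='page']")
for br in page_br_elements:
    # Build <w:t> with OxmlElement; lxml rejects a literal 'w:t' tag name.
    t = OxmlElement('w:t')
    t.text = '{{foobar}}'
    br.addprevious(t)
    # Remove the page break now that the marker text sits in its place.
    br.getparent().remove(br)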

I don't have time to test this, so you might run into some missing imports or whatever, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.

Let me know how you go and I can tweak this if needed.
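Once the markers are in place, one possible way to use them is to group the paragraph text by page. The following is only a sketch: the `split_into_pages` helper is hypothetical, and it assumes the input is a list of strings such as you might get from `[p.text for p in document.paragraphs]`. Empty paragraphs are dropped in this version.

```python
MARKER = "{{foobar}}"  # the placeholder string suggested above


def split_into_pages(paragraph_texts, marker=MARKER):
    """Group paragraph strings into pages, starting a new page at each marker."""
    pages = [[]]
    for text in paragraph_texts:
        pieces = text.split(marker)
        # Text before the first marker still belongs to the current page.
        if pieces[0]:
            pages[-1].append(pieces[0])
        # Every further piece follows a marker, so it opens a new page.
        for piece in pieces[1:]:
            pages.append([piece] if piece else [])
    return pages


texts = ["Title", "some text{{foobar}}", "Chapter 1", "end{{foobar}}next"]
print(split_into_pages(texts))
# [['Title', 'some text'], ['Chapter 1', 'end'], ['next']]
```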
