I have several .docx files that contain a number of similar blocks of text: 300+ press releases, 1-2 pages each, that need to be separated in
A hard page break will appear as a <w:br> element within a run element (<w:r>), something like this:
<w:p>
  <w:r>
    <w:t>some text</w:t>
    <w:br w:type="page"/>
  </w:r>
</w:p>
So one approach would be to replace all those occurrences with a distinctive string of text, like maybe "{{foobar}}".
An implementation of that would be something like this:
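from lxml import etree
from docx import nsprefixes

# `document` is assumed to be the lxml root element of word/document.xml,
# e.g. as returned by opendocx() in the docx module.
W = nsprefixes['w']
page_br_elements = document.xpath(
    "//w:p/w:r/w:br[@w:type='page']", namespaces={'w': W}
)
for br in page_br_elements:
    # lxml rejects a prefixed tag like 'w:t'; use Clark notation {namespace}t.
    t = etree.Element('{%s}t' % W, nsmap={'w': W})
    t.text = '{{foobar}}'
    br.addprevious(t)  # put the marker where the break was
    br.getparent().remove(br)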
I don't have time to test this, so you might run into some missing imports or whatever, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.
Let me know how you go and I can tweak this if needed.
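If it helps to try the idea without a .docx file at hand, here is a minimal sketch that runs on an in-memory XML sample. It swaps in the standard library's xml.etree.ElementTree in place of lxml and the docx module, so it is a variant of the approach above rather than the same code. The helper names (w, replace_page_breaks, split_on_marker) and the sample XML are made up for illustration. The final split on the marker is my assumption about what "separated" means for the press releases.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, i.e. what the 'w' prefix maps to.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MARKER = "{{foobar}}"  # the placeholder string suggested above


def w(tag):
    """Return the namespace-qualified (Clark notation) name for a w: tag."""
    return "{%s}%s" % (W_NS, tag)


def replace_page_breaks(root, marker=MARKER):
    """Swap each <w:br w:type="page"/> in a run for a <w:t> holding marker."""
    count = 0
    for run in list(root.iter(w("r"))):
        # Walk a copy of the children because the run is modified in the loop.
        for index, child in enumerate(list(run)):
            if child.tag == w("br") and child.get(w("type")) == "page":
                t = ET.Element(w("t"))
                t.text = marker
                run.remove(child)
                run.insert(index, t)  # same position the break occupied
                count += 1
    return count


def split_on_marker(root, marker=MARKER):
    """Collect paragraph text and split it into chunks at each marker."""
    paragraphs = [
        "".join(t.text or "" for t in p.iter(w("t")))
        for p in root.iter(w("p"))
    ]
    return [chunk.strip() for chunk in "\n".join(paragraphs).split(marker)]


# Hypothetical two-paragraph sample standing in for word/document.xml.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>first release</w:t><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:t>second release</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)

root = ET.fromstring(SAMPLE)
replaced = replace_page_breaks(root)
chunks = split_on_marker(root)
print(replaced, chunks)
```

For a real file, the root element would come from parsing the word/document.xml member of the .docx archive (a .docx file is a zip archive) instead of SAMPLE.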