I want to extract text from Word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.
Running the below
I had the same problem for years (maybe for as long as this question has existed).
By looking at the "etienned" code posted by @yiftah and at the attributes of Paragraph, I found a way to retrieve the text as it would read after accepting the changes.
The trick is to use p._p.xml to get the XML of the paragraph and then apply the "etienned" code to it (i.e. retrieve all the <w:t> elements from the XML, which covers both regular runs and <w:ins> blocks).
Hope it can help the souls lost like I was:
from docx import Document

try:
    from xml.etree.cElementTree import XML
except ImportError:
    # cElementTree was removed in Python 3.9; fall back to the standard module
    from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"


def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # Element.getiterator() was removed in Python 3.9; iter() replaces it
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text


doc = Document("Hello.docx")
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")
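To see why collecting only the w:t elements gives the "accepted" text, here is a minimal sketch that needs neither python-docx nor a .docx file. The SAMPLE paragraph is hand-written for illustration (it is an assumption, not taken from a real document), and accepted_text_from_xml is a hypothetical helper name. It relies on the fact that WordprocessingML stores deleted text in w:delText rather than w:t, so a search for w:t skips deletions and still picks up insertions.

```python
from xml.etree.ElementTree import XML

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample paragraph (hypothetical, not from a real file):
# one regular run, one tracked deletion, one tracked insertion, one regular run.
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:del w:id="1" w:author="a"><w:r><w:delText>old</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="a"><w:r><w:t>new</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r>'
    '</w:p>' % W
)


def accepted_text_from_xml(xml):
    """Join every w:t node in document order; w:delText nodes are not matched."""
    tree = XML(xml)
    return "".join(node.text for node in tree.iter("{%s}t" % W) if node.text)


print(accepted_text_from_xml(SAMPLE))  # Hello new world
```

The same function body could be pointed at p._p.xml for a real paragraph, which is what get_accepted_text does.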