How to extract text inserted with track-changes in python-docx

误落风尘 2021-01-23 03:19

I want to extract text from Word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.

Running the below

3 answers
  •  北恋 (OP)
    2021-01-23 03:53

    I had the same problem for years (maybe for as long as this question has existed).

    By looking at the code of "etienned" posted by @yiftah and the attributes of Paragraph, I have found a solution to retrieve the text after accepting the changes.

    The trick is to read p._p.xml to get the XML of the paragraph and then run the "etienned" code on that (i.e. retrieve all the w:t elements from the XML, which covers both regular runs and runs inside w:ins blocks).

    Hope it can help the souls lost like I was:

    from docx import Document
    
    from xml.etree.ElementTree import XML
    
    
    WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    TEXT = WORD_NAMESPACE + "t"
    
    
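    def get_accepted_text(p):
        """Return the text of a paragraph as if all tracked changes were accepted."""
        xml = p._p.xml  # serialized XML of the underlying w:p element
        if "w:del" in xml or "w:ins" in xml:
            tree = XML(xml)
            # Deleted text is stored in w:delText, so collecting only w:t
            # keeps regular and inserted runs and skips deletions.
            runs = (node.text for node in tree.iter(TEXT) if node.text)
            return "".join(runs)
        else:
            return p.text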
    
    
    doc = Document("Hello.docx")
    
    for p in doc.paragraphs:
        print(p.text)
        print("---")
        print(get_accepted_text(p))
        print("=========")
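
To see why collecting only the w:t elements works, here is a minimal sketch that runs on a hand-written paragraph instead of a real .docx file. SAMPLE_XML and both function names are made up for illustration; the sketch assumes the usual WordprocessingML layout, where deleted text sits in w:delText inside w:del and inserted text sits in ordinary w:t inside w:ins. It also shows one possible way to return only the inserted text, which is what the question asks for.

```python
from xml.etree.ElementTree import XML

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TEXT = "{%s}t" % W    # regular and inserted text
INS = "{%s}ins" % W   # wrapper around tracked insertions

# Hypothetical paragraph: "Hello ", deleted "old ", inserted "new ", "world".
SAMPLE_XML = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:del w:id="1" w:author="a">'
    '<w:r><w:delText xml:space="preserve">old </w:delText></w:r>'
    '</w:del>'
    '<w:ins w:id="2" w:author="a">'
    '<w:r><w:t xml:space="preserve">new </w:t></w:r>'
    '</w:ins>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p>' % W
)


def accepted_text_from_xml(xml):
    """Text with all changes accepted: every w:t, no w:delText."""
    tree = XML(xml)
    return "".join(node.text for node in tree.iter(TEXT) if node.text)


def inserted_text_from_xml(xml):
    """Only the text that sits inside w:ins blocks."""
    tree = XML(xml)
    parts = []
    for ins in tree.iter(INS):
        parts.extend(node.text for node in ins.iter(TEXT) if node.text)
    return "".join(parts)


print(accepted_text_from_xml(SAMPLE_XML))  # Hello new world
print(repr(inserted_text_from_xml(SAMPLE_XML)))  # 'new '
```

With python-docx, the same functions could be fed p._p.xml for each paragraph, as in the answer above.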
    
