How to extract text inserted with track-changes in python-docx

前端 未结 3 2032
误落风尘
误落风尘 2021-01-23 03:19

I want to extract text from word documents that were edited in \"Track Changes\" mode. I want to extract the inserted text and ignore the deleted text.

Running the below

相关标签:
3条回答
  • 2021-01-23 03:53

    I was having the same problem for years (maybe as long as this question existed).

    By looking at the code of "etienned" posted by @yiftah and the attributes of Paragraph, I have found a solution to retrieve the text after accepting the changes.

    The trick was to get p._p.xml to get the XML of the paragraph and then using "etienned" code on that (i.e retrieving all the <w:t> elements from the XML code, which contains both regular runs and <w:ins> blocks).

    Hope it can help the souls lost like I was:

    from docx import Document
    
    try:
        from xml.etree.cElementTree import XML
    except ImportError:
        from xml.etree.ElementTree import XML
    
    
    WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    TEXT = WORD_NAMESPACE + "t"
    
    
    def get_accepted_text(p):
        """Return text of a paragraph after accepting all changes"""
        xml = p._p.xml
        if "w:del" in xml or "w:ins" in xml:
            tree = XML(xml)
            runs = (node.text for node in tree.getiterator(TEXT) if node.text)
            return "".join(runs)
        else:
            return p.text
    
    
    doc = Document("Hello.docx")
    
    for p in doc.paragraphs:
        print(p.text)
        print("---")
        print(get_accepted_text(p))
        print("=========")
    
    0 讨论(0)
  • 2021-01-23 03:55

    the below code from Etienne worked for me, it's working directly with the document's xml (and not using python-docx)

    http://etienned.github.io/posts/extract-text-from-word-docx-simply/

    0 讨论(0)
  • 2021-01-23 04:11

    Not directly using python-docx; there's no API support yet for tracked changes/revisions.

    It's a pretty tricky job, which you'll discover if you search on the element names, perhaps 'open xml w:ins' for a start, that brings up this document as the first result: https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx

    If I needed to do something like that in a pinch I'd get the body element using:

    body = document._body._body
    

    and then use XPath on that to return the elements I wanted, something vaguely like this aircode:

    from docx.text.paragraph import Paragraph
    
    inserted_ps = body.xpath('./w:ins//w:p')
    for p in inserted_ps:
        paragraph = Paragraph(p, None)
        print(paragraph.text)
    

    You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.

    opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html

    0 讨论(0)
提交回复
热议问题