How to extract text inserted with track-changes in python-docx

Posted by 情到浓时终转凉″ on 2019-12-02 16:26:44

Question


I want to extract text from Word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.

Running the code below, I saw that paragraphs inserted in "Track Changes" mode return an empty Paragraph.text:

import docx

doc = docx.Document('C:\\test track changes.docx')

for para in doc.paragraphs:
    print(para)
    print(para.text)

Is there a way to retrieve the text in revision inserts (w:ins elements)?

I'm using python-docx 0.8.6, lxml 3.4.0, Python 3.4, Windows 7.

Thanks


Answer 1:


Not directly using python-docx; there's no API support yet for tracked changes/revisions.

It's a pretty tricky job, as you'll discover if you search on the element names. Searching for 'open xml w:ins', for a start, brings up this document as the first result: https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx

If I needed to do something like that in a pinch, I'd get the body element using:

body = document._body._body

and then use XPath on that to return the elements I wanted, something vaguely like this aircode:

from docx.text.paragraph import Paragraph

inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)

You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.
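
As one possible starting point for that expression, here is a minimal sketch using only the standard library rather than python-docx. It assumes the common case where w:ins wraps runs inside a paragraph (w:p/w:ins/w:r/w:t), not whole paragraphs. The SAMPLE string and the inserted_text helper are made up for illustration; a real document may nest things differently, so inspect its XML first.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical sample paragraph: a normal run, a tracked insertion and a
# tracked deletion, laid out the way Word commonly writes them.
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:ins w:id="1" w:author="A"><w:r><w:t>new </w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="A"><w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p>' % W_NS
)


def inserted_text(xml):
    """Return the text of every w:t that sits somewhere inside a w:ins."""
    root = ET.fromstring(xml)
    return [t.text for t in root.findall(".//w:ins//w:t", NS) if t.text]


print(inserted_text(SAMPLE))
```

The same path, './/w:ins//w:t', is what I would try first with the xpath() call on the body element, but treat that as a guess until you have checked it against your own file.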

opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package: http://opc-diag.readthedocs.io/en/latest/index.html




Answer 2:


The code below from Etienne worked for me. It works directly with the document's XML (and does not use python-docx):

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
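
For readers who cannot follow the link, here is a rough sketch of the general idea, written from scratch and not copied from that post. It treats the .docx as a zip archive, parses word/document.xml, and joins the w:t nodes of each paragraph. The function name docx_paragraph_texts and the in-memory demo document are assumptions for illustration only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraph_texts(file):
    """Return one string per w:p, built from its w:t nodes only.

    `file` is a path or a binary file-like object holding a .docx package.
    """
    with zipfile.ZipFile(file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    texts = []
    for para in root.iter(W + "p"):
        # w:t covers normal and inserted runs; deleted runs use w:delText,
        # so they are not picked up here.
        texts.append("".join(t.text or "" for t in para.iter(W + "t")))
    return texts


# Demo on a tiny hand-made package (not a complete, Word-openable file).
DOC_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p>'
    '<w:r><w:t>kept </w:t></w:r>'
    '<w:ins><w:r><w:t>added</w:t></w:r></w:ins>'
    '<w:del><w:r><w:delText> removed</w:delText></w:r></w:del>'
    '</w:p></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", DOC_XML)
buffer.seek(0)
print(docx_paragraph_texts(buffer))
```

This only reads the main document part. Headers, footers, footnotes and comments live in other parts of the package and would need the same treatment.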




Answer 3:


I had the same problem for years (maybe for as long as this question has existed).

By looking at the "etienned" code posted by @yiftah and at the attributes of Paragraph, I found a way to retrieve the text as it would read after accepting the changes.

The trick is to get the paragraph's XML from p._p.xml and then apply the "etienned" approach to it, i.e. retrieve all the <w:t> elements, which covers both regular runs and runs inside <w:ins> blocks. Deleted text is stored in <w:delText> elements rather than <w:t>, so it is left out.

Hope it helps souls as lost as I was:

from docx import Document

# cElementTree and Element.getiterator() were removed in Python 3.9;
# plain ElementTree and Element.iter() work on every Python 3 version.
from xml.etree.ElementTree import XML


WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"


def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # Collect every <w:t>, including those nested inside <w:ins>;
        # deleted text is stored in <w:delText> and is therefore skipped.
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text


doc = Document("Hello.docx")

for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")


Source: https://stackoverflow.com/questions/38247251/how-to-extract-text-inserted-with-track-changes-in-python-docx
