Missing document text when using python-docx

Submitted by 人走茶凉 on 2019-12-08 06:15:33

Question


I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.

I'm having a problem where not all of the document's text appears when iterating over doc.paragraphs.

For debugging, I tried:

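from docx import Document

doc = Document(input_file)  # input_file: path to the .docx file
fullText = []
for para in doc.paragraphs:
    # Collect the text of each paragraph
    fullText.append(para.text)
print('\n'.join(fullText))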

This only seems to print about half of the file's contents.

There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?

Edit: the missing text is contained within a mail merge field, if that makes any difference.


Answer 1:


The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not, so it only reports paragraphs (and tables) that sit at the "top" level of the document body. Text wrapped in another container, such as a mail merge field, is skipped.

This is also a limitation when it comes to revision marks, for example, which have two or more pieces of text of which only one appears, depending on the revision marks setting (show original, show latest after edits, etc.).

The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx can be handy, like Paragraph, etc. once you've gotten hold of the elements you want.
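As a rough illustration of "navigating the XML yourself", here is a minimal sketch, not a definitive implementation. It assumes the missing text lives in ordinary w:t elements nested inside containers such as w:fldSimple or w:sdt; that is a guess about the asker's file, not something confirmed in the question. The names iter_paragraph_texts and read_all_text are made up for this example. The only python-docx pieces used are Document and doc.element.body.

```python
# Sketch only: walk every w:p element under the body, however deeply nested,
# and join the text of all w:t elements found inside each one.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = "{%s}p" % W_NS  # paragraph element
W_T = "{%s}t" % W_NS  # text element (deleted revision text uses w:delText, so it is skipped)


def iter_paragraph_texts(root):
    """Yield the text of each w:p found anywhere under root, in document order.

    Caveats: tabs and line breaks (w:tab, w:br) are not rendered, and a
    paragraph nested inside another paragraph (e.g. in a text box) is
    yielded on its own as well as being included in its parent's text.
    """
    for p in root.iter(W_P):
        yield "".join(t.text or "" for t in p.iter(W_T))


def read_all_text(path):
    """Hypothetical helper: return all paragraph text in a .docx file."""
    # Imported here so iter_paragraph_texts itself has no third-party dependency.
    from docx import Document

    doc = Document(path)
    # doc.element is the w:document element; .body is its w:body child.
    return "\n".join(iter_paragraph_texts(doc.element.body))
```

With the question's variable, print(read_all_text(input_file)) would replace the debugging loop. The sketch reads w:t elements directly instead of wrapping each element in Paragraph because, as far as I can tell, Paragraph.text only looks at runs that are direct children of the paragraph, so it would likely still miss text inside a field container.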



Source: https://stackoverflow.com/questions/48350116/missing-document-text-when-using-python-docx
