Question
I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.
I'm having a problem where not all of the document's text appears when iterating over doc.paragraphs.
For debugging I have tried:
from docx import Document

# input_file is the path to the .docx file being read
doc = Document(input_file)
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
print('\n'.join(fullText))
This only seems to print about half of the file's contents.
There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?
Edit: the missing text is contained within a mail merge field, if that makes any difference.
Answer 1:
The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not. So it only reports paragraphs (and tables) that are at the "top" level.
This is also a limitation when it comes to revision marks, for example, which have two or more pieces of text of which only one appears, depending on the revision marks setting (show original, show latest after edits, etc.).
The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx, like Paragraph, can be handy once you've gotten hold of the elements you want.
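As a rough illustration of navigating the XML yourself, here is a minimal sketch. It assumes the missing text sits in paragraphs or runs nested inside wrapper elements (for example w:fldSimple or w:sdt), which may not match every document. The helper names iter_paragraph_texts and read_all_text are made up for this example and are not part of python-docx. The sketch walks every w:p element at any depth and joins the w:t text found beneath it.

```python
# WordprocessingML namespace, written out so the helper works on any
# ElementTree-style element (lxml or xml.etree).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = "{%s}p" % W_NS  # paragraph element
W_T = "{%s}t" % W_NS  # text element inside a run


def iter_paragraph_texts(root):
    """Yield the text of every w:p under root, however deeply it is nested."""
    for p in root.iter(W_P):
        # Collect w:t at any depth, so runs wrapped in a field are included.
        yield "".join(t.text or "" for t in p.iter(W_T))


def read_all_text(path):
    """Hypothetical convenience wrapper: all paragraph text in a .docx file."""
    # Imported here so the helper above stays usable without python-docx.
    from docx import Document

    body = Document(path).element.body  # the underlying w:body XML element
    return "\n".join(iter_paragraph_texts(body))
```

Calling print(read_all_text(input_file)) should then include text that doc.paragraphs skips. A few caveats: only w:t is read, so tabs and line breaks are dropped and deleted revision text (w:delText) is ignored, and a paragraph nested inside another paragraph (as in a text box) would be reported twice.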
Source: https://stackoverflow.com/questions/48350116/missing-document-text-when-using-python-docx