Question
I need to retrieve the tables from a docx file, together with the paragraphs immediately before and after each one, but I can't see how to do this with python-docx.
I can get a list of paragraphs with document.paragraphs.
I can get a list of tables with document.tables.
How can I get an ordered list of document elements like this:
[
Paragraph1,
Paragraph2,
Table1,
Paragraph3,
Table2,
Paragraph4,
...
]?
Answer 1:
python-docx doesn't yet have API support for this; interestingly, the Microsoft Word API doesn't either.
But you can work around this with the following code. Note that it's a bit brittle because it relies on python-docx internals that are subject to change, but I expect it will work just fine for the foreseeable future:
#!/usr/bin/env python
# encoding: utf-8
"""
Testing iter_block_items()
"""
from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    # Walk the direct children in XML order and wrap each one in its proxy class.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx')
for block in iter_block_items(document):
    print('found one')
    print(block.text if isinstance(block, Paragraph) else '<table>')
There is some more discussion of this here:
https://github.com/python-openxml/python-docx/issues/276
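Once the blocks are available in document order, pairing each table with its surrounding paragraphs (what the question asks for) is a matter of scanning the list. The following is only a sketch: tables_with_neighbors and its is_table argument are hypothetical names, not part of python-docx, and it assumes "previous/next paragraph" means the nearest non-table block on each side. Plain strings stand in for Paragraph and Table objects so that it runs without a docx file.

```python
def tables_with_neighbors(blocks, is_table):
    """Yield (previous_paragraph, table, next_paragraph) for each table.

    `blocks` is any iterable of items in document order and `is_table` is a
    predicate that tells tables apart from paragraphs. A missing neighbor
    (table at the very start or end) is reported as None.
    """
    items = list(blocks)
    for i, block in enumerate(items):
        if not is_table(block):
            continue
        # Nearest non-table block before the table, if there is one.
        prev_par = next((b for b in reversed(items[:i]) if not is_table(b)), None)
        # Nearest non-table block after the table, if there is one.
        next_par = next((b for b in items[i + 1:] if not is_table(b)), None)
        yield prev_par, block, next_par


# Stand-in data: strings starting with "T" play the role of tables.
blocks = ["P1", "P2", "T1", "P3", "T2", "P4"]
result = list(tables_with_neighbors(blocks, lambda b: b.startswith("T")))
print(result)
```

With real documents, the same helper should work as tables_with_neighbors(iter_block_items(document), lambda b: isinstance(b, Table)).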
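Answer 2:
This was resolved in the pull request below, which adds a Document.story property containing the paragraphs and tables in document order. Check that your installed python-docx actually provides this property before relying on it: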
https://github.com/python-openxml/python-docx/pull/395
document = Document('test.docx')
document.story
Source: https://stackoverflow.com/questions/43637211/retrieve-document-content-with-document-structure-with-python-docx