Retrieve document content with document structure with python-docx

Posted by 梦想与她 on 2019-12-02 05:18:58

Question


I need to retrieve the tables in a docx file together with the paragraphs immediately before and after each one, but I can't work out how to do this with python-docx.

I can get a list of paragraphs with document.paragraphs.

I can get a list of tables with document.tables.

How can I get a single list of the document's elements in document order, like this:

[
Paragraph1,
Paragraph2,
Table1,
Paragraph3,
Table3,
Paragraph4,
...
]?

Answer 1:


python-docx doesn't yet have API support for this; interestingly, the Microsoft Word API doesn't either.

But you can work around this with the following code. Note that it's a bit brittle because it makes use of python-docx internals that are subject to change, but I expect it will work just fine for the foreseeable future:

#!/usr/bin/env python
# encoding: utf-8

"""
Testing iter_block_items()
"""

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
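    if isinstance(parent, _Document):
        # Top-level document: block items are children of <w:body>.
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        # Table cell: block items are children of <w:tc>.
        parent_elm = parent._tc
    else:
        raise ValueError("parent must be a Document or _Cell object")

    # Wrap each paragraph or table element in its proxy object; any other
    # child element (e.g. section properties) is skipped.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)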


document = Document('test.docx')
for block in iter_block_items(document):
    print('found one')
    print(block.text if isinstance(block, Paragraph) else '<table>')
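
Once the blocks are available in document order, the original request (each table with the paragraph before and after it) is a short step further. The following is only a sketch under stated assumptions: the helper name tables_with_neighbors and its is_table predicate are hypothetical and not part of python-docx, and a neighbor is reported as None when it is missing or is itself a table.

```python
def tables_with_neighbors(blocks, is_table):
    """Yield (before, table, after) for every table in *blocks*.

    *blocks* is any iterable of block items in document order and
    *is_table* is a predicate that returns True for table items.
    Hypothetical helper, not part of python-docx.
    """
    blocks = list(blocks)  # accept generators as well as lists
    for i, block in enumerate(blocks):
        if not is_table(block):
            continue
        before = blocks[i - 1] if i > 0 else None
        after = blocks[i + 1] if i + 1 < len(blocks) else None
        # Report a neighbor only when it is a paragraph, not another table.
        if before is not None and is_table(before):
            before = None
        if after is not None and is_table(after):
            after = None
        yield before, block, after


# Stand-in data so the sketch runs without a .docx file.
demo_blocks = ["P1", "P2", "T1", "P3", "T2", "P4"]
for before, table, after in tables_with_neighbors(
    demo_blocks, lambda b: b.startswith("T")
):
    print(before, table, after)
```

With the generator above, the call would presumably be tables_with_neighbors(iter_block_items(document), lambda b: isinstance(b, Table)).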

There is some more discussion of this here:
https://github.com/python-openxml/python-docx/issues/276
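
For context on why iter_block_items() works: inside the .docx package, paragraphs (w:p) and tables (w:tbl) are sibling children of the w:body element, stored in document order. The sketch below reads that order with the standard library only, so it does not depend on python-docx internals. The function names are hypothetical, and it assumes the main document part lives at word/document.xml, which is the usual location but is not guaranteed for every file.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, normally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def body_block_tags(document_xml):
    """Return "p" or "tbl" for each paragraph/table child of w:body, in order.

    Hypothetical helper for illustration; other children of the body
    (such as w:sectPr) are skipped, as in iter_block_items().
    """
    root = ET.fromstring(document_xml)
    body = root.find("{%s}body" % W_NS)
    if body is None:
        return []
    tags = []
    for child in body:
        if child.tag == "{%s}p" % W_NS:
            tags.append("p")
        elif child.tag == "{%s}tbl" % W_NS:
            tags.append("tbl")
    return tags


def docx_block_tags(path):
    """Read the body block order from a .docx file.

    Assumes the main part is stored at word/document.xml.
    """
    with zipfile.ZipFile(path) as archive:
        return body_block_tags(archive.read("word/document.xml"))
```

This only reports the sequence of element types; to work with the content itself, the Paragraph and Table objects produced by iter_block_items() remain more convenient.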




Answer 2:


This is resolved in the following pull request as the property Document.story, which contains the paragraphs and tables in document order:

https://github.com/python-openxml/python-docx/pull/395

document = Document('test.docx')
document.story


Source: https://stackoverflow.com/questions/43637211/retrieve-document-content-with-document-structure-with-python-docx
