How to iterate over everything in a python-docx document?

Submitted by 大憨熊 on 2019-11-30 08:36:10

Question


I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...

for para in doc.paragraphs:
    for run in para.runs:
        pass  # How to tell if this run has images or tables?

...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?

Thanks!


Answer 1:


There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.

python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you: https://github.com/python-openxml/python-docx/issues/40
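The code from the linked issue is not reproduced here. As a rough illustration of the same idea, the sketch below walks the direct children of a body element and reports paragraphs and tables in the order they occur. It uses only the standard library's ElementTree API on a small hand-written WordprocessingML fragment; the names iter_block_elements, paragraph_text and SAMPLE_BODY are hypothetical. It is an assumption, not something this answer states, that the same function could be pointed at python-docx's underlying lxml element (for example doc.element.body), since lxml elements support the same iteration calls.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by w:p, w:tbl, w:t and friends.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = f"{{{W_NS}}}p"
W_TBL = f"{{{W_NS}}}tbl"
W_T = f"{{{W_NS}}}t"


def iter_block_elements(container):
    """Yield (kind, element) for each paragraph or table that is a direct
    child of `container` (a w:body or w:tc element), in document order."""
    for child in container:
        if child.tag == W_P:
            yield "paragraph", child
        elif child.tag == W_TBL:
            yield "table", child
        # Anything else (e.g. w:sectPr) is skipped in this sketch.


def paragraph_text(p):
    """Concatenate the text of every w:t element inside a paragraph."""
    return "".join(t.text or "" for t in p.iter(W_T))


# Hypothetical sample: paragraph, table, paragraph, then section properties.
SAMPLE_BODY = ET.fromstring(
    f'<w:body xmlns:w="{W_NS}">'
    "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "<w:p><w:r><w:t>Outro</w:t></w:r></w:p>"
    "<w:sectPr/>"
    "</w:body>"
)

kinds = [kind for kind, _ in iter_block_elements(SAMPLE_BODY)]
top_level_text = [
    paragraph_text(el)
    for kind, el in iter_block_elements(SAMPLE_BODY)
    if kind == "paragraph"
]
print(kinds)
print(top_level_text)
```

This version is deliberately flat: it does not descend into table cells, so the paragraph inside the sample table is not reported.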

I don't know of an exact counterpart for inline items, but I expect you could get pretty far with paragraph.runs, since all inline content lives within a paragraph. If you got most of the way there and were just stuck on getting pictures or something similar, you could go down to the lxml level and decode some of the XML to get what you need. If you get that far and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide some similar code to get you what you need.
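To make "decode some of the XML" a little more concrete, here is a minimal sketch of what an inline iterator might look like at the XML level. It is not the Paragraph.iter_inline_items() feature mentioned above, which this answer only proposes; the function name iter_inline_items and the SAMPLE_P fragment are hypothetical. The sketch assumes inline pictures show up as a w:drawing child of a run (legacy w:pict pictures are ignored) and only looks at runs that are direct children of the paragraph. Tables never show up here because a table is a block-level sibling of the paragraph, not something contained in a run.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace in ElementTree's "{uri}" prefix form.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def iter_inline_items(p):
    """Yield ("text", str) or ("picture", element) for the content of each
    run that is a direct child of paragraph element `p`, in order."""
    for run in p.findall(W + "r"):  # direct child runs only
        for child in run:
            if child.tag == W + "t":
                yield "text", child.text or ""
            elif child.tag == W + "drawing":
                # Assumption: a w:drawing in a run marks a picture or shape.
                yield "picture", child


# Hypothetical paragraph: text, then a (stub) drawing, then more text.
SAMPLE_P = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}">'
    "<w:r><w:t>Before </w:t></w:r>"
    "<w:r><w:drawing/></w:r>"
    "<w:r><w:t> after</w:t></w:r>"
    "</w:p>"
)

items = list(iter_inline_items(SAMPLE_P))
kinds = [kind for kind, _ in items]
print(kinds)
```

With python-docx, the underlying lxml element of a paragraph is available as paragraph._p, and it accepts the same findall and iteration calls, so the function could in principle be applied to it unchanged.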

This requirement comes up from time to time so we'll definitely want to add it at some point.

Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
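A recursive walk of that kind could be sketched as follows. This is an illustration on raw WordprocessingML using the standard library, not code from python-docx; walk_blocks and SAMPLE_BODY are hypothetical names, and the sample nests a table inside a table cell to exercise the recursion.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace in ElementTree's "{uri}" prefix form.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def walk_blocks(container, depth=0):
    """Recursively yield (depth, kind, element) for every paragraph and
    table under `container`, descending into table cells."""
    for child in container:
        if child.tag == W + "p":
            yield depth, "paragraph", child
        elif child.tag == W + "tbl":
            yield depth, "table", child
            # Only this table's own rows and cells are visited here; a
            # nested table is reached by the recursive call on its cell.
            for row in child.findall(W + "tr"):
                for cell in row.findall(W + "tc"):
                    yield from walk_blocks(cell, depth + 1)


# Hypothetical sample: a paragraph, then a table whose single cell holds a
# paragraph, a nested table, and the trailing paragraph a cell must end with.
SAMPLE_BODY = ET.fromstring(
    f'<w:body xmlns:w="{W_NS}">'
    "<w:p><w:r><w:t>Top</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc>"
    "<w:p><w:r><w:t>Outer cell</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc>"
    "<w:p><w:r><w:t>Inner cell</w:t></w:r></w:p>"
    "</w:tc></w:tr></w:tbl>"
    "<w:p/>"
    "</w:tc></w:tr></w:tbl>"
    "</w:body>"
)

outline = [(depth, kind) for depth, kind, _ in walk_blocks(SAMPLE_BODY)]
for depth, kind in outline:
    print("  " * depth + kind)
```

Yielding the depth alongside each element keeps the generator flat for the caller while still letting an HTML converter know when to open or close a nested table.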




Answer 2:


Assuming doc is of type Document, then what you want to do is have 3 separate iterations:

  • One for the paragraphs, as you have in your code
  • One for the tables, via doc.tables
  • One for the shapes, via doc.inline_shapes

The reason your code wasn't working is that paragraphs don't hold references to the tables or shapes in the document; those are stored on the Document object.
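The three passes could look roughly like the sketch below. The function collect_by_kind is a hypothetical helper, and the fake_doc object is only a stand-in so that the example runs without a .docx file; it mimics the attributes used here (paragraphs, tables with rows and cells, inline_shapes with width and height). Keep in mind that separate passes do not preserve the relative order of paragraphs, tables and shapes, and that doc.paragraphs covers only the paragraphs at the top level of the body.

```python
from types import SimpleNamespace


def collect_by_kind(doc):
    """Make three independent passes over a python-docx style document:
    paragraph text, table cell text, and inline shape sizes."""
    paragraphs = [para.text for para in doc.paragraphs]
    tables = [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in doc.tables
    ]
    shape_sizes = [(shape.width, shape.height) for shape in doc.inline_shapes]
    return paragraphs, tables, shape_sizes


# Stand-in for a real Document (assumption: same attribute names as
# python-docx). Sizes are in EMU; 914400 EMU is one inch.
fake_doc = SimpleNamespace(
    paragraphs=[SimpleNamespace(text="Intro"), SimpleNamespace(text="Outro")],
    tables=[
        SimpleNamespace(
            rows=[
                SimpleNamespace(
                    cells=[SimpleNamespace(text="A1"), SimpleNamespace(text="B1")]
                )
            ]
        )
    ],
    inline_shapes=[SimpleNamespace(width=914400, height=457200)],
)

paragraphs, tables, shape_sizes = collect_by_kind(fake_doc)
print(paragraphs, tables, shape_sizes)
```

With python-docx installed, the same function should accept a real document, for example collect_by_kind(Document("input.docx")) after "from docx import Document", where input.docx is a placeholder path.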

Here is the documentation for more info: python-docx



Source: https://stackoverflow.com/questions/25130957/how-to-iterate-over-everything-in-a-python-docx-document
