Extracting .docx data, images and structure

Question

Good day SO,

I have a task where I need to extract specific parts of a document template (for automation purposes). I am able to traverse the document and know my current position during traversal (by checking for regex matches, keywords, etc.), but I am unable to:

Extract the structure of the document
Detect images that sit in between text

Am I able to obtain, for example, an array of the structure of the document below?

['Paragraph1','Paragraph2','Image1','Image2','Paragraph3','Paragraph4','Image3','Image4']

My current implementation is shown below:

from docx import Document

document = Document('demo.docx')

text = []

for x in document.paragraphs:
    if x.text != '':
        text.append(x.text)

Using the code above, I can obtain all the text from the document, but I cannot detect the type of text (heading or normal) and I cannot detect any images. I am currently using python-docx.

My main problem is to obtain the position of the image within the document (i.e. between paragraphs) so that I can re-create another document, using text and images extracted. This task requires me to know where the image appears in the document, and where to insert the image in the new document.

Any help is greatly appreciated, thank you :)

Answer 1:

To tell headings apart from normal paragraphs, you can use the style objects built into python-docx. Check this code.

from docx import Document
document = Document('demo.docx')
text  = []
style = []
for x in document.paragraphs:
    if x.text != '':
        style.append(x.style.name)
        text.append(x.text)

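With x.style.name you get the name of the style applied to each paragraph (for example a heading style or Normal), so the style list lines up index by index with the text list.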
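As a rough illustration of how the style names could be turned into the Header/Normal distinction the question asks about, here is a minimal sketch. The helper names kind_of and labelled_paragraphs are hypothetical, and the sketch assumes the document uses Word's built-in English style names ('Title', 'Heading 1', 'Heading 2', ...); custom or localized styles would need a different rule.

```python
from types import SimpleNamespace  # only used for the stand-in demo below


def kind_of(style_name):
    """Map a paragraph style name to 'Header' or 'Normal'."""
    # Assumption: built-in English style names such as 'Title' and 'Heading 1'.
    if style_name == 'Title' or style_name.startswith('Heading'):
        return 'Header'
    return 'Normal'


def labelled_paragraphs(paragraphs):
    """Return (kind, text) pairs for every non-empty paragraph."""
    return [(kind_of(p.style.name), p.text) for p in paragraphs if p.text != '']


# With python-docx this would be called as labelled_paragraphs(document.paragraphs).
# Stand-in objects keep the sketch runnable without a .docx file:
demo = [
    SimpleNamespace(text='Introduction', style=SimpleNamespace(name='Heading 1')),
    SimpleNamespace(text='', style=SimpleNamespace(name='Normal')),
    SimpleNamespace(text='Some body text.', style=SimpleNamespace(name='Normal')),
]
print(labelled_paragraphs(demo))
```

The function only relies on the .text and .style.name attributes, so it accepts document.paragraphs from python-docx unchanged.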

The paragraph text and style above give you no information about images. For that, you need to parse the XML. You can inspect the XML output with:

for elem in document.element.iter():
    print(elem.tag)

Let me know if you need anything else.

To extract each image's name and its position relative to the surrounding text, use this:

W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC_NS = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'

tags = []
text = []
# Elements are visited in document order, so both lists preserve position.
for t in document.element.iter():
    if t.tag == PIC_NS + 'cNvPr':
        # pic:cNvPr carries the picture's name attribute
        print('Picture Found: ', t.attrib['name'])
        tags.append('Picture')
        text.append(t.attrib['name'])
    elif t.tag in (W_NS + 'r', W_NS + 't', W_NS + 'drawing') and t.text:
        tags.append('text')
        text.append(t.text)

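Both lists are filled in document order, so for any picture you can look up the text before and after it in the text list, and the kind of each entry at the same index in the tags list.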
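To get closer to the ordered array the question asks for, the same idea can be applied paragraph by paragraph. The following is a sketch only: it swaps python-docx for the standard library (zipfile and xml.etree.ElementTree), the function names structure_from_xml and structure_from_docx are hypothetical, and it assumes the main document part lives at word/document.xml and that pictures are DrawingML pictures carrying a pic:cNvPr element. Legacy VML images are not detected, and text inside text boxes may be counted twice.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC_NS = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'


def structure_from_xml(xml_data):
    """Return [(kind, value), ...] in document order.

    kind is 'Paragraph' (value = text) or 'Image' (value = picture name).
    """
    root = ET.fromstring(xml_data)
    items = []
    for para in root.iter(W_NS + 'p'):
        pieces = []  # text collected since the last picture in this paragraph
        for node in para.iter():
            if node.tag == W_NS + 't' and node.text:
                pieces.append(node.text)
            elif node.tag == PIC_NS + 'cNvPr':
                # Flush text seen before the picture so the order is preserved.
                if ''.join(pieces).strip():
                    items.append(('Paragraph', ''.join(pieces)))
                pieces = []
                items.append(('Image', node.get('name', '')))
        if ''.join(pieces).strip():
            items.append(('Paragraph', ''.join(pieces)))
    return items


def structure_from_docx(path):
    """Read word/document.xml from a .docx archive and return its structure."""
    # Assumption: the main part is stored at word/document.xml.
    with zipfile.ZipFile(path) as archive:
        return structure_from_xml(archive.read('word/document.xml'))


# Tiny hand-written fragment standing in for a real document.xml:
sample = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"><w:body>'
    '<w:p><w:r><w:t>Paragraph1</w:t></w:r></w:p>'
    '<w:p><w:r><w:drawing><pic:pic><pic:nvPicPr>'
    '<pic:cNvPr id="0" name="Image1"/>'
    '</pic:nvPicPr></pic:pic></w:drawing></w:r></w:p>'
    '<w:p><w:r><w:t>Paragraph2</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
print(structure_from_xml(sample))
```

For a real file, structure_from_docx('demo.docx') would return the same kind of list, which can then be replayed in order when building the new document.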

Source: https://stackoverflow.com/questions/57554398/extracting-docx-data-images-and-structure

Tags

python

python-docx