Reading .docx files in Python to find strikethrough, bullets and other formats

Question

Can anyone help me identify, in Python using python-docx, whether a paragraph in a .docx file contains text formatted with strikethrough (i.e. it appears but is crossed out), or starts with a bullet point? I am trying to write a script that identifies the structure of a document and parses its content.

So far I am able to read a .docx file and iterate over the paragraphs, picking out the runs that are bold.

from docx import Document

document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    print(p.text)
    for run in p.runs:
        # run.bold is True only when bold is applied directly to the run
        if run.bold:
            print('BOLD ' + run.text)

The rest eludes me for the moment.

Answer 1:

For strikethrough, you can just modify your example like so:

from docx import Document

document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    for run in p.runs:
        # font.strike is True when single strikethrough is set on the run
        if run.font.strike:
            print("STRIKE: " + run.text)

See the API docs for the Font object for more fun stuff you can check.
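As a rough sketch of what checking "more fun stuff" could look like, the helper below reports several Font flags at once. The names `FONT_FLAGS`, `active_font_flags` and `report_formatting` are made up for this example. Note that these properties are tri-state in python-docx: `None` means the value is inherited from the style, so text that is struck through only via its style would not be reported here.

```python
# Font properties that python-docx exposes as True / False / None.
FONT_FLAGS = (
    "bold", "italic", "strike", "double_strike",
    "all_caps", "small_caps", "subscript", "superscript",
)


def active_font_flags(run):
    """Return the names of the flags that are explicitly True on a run."""
    # None (inherited) and False are both treated as "not set" here.
    return [name for name in FONT_FLAGS if getattr(run.font, name, None)]


def report_formatting(path):
    """Print every run in the document that has at least one flag set."""
    # Imported here so active_font_flags() can be tried without python-docx.
    from docx import Document

    for paragraph in Document(path).paragraphs:
        for run in paragraph.runs:
            flags = active_font_flags(run)
            if flags:
                print(", ".join(flags) + ": " + run.text)
```

Calling `report_formatting(r'C:\stuff\Document.docx')` would print one line per formatted run, assuming the file exists and python-docx is installed.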

Answer 2:

You can stay with a native .docx parser rather than converting the file to HTML and using an HTML parser. Going by the python-docx docs on styles, a bulleted paragraph normally uses the built-in 'List Bullet' paragraph style, so you can check each paragraph's style name:

from docx import Document

document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    # p.style is the paragraph style applied to this paragraph
    if p.style.name == 'List Bullet':
        print("I'm a bullet: " + p.text)
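A check on the style name alone may miss paragraphs that were bulleted in Word without the 'List Bullet' style, since list formatting can also be attached directly to the paragraph as a `w:numPr` element. The sketch below is one possible fallback, not an official python-docx recipe: it reaches into the private `_p` element, which is an assumption about python-docx internals that could change between versions, and the helper names are invented for this example. A `w:numPr` element marks numbered lists as well as bulleted ones, so this only tells you the paragraph is some kind of list item.

```python
def looks_like_list_item(paragraph):
    """Guess whether a python-docx Paragraph is displayed as a list item."""
    style = paragraph.style
    style_name = style.name if style is not None and style.name else ""
    # Built-in bullet styles are named 'List Bullet', 'List Bullet 2', ...
    if style_name.startswith("List Bullet"):
        return True
    # Assumption: paragraph._p is the underlying <w:p> element and exposes
    # pPr / numPr attributes for <w:pPr> and <w:numPr> (None when absent).
    pPr = paragraph._p.pPr
    return pPr is not None and pPr.numPr is not None


def print_list_items(path):
    """Print the text of every paragraph that looks like a list item."""
    # Imported here so looks_like_list_item() can be tried without python-docx.
    from docx import Document

    for paragraph in Document(path).paragraphs:
        if looks_like_list_item(paragraph):
            print("LIST ITEM: " + paragraph.text)
```

Telling bullets apart from numbers would additionally require looking up the numbering definition the paragraph refers to, which this sketch does not attempt.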

Answer 3:

Following mkrieger1's suggestion, I would use Pandoc to convert the .docx to .html and parse the document from there.

Installing Pandoc takes about the same effort as installing python-docx, and the conversion from .docx to .html worked like a charm. In the .html output, the structure of the document I am parsing and all of its formatting elements are clearly marked up and thus easy to work with.
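A minimal sketch of that route, using only the standard library on the Python side. It assumes the `pandoc` executable is on your PATH, and that Pandoc's HTML output marks strikethrough as `<del>` and list items as `<li>`, which is worth confirming against your own converted file. `docx_to_html` and `FormatFinder` are names made up for this example.

```python
import subprocess
from html.parser import HTMLParser


def docx_to_html(docx_path):
    """Convert a .docx file with the pandoc CLI and return the HTML text."""
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "html", docx_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


class FormatFinder(HTMLParser):
    """Collect text found inside <del> and <li> elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.struck = []      # text fragments inside <del>
        self.list_items = []  # accumulated text of each <li>

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag == "li":
            self.list_items.append("")

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed void tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if "del" in self.open_tags:
            self.struck.append(data)
        if "li" in self.open_tags and self.list_items:
            self.list_items[-1] += data
```

Typical use would be `finder = FormatFinder()` followed by `finder.feed(docx_to_html(r'C:\stuff\Document.docx'))`, then reading `finder.struck` and `finder.list_items`. Nested lists are handled only crudely: text that follows a nested list inside its parent item is appended to the most recent item.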

Source: https://stackoverflow.com/questions/46646654/reading-docx-files-in-python-to-find-strikethrough-bullets-and-other-formats

Tags

python

pandoc

python-docx