Reading .docx files in Python to find strikethrough, bullets and other formats

别等时光非礼了梦想. Submitted on 2019-12-11 05:44:50

Question


Can anyone help me identify, in Python using python-docx, whether a paragraph in a .docx file contains text that is formatted with strikethrough (i.e. it appears but is crossed out), or has a bullet point at the start? I am trying to write a script to identify the structure of a document and parse its content.

So far I am able to read a .docx file and iterate over the paragraphs, identifying paragraphs that are bold.

from docx import Document

document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    print(p.text)
    for run in p.runs:
        # run.bold is True only when bold is applied directly to the run
        if run.bold:
            print('BOLD ' + run.text)

The rest eludes me for the moment.


Answer 1:


For strikethrough, you can just modify your example like so:

from docx import Document

document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    for run in p.runs:
        # font.strike is True when strikethrough is applied directly to the run
        if run.font.strike:
            print("STRIKE: " + run.text)

See the API docs for the Font object for more fun stuff you can check.
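One detail worth knowing here: font.strike is tri-state in python-docx. True and False mean the setting is applied directly to the run, while None means the value is inherited from the style hierarchy, so a plain truthiness test only catches strikethrough applied directly to the run. Below is a minimal sketch that also checks double_strike. The names struck_runs and report_strikes are made up for illustration, and the helper relies only on the runs, text, and font attributes.

```python
def struck_runs(paragraph):
    """Return the text of each run with strikethrough applied directly.

    font.strike and font.double_strike are tri-state: True, False, or None.
    None means "inherited from the style", which this sketch does not resolve.
    """
    return [
        run.text
        for run in paragraph.runs
        if run.font.strike is True or run.font.double_strike is True
    ]


def report_strikes(path):
    """Print the struck-through text of every paragraph in the file at path."""
    from docx import Document  # python-docx; only needed by this function

    for index, paragraph in enumerate(Document(path).paragraphs):
        struck = struck_runs(paragraph)
        if struck:
            print(f"paragraph {index}: STRIKE: {''.join(struck)}")
```

Calling report_strikes(r'C:\stuff\Document.docx') would print one line per paragraph that contains struck-through runs. Text that is struck through only because of its style would need a separate check on the style's font.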




Answer 2:


This can be done with python-docx itself, rather than converting the document to HTML and using an HTML parser. Per the python-docx docs:

from docx import Document
from docx.enum.style import WD_STYLE_TYPE

document = Document(r'C:\stuff\Document.docx')
styles = document.styles
# Keep only the paragraph styles defined in the document
paragraph_styles = [
    s for s in styles if s.type == WD_STYLE_TYPE.PARAGRAPH
]
for style in paragraph_styles:
    if style.name == 'List Bullet':
        print("I'm a bullet")
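
The loop above walks the styles defined in the document. To tell whether a particular paragraph is a bullet, the check has to be made on each paragraph instead. Here is a rough sketch under two assumptions: paragraphs using the built-in "List Bullet" family have a style name starting with "List Bullet", and bullets applied with Word's toolbar button usually keep another style (often "List Paragraph") but carry a w:numPr numbering element. The second check goes through paragraph._p, which is a private python-docx attribute rather than public API, and w:numPr marks numbered lists as well as bulleted ones. The names list_kind and report_lists are hypothetical.

```python
def list_kind(paragraph):
    """Classify a python-docx paragraph as 'bullet', 'list', or None."""
    style_name = paragraph.style.name or ""
    if style_name.startswith("List Bullet"):
        return "bullet"  # built-in "List Bullet", "List Bullet 2", ...

    # Private API (assumption): _p is the underlying <w:p> element; pPr and
    # numPr are None when the paragraph has no direct numbering properties.
    pPr = paragraph._p.pPr
    if pPr is not None and pPr.numPr is not None:
        return "list"  # bulleted or numbered; telling them apart needs numbering.xml
    return None


def report_lists(path):
    """Print every paragraph in the file at path that looks like a list item."""
    from docx import Document  # python-docx; only needed by this function

    for paragraph in Document(path).paragraphs:
        kind = list_kind(paragraph)
        if kind:
            print(f"{kind.upper()}: {paragraph.text}")
```

Checking the style name first keeps the common case on public API; the numPr fallback is there only for lists that were not created through a list style.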



Answer 3:


Following up on mkrieger1's suggestion, I would recommend using Pandoc to convert the .docx to .html and parsing the document from there.

Installing Pandoc takes about the same effort as installing python-docx, and the conversion from .docx to .html worked like a charm. In the .html, the structure of the document I am parsing and all of its formatting elements are completely clear, and thus easy to work with.
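
As a rough illustration of this route: Pandoc's HTML output marks struck-through text with del elements and bulleted items as li elements inside ul, so a small standard-library parser can pick them out. The sketch below assumes the pandoc executable is on the PATH; docx_to_html and FormatCollector are hypothetical names, not part of Pandoc.

```python
import subprocess
from html.parser import HTMLParser


def docx_to_html(path):
    """Convert a .docx file to an HTML fragment by calling the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "html", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


class FormatCollector(HTMLParser):
    """Collect struck-through text (<del>) and bullet items (<li> in <ul>)."""

    def __init__(self):
        super().__init__()
        self.struck, self.bullets = [], []
        self._lists = []      # stack of open list tags: "ul" or "ol"
        self._items = []      # stack of text buffers for open bullet items
        self._del_depth = 0   # > 0 while inside a <del> element

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._lists.append(tag)
        elif tag == "li" and self._lists and self._lists[-1] == "ul":
            self._items.append([])
        elif tag == "del":
            self._del_depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._lists:
            self._lists.pop()
        elif tag == "li" and self._items and self._lists and self._lists[-1] == "ul":
            self.bullets.append("".join(self._items.pop()).strip())
        elif tag == "del" and self._del_depth:
            self._del_depth -= 1

    def handle_data(self, data):
        if self._del_depth:
            self.struck.append(data)
        if self._items:
            self._items[-1].append(data)
```

To use it, create a FormatCollector, pass the result of docx_to_html(r'C:\stuff\Document.docx') to its feed method, and then read the struck and bullets lists.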



Source: https://stackoverflow.com/questions/46646654/reading-docx-files-in-python-to-find-strikethrough-bullets-and-other-formats
