Question
Can anyone help me identify, in Python using python-docx, whether a paragraph in a .docx file contains text formatted with strikethrough (i.e., it appears but is crossed out) or has a bullet point at the start? I am trying to write a script that identifies the structure of a document and parses its content.
So far I am able to read a .docx file and iterate over the paragraphs, identifying paragraphs that are bold.
from docx import Document
document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    print(p.text)
    for run in p.runs:
        # run.bold is True only when bold is set directly on the run
        if run.bold:
            print('BOLD ' + run.text)
The rest eludes me for the moment.
Answer 1:
For strikethrough, you can just modify your example like so:
from docx import Document
document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    for run in p.runs:
        # font.strike is True when strikethrough is set directly on the run
        if run.font.strike:
            print("STRIKE: " + run.text)
See the API docs for the Font object for more fun stuff you can check.
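One thing to keep in mind: `font.strike` is tri-state in python-docx (`True`, `False`, or `None` when the value is inherited from the style hierarchy), so a run that is struck through only via its style reports `None`. Below is a minimal sketch that gathers the directly struck text of each paragraph and also checks `font.double_strike`. The helper names `is_struck`, `struck_text` and `report_strikethrough` are made up for illustration, and inherited strikethrough is deliberately ignored.

```python
def is_struck(run):
    """True if single or double strikethrough is set directly on the run."""
    font = run.font
    # Both properties are tri-state (True / False / None). None means the
    # value is inherited; this sketch treats that as "not struck".
    return bool(font.strike) or bool(font.double_strike)


def struck_text(paragraph):
    """Concatenate the text of all directly struck-through runs."""
    return ''.join(run.text for run in paragraph.runs if is_struck(run))


def report_strikethrough(path):
    # Imported here so the helpers above can be used on their own.
    from docx import Document

    for p in Document(path).paragraphs:
        text = struck_text(p)
        if text:
            print('STRIKE: ' + text)
```

Calling `report_strikethrough(r'C:\stuff\Document.docx')` prints one line per paragraph that contains struck text.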
Answer 2:
You can do this with a native Word .docx parser, rather than converting the file to HTML and using an HTML parser. Per the python-docx docs:
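from docx import Document
from docx.enum.style import WD_STYLE_TYPE
document = Document(r'C:\stuff\Document.docx')
# Paragraph styles defined in the document (not the paragraphs themselves)
styles = document.styles
paragraph_styles = [
    s for s in styles if s.type == WD_STYLE_TYPE.PARAGRAPH
]
for style in paragraph_styles:
    if style.name == 'List Bullet':
        print("I'm a bullet")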
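The loop above walks the styles defined in the document, not the paragraphs. To flag individual paragraphs, one option is to look at each paragraph's own style name. This is only a sketch, and it assumes the bullets were applied with Word's built-in "List Bullet" styles ("List Bullet", "List Bullet 2", and so on). Bullets applied as direct formatting on a paragraph with some other style would not be caught. `is_bullet` and `bullet_paragraphs` are hypothetical names.

```python
def is_bullet(paragraph):
    """True if the paragraph uses one of the built-in 'List Bullet' styles."""
    style = paragraph.style
    name = style.name if style is not None else None
    # Matches 'List Bullet', 'List Bullet 2', 'List Bullet 3', ...
    return bool(name) and name.startswith('List Bullet')


def bullet_paragraphs(path):
    """Return the text of every paragraph that is_bullet() accepts."""
    # Imported here so is_bullet() can be used on its own.
    from docx import Document

    return [p.text for p in Document(path).paragraphs if is_bullet(p)]
```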
Answer 3:
Following the suggestion from mkrieger1, I would suggest using Pandoc to convert the .docx to .html and parsing the document from there.
Installing Pandoc takes the same effort as installing python-docx, and the conversion from .docx to .html worked like a charm. In the .html, the structure of the document I am parsing and all of its formatting elements are absolutely clear and thus easy to work with.
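A rough sketch of that route, for anyone who wants to try it: call the `pandoc` command-line tool to produce HTML, then scan the result with the standard library's `html.parser`. It assumes `pandoc` is on the PATH, and that struck-through text comes out as `<del>` elements and bullets as `<li>` items inside `<ul>`. Check the HTML your Pandoc version actually produces. The class and function names are hypothetical.

```python
import subprocess
from html.parser import HTMLParser


class FormatScanner(HTMLParser):
    """Collect text inside <del> elements and inside <li> items of a <ul>."""

    def __init__(self):
        super().__init__()
        self.struck = []         # text chunks found inside <del>
        self._bullet_bufs = []   # one text buffer per bullet, in document order
        self._del_depth = 0
        self._list_stack = []    # 'ul' or 'ol' for each open list
        self._item_stack = []    # buffer for each open <li>, None for <ol> items

    def handle_starttag(self, tag, attrs):
        if tag == 'del':
            self._del_depth += 1
        elif tag in ('ul', 'ol'):
            self._list_stack.append(tag)
        elif tag == 'li':
            if self._list_stack and self._list_stack[-1] == 'ul':
                buf = []
                self._bullet_bufs.append(buf)
                self._item_stack.append(buf)
            else:
                self._item_stack.append(None)

    def handle_endtag(self, tag):
        if tag == 'del' and self._del_depth:
            self._del_depth -= 1
        elif tag in ('ul', 'ol') and self._list_stack:
            self._list_stack.pop()
        elif tag == 'li' and self._item_stack:
            self._item_stack.pop()

    def handle_data(self, data):
        if self._del_depth:
            self.struck.append(data)
        if self._item_stack and self._item_stack[-1] is not None:
            self._item_stack[-1].append(data)

    @property
    def bullets(self):
        return [''.join(buf).strip() for buf in self._bullet_bufs]


def docx_to_html(path):
    """Run `pandoc <path> -t html` and return the HTML it writes to stdout."""
    result = subprocess.run(['pandoc', path, '-t', 'html'],
                            check=True, capture_output=True)
    return result.stdout.decode('utf-8')


def scan_html(html):
    """Return (struck text chunks, bullet item texts) found in the HTML."""
    scanner = FormatScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.struck, scanner.bullets
```

Usage would be along the lines of `struck, bullets = scan_html(docx_to_html(r'C:\stuff\Document.docx'))`.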
Source: https://stackoverflow.com/questions/46646654/reading-docx-files-in-python-to-find-strikethrough-bullets-and-other-formats