How to extract text from an existing docx file using python-docx

痴心易碎 提交于 2019-11-27 11:47:31

you can try this

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.

Without Installing python-docx

docx is basically is a zip file with several folders and files within it. In the link below you can find a simple function to extract the text from docx file, without need to install python-docx and lxml which sometimes create problem:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/

There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.

The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.

Neither reading nor writing hyperlinks are supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it.

you can try this also

from docx import Document

document = Document('demo.docx')
for para in document.paragraphs:
    print(para.text)

Using python-docx, as @Chinmoy Panda 's answer shows:

for para in doc.paragraphs:
    fullText.append(para.text)

However, para.text will lost the text in w:smarttag (Corresponding github issue is here: https://github.com/python-openxml/python-docx/issues/328), you should use the following function instead:

def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])

I had a similar issue so I found a workaround (remove hyperlink tags thanks to regular expressions so that only a paragraph tag remains). I posted this solution on https://github.com/python-openxml/python-docx/issues/85 BP

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!