How to extract text from an existing docx file using python-docx


I'm trying to use the python-docx module (pip install python-docx), but it seems very confusing, as in the GitHub repo test sample they are using


7 answers

野趣味

2020-11-27 16:29

Without installing python-docx

A docx file is basically a zip archive with several folders and files inside it. The link below has a simple function that extracts the text from a docx file without relying on python-docx or lxml, the latter sometimes being hard to install:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
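For readers who cannot follow the link, here is a minimal sketch of the same idea using only the standard library. It is not the code from the linked post; the function name `docx_to_text` is made up for illustration, and it assumes the main body lives in `word/document.xml` (headers, footers, and footnotes sit in other parts of the archive and are not read here).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one line per non-empty paragraph.

    Hypothetical helper: accepts a file path or a binary file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Collect every w:t descendant so nested runs are included too.
        texts = [node.text for node in para.iter(W_NS + "t") if node.text]
        if texts:
            paragraphs.append("".join(texts))
    return "\n".join(paragraphs)
```

Calling `docx_to_text('demo.docx')` would return the paragraphs joined by newlines. Paragraphs nested inside other paragraphs (text boxes, for example) may show up twice with this simple walk.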

一向

2020-11-27 16:32
Using python-docx, as @Chinmoy Panda's answer shows:
```
for para in doc.paragraphs:
    fullText.append(para.text)
```
However, para.text will lose the text inside w:smartTag elements (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:
```
def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])
```
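To illustrate why the descendant search `.//w:t` recovers more text, here is a small sketch that uses only `xml.etree.ElementTree` on a hand-written paragraph fragment. It does not call python-docx; it assumes, as the linked issue describes, that the missing text comes from runs that are not direct children of the paragraph. The sample XML and variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# A paragraph with one ordinary run and one run wrapped in w:smartTag.
PARAGRAPH_XML = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>Meet me in </w:t></w:r>"
    '<w:smartTag w:element="place"><w:r><w:t>Paris</w:t></w:r></w:smartTag>'
    "</w:p>"
)

para = ET.fromstring(PARAGRAPH_XML)

# Only runs that are direct children of the paragraph: the smartTag run is skipped.
direct_only = "".join(t.text or "" for t in para.findall("w:r/w:t", NS))

# Every w:t at any depth, which is what the './/w:t' XPath selects.
all_descendants = "".join(t.text or "" for t in para.findall(".//w:t", NS))

print(repr(direct_only))
print(repr(all_descendants))
```

Note that this sketch joins the pieces with an empty string, while `para2text` above joins them with a space; which is right depends on how the document splits its runs.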

误落风尘

2020-11-27 16:35

You can also try this:

from docx import Document

document = Document('demo.docx')
for para in document.paragraphs:
    print(para.text)


既然无缘

2020-11-27 16:35

I had a similar issue and found a workaround: remove the hyperlink tags with regular expressions so that only the paragraph tag remains. I posted this solution at https://github.com/python-openxml/python-docx/issues/85 BP
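The solution posted on that issue is not reproduced here. As a rough sketch of what stripping hyperlink tags with a regular expression could look like, the snippet below removes `w:hyperlink` open and close tags from a string of document XML while keeping the runs inside them. The pattern and the function name `strip_hyperlink_tags` are assumptions for illustration, not the poster's code.

```python
import re

# Matches opening and closing w:hyperlink tags, with or without attributes.
HYPERLINK_TAG = re.compile(r"</?w:hyperlink\b[^>]*>")


def strip_hyperlink_tags(document_xml):
    """Drop w:hyperlink wrappers so their runs become direct children of w:p."""
    return HYPERLINK_TAG.sub("", document_xml)


sample = (
    "<w:p>"
    '<w:hyperlink r:id="rId4"><w:r><w:t>link text</w:t></w:r></w:hyperlink>'
    "</w:p>"
)
cleaned = strip_hyperlink_tags(sample)
print(cleaned)
```

Regular expressions are a blunt tool for XML; this works only because the wrapper tags are removed whole and their contents are left untouched.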


渐次进展

2020-11-27 16:37

You can try this:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)


佛祖请我去吃肉

2020-11-27 16:43

You can use python-docx2txt, which is adapted from python-docx but can also extract text from links, headers, and footers. It can also extract images.
