I'm trying to use the python-docx module (pip install python-docx) but it seems to be very confusing, as in the GitHub repo test sample they are using
Without Installing python-docx
A docx file is basically a zip file with several folders and files within it. In the link below you can find a simple function to extract the text from a docx file without relying on python-docx and lxml, the latter sometimes being hard to install:
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
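As a rough sketch of the zip-based approach described above (not the code from the linked post), the following uses only the standard library. The function name docx_to_text is made up for illustration, and it assumes the main body lives at word/document.xml, which is the usual location inside a docx archive. It ignores tabs, line breaks, headers and footers.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph (hypothetical helper)."""
    # A .docx is a zip archive; the body is assumed to be at word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Concatenate every w:t text node found inside the paragraph.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)
```

Calling docx_to_text('demo.docx') should then return the body text as a single string, with paragraphs separated by newlines.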
Using python-docx, as @Chinmoy Panda's answer shows:
for para in doc.paragraphs:
    fullText.append(para.text)
However, para.text will lose the text inside w:smartTag elements (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:
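def para2text(p):
    # Collect every w:t node under the paragraph, including those nested in w:smartTag.
    rs = p._element.xpath('.//w:t')
    # An empty w:t has text None, so fall back to an empty string.
    return u" ".join([r.text or u"" for r in rs])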
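To see why searching all descendants helps, here is a small stand-alone illustration that uses only the standard library rather than python-docx. The XML snippet is a made-up paragraph, and the "direct runs only" lookup is an assumption that mimics the behaviour reported in the issue, not python-docx's actual code.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Made-up paragraph: one ordinary run plus one run wrapped in w:smartTag.
PARAGRAPH_XML = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:r><w:t>Meet me in </w:t></w:r>"
    "<w:smartTag><w:r><w:t>Paris</w:t></w:r></w:smartTag>"
    "</w:p>"
)

p = ET.fromstring(PARAGRAPH_XML)

# Only runs that are direct children of the paragraph (assumed to mirror para.text).
direct_only = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))

# Every w:t anywhere below the paragraph, like the './/w:t' XPath above.
all_descendants = "".join(t.text or "" for t in p.findall(".//w:t", NS))

print(repr(direct_only))
print(repr(all_descendants))
```

The first lookup misses the word wrapped in w:smartTag, while the second one picks it up.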
You can also try this:
from docx import Document
document = Document('demo.docx')
for para in document.paragraphs:
    print(para.text)
I had a similar issue, so I found a workaround: remove the hyperlink tags with regular expressions so that only the paragraph tags remain. I posted this solution at https://github.com/python-openxml/python-docx/issues/85 BP
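The idea of stripping hyperlink tags could look roughly like this. It is a sketch of the general approach, not the exact code posted in the issue. The function name strip_hyperlink_tags and the sample XML are made up for illustration, and applying it to a real file would also mean reading and rewriting word/document.xml inside the docx archive.

```python
import re

# Matches opening <w:hyperlink ...> tags and closing </w:hyperlink> tags,
# leaving the runs between them in place.
HYPERLINK_TAG = re.compile(r"</?w:hyperlink\b[^>]*>")


def strip_hyperlink_tags(document_xml):
    """Remove w:hyperlink wrappers so their runs become direct children of w:p."""
    return HYPERLINK_TAG.sub("", document_xml)


# Made-up fragment of document XML containing one hyperlink.
sample = (
    "<w:p><w:r><w:t>See </w:t></w:r>"
    '<w:hyperlink r:id="rId4"><w:r><w:t>the docs</w:t></w:r></w:hyperlink>'
    "</w:p>"
)
print(strip_hyperlink_tags(sample))
```

Once the wrappers are gone, the link text sits in ordinary runs, so it is no longer skipped when only the paragraph's own runs are read.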
You can try this:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
You can use python-docx2txt, which is adapted from python-docx but can also extract text from links, headers, and footers. It can also extract images.