How to extract text from an existing docx file using python-docx

不思量自难忘° 2020-11-27 15:59

I'm trying to use the python-docx module (pip install python-docx), but it seems very confusing, as in the GitHub repo test sample they are using

7 Answers
  • 2020-11-27 16:29

    Without Installing python-docx

    A docx file is basically a zip archive with several folders and files inside it. The link below has a simple function that extracts the text from a docx file without relying on python-docx or lxml, the latter sometimes being hard to install:

    http://etienned.github.io/posts/extract-text-from-word-docx-simply/
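
    As a rough idea of what such a function can look like, here is a minimal sketch that uses only the standard library. It is not the code from the linked post: the name docx_to_text is hypothetical, and it assumes the body text lives in word/document.xml and that joining the w:t nodes of each w:p is good enough. Headers, footers, tabs and line breaks are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per non-empty paragraph.

    Hypothetical helper for illustration; it only reads word/document.xml.
    """
    # A .docx file is a zip archive, so the main part can be read directly
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Concatenate every w:t descendant of this paragraph
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

    Calling docx_to_text('demo.docx') would then return the paragraphs joined by newlines.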

  • 2020-11-27 16:32

    Using python-docx, as @Chinmoy Panda's answer shows:

    for para in doc.paragraphs:
        fullText.append(para.text)
    

    However, para.text will lose any text inside w:smartTag elements (the corresponding GitHub issue is https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:

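    def para2text(p):
        # Collect every w:t node under the paragraph, including ones nested in w:smartTag
        rs = p._element.xpath('.//w:t')
        # An empty w:t has text None, so fall back to an empty string
        return u" ".join([r.text or u"" for r in rs])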
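
    To see why the descendant search matters, here is a small sketch that uses the standard library instead of python-docx. The XML fragment is hand-written for illustration, and the two helper names are hypothetical. direct_run_text only mimics the idea of reading runs that are direct children of the paragraph; it is not python-docx's actual implementation. Unlike para2text above, the pieces are joined without a separator.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written paragraph (an assumption): the middle run is wrapped in w:smartTag
PARAGRAPH_XML = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t>Meet me in </w:t></w:r>'
    '<w:smartTag w:uri="urn:example" w:element="place">'
    '<w:r><w:t>Paris</w:t></w:r>'
    '</w:smartTag>'
    '<w:r><w:t> tomorrow</w:t></w:r>'
    '</w:p>' % W
)


def direct_run_text(p):
    """Text of runs that are direct children of the paragraph only."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


def all_descendant_text(p):
    """Text of every w:t below the paragraph, however deeply nested."""
    return "".join(t.text or "" for t in p.findall(".//w:t", NS))


p = ET.fromstring(PARAGRAPH_XML)
print(repr(direct_run_text(p)))      # the word inside w:smartTag is missing
print(repr(all_descendant_text(p)))  # the nested word is included
```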
    
  • 2020-11-27 16:35

    You can also try this:

    from docx import Document
    
    document = Document('demo.docx')
    for para in document.paragraphs:
        print(para.text)
    
  • 2020-11-27 16:35

    I had a similar issue, so I found a workaround: remove the hyperlink tags with regular expressions so that only the paragraph tag remains. I posted this solution at https://github.com/python-openxml/python-docx/issues/85 BP
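
    The idea can be sketched roughly as follows. This is not the code posted in that issue: the pattern and the name strip_hyperlink_tags are assumptions. It only shows the regular-expression step on the raw XML string of word/document.xml, not how the cleaned XML is handed back to python-docx.

```python
import re

# Matches the opening <w:hyperlink ...> tag and the closing </w:hyperlink> tag,
# but leaves the runs between them untouched
HYPERLINK_TAG = re.compile(r"</?w:hyperlink\b[^>]*>")


def strip_hyperlink_tags(document_xml):
    """Remove hyperlink wrapper tags so their runs sit directly in the paragraph."""
    return HYPERLINK_TAG.sub("", document_xml)
```

    After this step the runs that were inside the hyperlink are direct children of the paragraph, so their text is no longer skipped.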

  • 2020-11-27 16:37

    You can try this:

    import docx
    
    def getText(filename):
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        return '\n'.join(fullText)
    
  • 2020-11-27 16:43

    You can use python-docx2txt, which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.
