How do I write a python script that can read doc/docx files and convert them to txt?

终归单人心 2021-01-29 09:26

Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files an

3 Answers
  • 2021-01-29 10:08

    I figured this would make an interesting quick programming project. This has only been tested on a simple .docx file containing "Hello, world!", but the train of logic should give you a place to work from to parse more complex documents.

    from shutil import copyfile, rmtree
    import sys
    import os
    import zipfile
    from lxml import etree
    
    # command format: python3 docx_to_txt.py Hello.docx
    
    # let's get the file name
    zip_dir = sys.argv[1]
    # cut off the .docx, make it a .zip
    zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
    # make a copy of the .docx and put it in .zip
    copyfile(zip_dir, zip_dir_zip_ext)
    # unzip the .zip
    zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
    zip_ref.extractall('./temp')
    # get the xml out of /word/document.xml
    data = etree.parse('./temp/word/document.xml')
    # we'll want to go over all 't' elements in the xml node tree.
    # note that MS office uses namespaces and that the w must be defined in the namespaces dictionary args
    # each :t element is the "text" of the file. that's what we're looking for
    # result is a list filled with the text of each t node in the xml document model
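    # (empty w:t elements have no text, so they are skipped)
    result = [node.text.strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}) if node.text]
    # dump result into a new .txt file (utf-8 so non-ASCII text is written the same way on every platform)
    with open(os.path.splitext(zip_dir)[0]+'.txt', 'w', encoding='utf-8') as txt: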
        # join the elements of result together since txt.write can't take lists
        joined_result = '\n'.join(result)
        # write it into the new file
        txt.write(joined_result)
    # close the zip_ref file
    zip_ref.close()
    # get rid of our mess of working directories
    rmtree('./temp')
    os.remove(zip_dir_zip_ext)
    

    I'm sure there's a more elegant or pythonic way to accomplish this. You'll need to have the file you want to convert in the same directory as the python file. Command format is python3 docx_to_txt.py file_name.docx
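
A possible variant that skips the temporary copy and the extraction step: a .docx file is already a zip archive, so word/document.xml can be read straight out of it with the standard library. This is a minimal sketch rather than a drop-in replacement for the script above. The name docx_to_text is made up for illustration, and only word/document.xml is read, so headers, footers and footnotes (which live in other parts of the archive) are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p and w:t elements
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the text of a .docx file, one line per paragraph."""
    # a .docx is a zip archive; the document body lives in word/document.xml
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # a paragraph is split into runs; join their w:t pieces back together
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(pieces))
    return "\n".join(lines)
```

Unlike the script above, this joins the runs of one paragraph on a single line instead of writing each w:t on its own line. The returned string can be written out with open(..., 'w', encoding='utf-8') in the same way as before.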

  • 2021-01-29 10:12

    conda install -c conda-forge python-docx

    from docx import Document

    # file is the path to the .docx you want to read
    doc = Document(file)

    for p in doc.paragraphs:
        print(p.text)
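
Since the question asks for .txt files rather than printed output, the loop above could be pointed at a file instead. A rough sketch, assuming python-docx is installed; write_paragraphs and docx_to_txt are names invented here, not part of python-docx:

```python
import os


def write_paragraphs(paragraphs, txt_path):
    """Write the .text of each paragraph-like object to txt_path, one per line."""
    with open(txt_path, "w", encoding="utf-8") as out:
        for p in paragraphs:
            out.write(p.text + "\n")


def docx_to_txt(docx_path):
    """Convert one .docx file to a .txt file next to it and return the new path."""
    # imported here so the helper above can be used without python-docx
    from docx import Document

    txt_path = os.path.splitext(docx_path)[0] + ".txt"
    write_paragraphs(Document(docx_path).paragraphs, txt_path)
    return txt_path
```

Calling docx_to_txt("report.docx") would then write report.txt in the same folder. Text inside tables is not part of doc.paragraphs, so it is left out here.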

  • 2021-01-29 10:20

    I thought I would share my approach. It boils down to two commands that convert either a .doc or a .docx file to a string; each one requires its own package:

    import docx
    import os
    import glob
    import subprocess
    import sys
    
    # .docx (pip3 install python-docx)
    # infile is the path to the document; p.text is already a str, so no encode/decode is needed
    doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
    # .doc (apt-get install antiword)
    doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
    

    I then wrap these solutions up in a function that can either return the result as a Python string or write it to a file (with the option of appending or replacing).

    import docx
    import os
    import glob
    import subprocess
    import sys
    
    def doc2txt(infile, outfile, return_string=False, append=False):
        if os.path.exists(infile):
            if infile.endswith(".docx"):
                try:
                    doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
                except Exception as e:
                    print("Exception in converting .docx to str: ", e)
                    return None
            elif infile.endswith(".doc"):
                try:
                    doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
                except Exception as e:
                    print("Exception in converting .doc to str: ", e)
                    return None
            else:
                print("{0} is not .doc or .docx".format(infile))
                return None
    
            if return_string:
                return doctext
            else:
                writemode = "a" if append else "w"
                # the with block closes the file on exit
                with open(outfile, writemode) as f:
                    f.write(doctext)
        else:
            print("{0} does not exist".format(infile))
            return None
    

    I then would call this function via something like:

    files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
    outfile = "/path/to/out.txt"
    for file in files:
        doc2txt(file, outfile, return_string=False, append=True)
    

    I don't need to perform this operation often, but so far the script has worked for all my needs. If you find a bug in this function, let me know in a comment.
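
The loop above appends everything to a single out.txt. If one .txt per document is wanted, as the question describes, a variation along these lines might do. It is only a sketch: convert_tree is a made-up name, and extract_text stands in for any function that takes a path and returns a string, for example lambda p: doc2txt(p, None, return_string=True).

```python
from pathlib import Path


def convert_tree(folder, extract_text):
    """Write a .txt file next to every .doc/.docx file under folder.

    extract_text is any callable that takes a path string and returns the
    document text, or None if the conversion failed.
    """
    written = []
    # sorted() materialises the listing first, so files written below
    # are not picked up while we are still iterating
    for path in sorted(Path(folder).rglob("*")):
        # compare the suffix case-insensitively so .DOCX is picked up too
        if not path.is_file() or path.suffix.lower() not in (".doc", ".docx"):
            continue
        # Word leaves lock files such as ~$report.docx next to open documents
        if path.name.startswith("~$"):
            continue
        text = extract_text(str(path))
        if text is None:
            continue
        target = path.with_suffix(".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

One caveat with this naming scheme: report.doc and report.docx in the same folder would both map to report.txt, and the later one would overwrite the earlier.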
