How do I write a Python script that can read .doc/.docx files and convert them to .txt?

终归单人心 2021-01-29 09:26

Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files an

3 answers
  •  遥遥无期
    2021-01-29 10:20

    I thought I would share my approach. It boils down to two commands that convert either a .doc or a .docx file to a string. Each option requires its own package:

    import docx
    import subprocess
    
    # infile is the path to the document you want to convert
    # .docx (pip3 install python-docx)
    doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
    # .doc (apt-get install antiword)
    doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
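
Since both options rely on something installed outside the script, it can help to check for them before converting a whole folder. The following is only a sketch: the function name `missing_dependencies` is made up for illustration and is not part of the answer above. It assumes that python-docx is importable as `docx` and that antiword is found through `PATH`.

```python
import importlib.util
import shutil


def missing_dependencies():
    """Return the names of the external requirements that cannot be found."""
    missing = []
    # python-docx installs a module named "docx"
    if importlib.util.find_spec("docx") is None:
        missing.append("python-docx")
    # antiword is a command-line program, so look for it on PATH
    if shutil.which("antiword") is None:
        missing.append("antiword")
    return missing


if __name__ == "__main__":
    for name in missing_dependencies():
        print("Missing requirement: {0}".format(name))
```

An empty list means both conversion paths should be usable on this machine.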
    

    I then wrap these solutions in a function that can either return the result as a Python string or write it to a file (with the option of appending or replacing).

    import docx
    import os
    import subprocess
    
    def doc2txt(infile, outfile, return_string=False, append=False):
        """Convert a .doc or .docx file to text.
    
        Returns the text if return_string is True; otherwise writes it to outfile.
        Returns None if the file is missing, unsupported, or fails to convert.
        """
        if not os.path.exists(infile):
            print("{0} does not exist".format(infile))
            return None
    
        if infile.endswith(".docx"):
            try:
                doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
            except Exception as e:
                print("Exception in converting .docx to str: ", e)
                return None
        elif infile.endswith(".doc"):
            try:
                doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
            except Exception as e:
                print("Exception in converting .doc to str: ", e)
                return None
        else:
            print("{0} is not .doc or .docx".format(infile))
            return None
    
        if return_string:
            return doctext
    
        # "a" appends to an existing file, "w" replaces it
        writemode = "a" if append else "w"
        with open(outfile, writemode, encoding="utf-8") as f:
            f.write(doctext)
    

    I would then call this function with something like:

    files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
    outfile = "/path/to/out.txt"
    for file in files:
        doc2txt(file, outfile, return_string=False, append=True)
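
The call above appends every document to a single out.txt. The question asks for one .txt file per document, which could be sketched as below. The helper names `txt_path_for` and `convert_each` and the `outdir` parameter are hypothetical additions, not part of the original answer. The converter is passed in as a callable so the sketch does not depend on python-docx or antiword.

```python
from pathlib import Path


def txt_path_for(infile, outdir):
    """Map /some/dir/report.docx to <outdir>/report.txt."""
    return Path(outdir) / (Path(infile).stem + ".txt")


def convert_each(files, outdir, convert):
    """Call convert(infile, outfile) once per input file.

    `convert` is any callable taking (infile, outfile), for example
    lambda i, o: doc2txt(i, o) with the function from this answer.
    """
    # Create the output directory if it does not exist yet
    Path(outdir).mkdir(parents=True, exist_ok=True)
    written = []
    for infile in files:
        outfile = txt_path_for(infile, outdir)
        convert(str(infile), str(outfile))
        written.append(outfile)
    return written
```

With the `files` list from the glob call, this might be used as `convert_each(files, "/path/to/outdir", lambda i, o: doc2txt(i, o))`. Note that two inputs with the same base name (say report.doc and report.docx, or files in different subfolders) would map to the same output file, so the later one would overwrite the earlier one.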
    

    It's not often I need to perform this operation, but so far the script has worked for all my needs. If you find a bug in this function, let me know in a comment.
