Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files an
Thought I would share my approach, basically boils down to two commands that convert either .doc
or .docx
to a string, both options require a certain package:
import docx
import os
import glob
import subprocess
import sys
# .docx (pip3 install python-docx)
doctext = "\n".join(i.text.encode("utf-8").decode("utf-8") for i in docx.Document(infile).paragraphs)
# .doc (apt-get install antiword)
doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
I then wrap these solutions up in a function, that can either return the result as a python string, or write to a file (with the option of appending or replacing).
import docx
import os
import glob
import subprocess
import sys
def doc2txt(infile, outfile, return_string=False, append=False):
if os.path.exists(infile):
if infile.endswith(".docx"):
try:
doctext = "\n".join(i.text.encode("utf-8").decode("utf-8") for i in docx.Document(infile).paragraphs)
except Exception as e:
print("Exception in converting .docx to str: ", e)
return None
elif infile.endswith(".doc"):
try:
doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
except Exception as e:
print("Exception in converting .docx to str: ", e)
return None
else:
print("{0} is not .doc or .docx".format(infile))
return None
if return_string == True:
return doctext
else:
writemode = "a" if append==True else "w"
with open(outfile, writemode) as f:
f.write(doctext)
f.close()
else:
print("{0} does not exist".format(infile))
return None
I then would call this function via something like:
files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
outfile = "/path/to/out.txt"
for file in files:
doc2txt(file, outfile, return_string=False, append=True)
It's not often I need to perform this operation, but up until now the script has worked for all my needs, if you find this function has a bug let me know in a comment.