Question
I have one project where I need to apply a dozen or so regexes to about 100 files using Python. After 4+ hours of searching the web for various combinations, including "(merge|concatenate|stack|join|compile) multiple regex in python", I haven't found any posts that address my need.
This is a mid-sized project for me. There are several smaller regex projects that I need, which take only 5-6 regex patterns applied to only a dozen or so files. While these will be a great aid in my work, the granddaddy project is applying a file of 100+ search-and-replace strings to any new file I get. (Spelling conventions in certain languages are not standardized, and being able to quickly process files will increase productivity.)
Ideally, the regex strings need to be updatable by a non-programmer, but that may be outside the scope of this post.
Here is what I have so far:
import os, re, sys # Is "sys" necessary?

path = "/Users/mypath/testData"
myfiles = os.listdir(path)

for f in myfiles:
    # split the filename and file extension for use in renaming the output file
    file_name, file_extension = os.path.splitext(f)
    generated_output_file = file_name + "_regex" + file_extension

    # Only process certain types of files.
    if re.search("txt|doc|odt|htm|html", file_extension):
        # Declare input and output files, open them, and start working on each line.
        input_file = os.path.join(path, f)
        output_file = os.path.join(path, generated_output_file)

        with open(input_file, "r") as fi, open(output_file, "w") as fo:
            for line in fi:
                # I realize that the examples are not regex, but they are in my real data.
                # The important thing is that each of these is a substitution.
                line = re.sub(r"dog", "cat", line)
                line = re.sub(r"123", "789", line)
                # Etc.

            # Obviously this doesn't work, because it is only writing the last instance of line.
            fo.write(line)
            fo.close()
Answer 1:
Is this what you're looking for?
Unfortunately, you didn't specify how you know which regexes are supposed to be applied, so I put them into a list of tuples (the first element is the regex, the second is the replacement text).
import os, os.path, re

path = "/Users/mypath/testData"
myfiles = os.listdir(path)

# it's faster to compile your regexes once, before you
# actually use them in a loop
REGEXES = [(re.compile(r'dog'), 'cat'),
           (re.compile(r'123'), '789')]

for f in myfiles:
    # split the filename and file extension for use in
    # renaming the output file
    file_name, file_extension = os.path.splitext(f)
    generated_output_file = file_name + "_regex" + file_extension

    # As l4mpi said ... if odt is zipped, you'd need to unzip it first
    # a plain membership test is simpler and faster than re.search here
    if file_extension in ('.txt', '.doc', '.odt', '.htm', '.html'):
        # Declare input and output files, open them,
        # and start working on each line.
        input_file = os.path.join(path, f)
        output_file = os.path.join(path, generated_output_file)

        with open(input_file, "r") as fi, open(output_file, "w") as fo:
            for line in fi:
                # apply every substitution, in order, to the current line
                for search, replace in REGEXES:
                    line = search.sub(replace, line)
                # write each line inside the loop, once all substitutions are done
                fo.write(line)
        # both the input and output files are closed automatically
        # after the with statement closes
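The question also asks for the patterns to be editable by a non-programmer. One possible way to get there, as a rough sketch rather than a tested solution, is to keep the pairs in a plain text file and build the list of tuples from it. The file format here (one "pattern, tab, replacement" rule per line, with "#" starting a comment line) and the helper names load_rules and apply_rules are assumptions made for illustration; they are not from the original post.

```python
import re


def load_rules(lines):
    """Parse 'pattern<TAB>replacement' lines into (compiled_regex, replacement) pairs.

    Assumed format (hypothetical): one rule per line, blank lines and
    lines starting with '#' are ignored.
    """
    rules = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        pattern, sep, replacement = line.partition("\t")
        if not sep:
            raise ValueError(f"missing tab separator in rule: {line!r}")
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(text, rules):
    """Apply each substitution in order; later rules see the output of earlier ones."""
    for regex, replacement in rules:
        text = regex.sub(replacement, text)
    return text


# Stand-in for the contents of a rules file. With a real file you would
# pass the open file object instead, e.g. load_rules(open_file_handle).
sample_rules = [
    "# pattern<TAB>replacement",
    "dog\tcat",
    r"\b123\b" + "\t789",
]
rules = load_rules(sample_rules)
print(apply_rules("my dog ate 123 bones", rules))  # prints: my cat ate 789 bones
```

In the loop above, apply_rules(line, rules) would take the place of the inner "for search, replace in REGEXES" loop. Two caveats with this sketch: a pattern that itself begins with "#" would be treated as a comment, and the replacement text follows the usual re.sub rules, so backslashes and group references such as \1 are interpreted.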
Source: https://stackoverflow.com/questions/12551338/multiple-regex-substitution-in-multiple-files-using-python