Question
I have a folder with multiple text files inside that I need to process and format using multiple replacement lists looking like this:
old string1~new string1
old string2~new string2
etc~blah
I run each replacement pair from the replacement lists against every line of those text files. I currently have a set of Python scripts that perform this operation. What I wonder is whether switching to sed or awk would make the code simpler and more maintainable. Would that be a better solution, or should I improve my Python code instead? I ask because the incoming text files arrive on a regular basis and often have a slightly different structure than before (mistakes, misspellings, multiple spaces), since these files are created by humans. So I constantly have to tweak my code and replacement lists to make everything work properly. Thanks.
Answer 1:
Unless your Python code is really bad, switching to awk is unlikely to make it more maintainable. That said, it's pretty simple in awk, but it does not scale well:
cat replacement-list-files* | awk 'FILENAME == "-" {
split( $0, a, "~" ); repl[ a[1] ] = a[2]; next }
{ for( i in repl ) gsub( i, repl[i] ) }1' - input-file
Note that this works on one file at a time. Replace the trailing 1 with something like { print > ( FILENAME ".new" ) } to work on multiple files, but then you have to deal with closing the files if you want to work on a large number of files, and it quickly becomes an unmaintainable mess. Stick with Python if you already have a working solution.
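For comparison, here is a minimal Python sketch of the same idea: load the `old~new` pairs, then apply each one to the text as a regular expression. The function names `load_pairs` and `apply_regex_pairs` are made up for illustration and are not taken from the asker's scripts. Unlike awk's gsub, the replacement text is inserted verbatim here, so `&` and backslashes get no special treatment.

```python
import re


def load_pairs(lines, sep="~"):
    """Parse 'old~new' lines into an ordered list of (old, new) pairs."""
    pairs = []
    for raw in lines:
        line = raw.rstrip("\n")
        if sep not in line:
            continue  # skip blank or malformed lines
        old, new = line.split(sep, 1)
        if old:  # an empty pattern would match at every position
            pairs.append((old, new))
    return pairs


def apply_regex_pairs(text, pairs):
    """Apply each pair in list order, treating 'old' as a regular expression."""
    for old, new in pairs:
        # A callable replacement makes re.sub insert 'new' verbatim.
        text = re.sub(old, lambda match, new=new: new, text)
    return text


pairs = load_pairs(["old string1~new string1\n", "etc~blah\n"])
print(apply_regex_pairs("old string1 etc", pairs))  # prints: new string1 blah
```

Keeping the pairs in a list rather than a dict preserves the order of the replacement list, which matters when one replacement produces text that a later one is meant to match.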
Answer 2:
Here's the regular expression replacement script (mostly just cosmetically different from what @WilliamPursell posted):
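awk -F'~' '
# First file (the mapping file): remember each old~new pair.
NR==FNR { map[$1] = $2; next }
# Remaining files: apply every pair as a regexp replacement.
{
    for (old in map) {
        gsub(old, map[old])
    }
    print
}
' /wherever/mappingFile file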
but here's the string replacement script that I think you really need:
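awk -F'~' '
# First file (the mapping file): remember each old~new pair.
NR==FNR { map[$1] = $2; next }
# Remaining files: replace each old string literally, with no regexp interpretation.
{
    for (old in map) {
        rlength = length(old)
        # Caution: this loops forever if map[old] itself contains old.
        while (rstart = index($0, old)) {
            $0 = substr($0, 1, rstart-1) map[old] substr($0, rstart+rlength)
        }
    }
    print
}
' /wherever/mappingFile file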
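The difference between regexp and plain-string replacement can be shown with a small Python sketch. `replace_literal` is a hypothetical helper name, and the sample pair `a.b` is invented for the demonstration. Note that `str.replace` scans the text once, so a replacement that contains the old string does not cause repeated substitution.

```python
import re


def replace_literal(text, pairs):
    """Replace each 'old' as a plain string; regex metacharacters are not special."""
    for old, new in pairs:
        if old:  # an empty 'old' would insert 'new' between every character
            text = text.replace(old, new)
    return text


pairs = [("a.b", "X")]
print(replace_literal("a.b axb", pairs))  # prints: X axb  (only the literal "a.b")
print(re.sub("a.b", "X", "a.b axb"))      # prints: X X    (as a regex, "." matches "x")
```

If the replacement lists contain characters such as `.`, `*`, `(` or `+` that are meant literally, the plain-string variant is the safer default. Alternatively, `re.escape` can be applied to the old strings before they go to `re.sub`.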
In either case just enclose it in a shell loop to affect multiple files:
for file in *
do
awk -F'~' '...' /wherever/mappingFile "$file" > tmp && mv tmp "$file"
done
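A Python counterpart of this loop might look like the sketch below. It writes each result to a temporary file and then swaps it over the original, mirroring `> tmp && mv tmp "$file"`. The names `rewrite_in_place` and `rewrite_folder`, the `*.txt` pattern, and the UTF-8 encoding are all assumptions for illustration. `transform` stands for whatever per-line replacement function is in use.

```python
import os
import tempfile
from pathlib import Path


def rewrite_in_place(path, transform):
    """Write transform(line) for every line to a temp file, then swap it over the original."""
    path = Path(path)
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        # newline="" keeps the original line endings untouched.
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as out, \
                open(path, encoding="utf-8", newline="") as src:
            for line in src:
                out.write(transform(line))
        os.replace(tmp_name, path)  # only reached if every line was written
    except BaseException:
        os.unlink(tmp_name)  # leave the original file as it was
        raise


def rewrite_folder(folder, transform, pattern="*.txt"):
    """Apply transform to every regular file in 'folder' that matches 'pattern'."""
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            rewrite_in_place(path, transform)


# Example call (hypothetical folder name):
# rewrite_folder("incoming", lambda line: line.replace("etc", "blah"))
```

One caveat of this sketch: `tempfile.mkstemp` creates the file with owner-only permissions, so the rewritten file does not inherit the original's mode unless it is copied explicitly, for example with `shutil.copymode`.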
Source: https://stackoverflow.com/questions/15829445/multiple-regex-replacements-based-on-lists-in-multiple-files