Multiple regex replacements based on lists in multiple files

Submitted by 烈酒焚心 on 2019-12-25 02:46:48

Question


I have a folder with multiple text files inside that I need to process and format using multiple replacement lists looking like this:

old string1~new string1
old string2~new string2
etc~blah

I run each replacement pair from the replacement lists against every line of those text files. I currently have a set of Python scripts that performs this operation. What I wonder is whether switching to sed or awk would make the code simpler and more maintainable, or whether I should improve my Python code instead. I ask because new text files arrive regularly and often differ slightly in structure from earlier ones (mistakes, misspellings, multiple spaces), since they are created by humans. So I constantly have to tweak my code and replacement lists to make everything work properly. Thanks.


Answer 1:


Unless your Python code is really bad, switching to awk is unlikely to make it more maintainable. That said, it's pretty simple in awk, but does not scale well:

cat replacement-list-files* | awk 'FILENAME == "-" { 
  split( $0, a, "~" ); repl[ a[1] ] = a[2]; next }
  { for( i in repl ) gsub( i, repl[i] ) }1' - input-file

Note that this works on one file at a time. Replace the trailing 1 with something like { print > ( FILENAME ".new" ) } to work on multiple files, but then you have to deal with closing the files if you want to work on a large number of files, and it quickly becomes an unmaintainable mess. Stick with Python if you already have a working solution.
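
For comparison, here is a minimal Python sketch of the same idea, not the asker's actual scripts. The function names (load_pairs, apply_pairs, process_folder), the UTF-8 encoding, and the "*.txt" file pattern are all assumptions made for illustration. It reads the old~new pairs once, applies them in list order, and rewrites only the files that changed.

```python
import re
from pathlib import Path


def load_pairs(list_path):
    """Read 'old~new' lines into a list of (old, new) tuples, in file order."""
    pairs = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        if "~" not in line:
            continue  # skip blank or malformed lines
        old, new = line.split("~", 1)
        if not old:
            continue  # an empty search string would match everywhere
        pairs.append((old, new))
    return pairs


def apply_pairs(text, pairs, use_regex=False):
    """Apply every pair to text in order, as literals or as regexes."""
    for old, new in pairs:
        if use_regex:
            text = re.sub(old, new, text)
        else:
            text = text.replace(old, new)
    return text


def process_folder(folder, list_paths, use_regex=False):
    """Rewrite each *.txt file in folder (assumed pattern) in place."""
    pairs = [pair for lp in list_paths for pair in load_pairs(lp)]
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        updated = apply_pairs(original, pairs, use_regex)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
```

Keeping the pairs in a list rather than a dict preserves the order of the replacement file, which matters when one replacement produces text that a later one should match.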




Answer 2:


Here's the regular expression replacement script (mostly just cosmetically different from what @WilliamPursell posted):

   awk -F'~' '
   NR==FNR{ map[$1] = $2; next }
   {
      for (old in map) {
         gsub(old,map[old])
      }
      print
   }
   ' /wherever/mappingFile file

but here's the string replacement script that I think you really need:

   awk -F'~' '
   NR==FNR{ map[$1] = $2; next }
   {
      for (old in map) {
         rlength = length(old)
         # loops forever if the new string itself contains the old string
         while (rstart = index($0,old)) {
            $0 = substr($0,1,rstart-1) map[old] substr($0,rstart+rlength)
         }
      }
      print
   }
   ' /wherever/mappingFile file

In either case just enclose it in a shell loop to affect multiple files:

for file in *
do
   awk -F'~' '...' /wherever/mappingFile "$file" > tmp && mv tmp "$file"
done
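
To illustrate why the distinction between the regular expression version and the string version matters, here is a small Python sketch (the sample line and the replacement pair are made up for illustration). When the old string is treated as a regex, a character such as "." matches any character, so the replacement can hit text it was never meant to touch.

```python
import re

# Hypothetical sample data, not taken from the question.
line = "price: 3.50 (was 3x50)"
old, new = "3.50", "4.00"

# Treated as a regex, "." matches any character, so "3x50" is replaced too.
as_regex = re.sub(old, new, line)

# Treated as a literal string, only the exact characters are replaced.
as_literal = line.replace(old, new)

# Escaping the pattern makes the regex engine behave like the literal version.
as_escaped = re.sub(re.escape(old), new, line)

print(as_regex)    # price: 4.00 (was 4.00)
print(as_literal)  # price: 4.00 (was 3x50)
print(as_escaped)  # price: 4.00 (was 3x50)
```

If the replacement lists are written by people who are not thinking in regex terms, the literal form is usually the safer default.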


Source: https://stackoverflow.com/questions/15829445/multiple-regex-replacements-based-on-lists-in-multiple-files
