awk + bash: combining arbitrary number of files

扶醉桌前 submitted on 2020-01-16 13:17:23

Question


I have a script that takes a number of data files with identical layout but different data and combines a specified data column into a new file, like this:

gawk '{
        names[$1]= 1;
        data[$1,ARGIND]= $2
} END {
        for (i in names) print i"\t"data[i,1]"\t"data[i,2]"\t"data[i,3]
}' $1 $2 $3 > combined_data.txt

... where the row IDs can be found in the first column, and the interesting data in the second column.

This works nicely, but only for a fixed number of files. I could add $4 $5 ... $n to the last line, up to whatever maximum number of files I think I need, and add the matching "\t"data[i,4]"\t"data[i,5] ... "\t"data[i,n] to the print statement above it. That does seem to work even when fewer than n files are given (awk seems to ignore that n is larger than the number of input files), but it is an "ugly" solution. Is there a way to make this script (or something that gives the same result) take an arbitrary number of input files?

Or, even better, can you somehow incorporate a find in there that searches through subfolders and finds files matching some criterion?

Here is some sample data:

file.1:

A      554
B       13
C      634
D       84
E        9

file.2:

C      TRUE
E      TRUE
F      FALSE

expected output:

A      554
B       13
C      634       TRUE
D       84
E        9       TRUE
F                FALSE

Answer 1:


This may be what you're looking for (uses GNU awk for ARGIND just like your original script):

$ cat tst.awk
BEGIN { OFS="\t" }
!seen[$1]++ { keys[++numKeys]=$1 }
{ vals[$1,ARGIND]=$2 }
END {
    for (rowNr=1; rowNr<=numKeys; rowNr++) {
        key = keys[rowNr]
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}

$ awk -f tst.awk file1 file2
A       554
B       13
C       634     TRUE
D       84
E       9       TRUE
F               FALSE

If you don't care about the order the rows are output in then all you need is:

BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
    for (key in keys) {
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}
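
ARGIND is specific to GNU awk. A minimal portable sketch, assuming a POSIX awk and no empty input files, counts files with FNR == 1 instead. The names combine_cols and fileNr are made up for this example. An empty file never triggers FNR == 1, so it would be skipped rather than shown as a blank column.

```shell
#!/bin/sh
# combine_cols FILE...: merge column 2 of every file, keyed on column 1.
# Hypothetical helper; fileNr stands in for gawk's ARGIND.
combine_cols() {
    awk '
    BEGIN { OFS = "\t" }
    FNR == 1 { fileNr++ }                  # first line of each non-empty file
    !seen[$1]++ { keys[++numKeys] = $1 }   # remember keys in first-seen order
    { vals[$1, fileNr] = $2 }
    END {
        for (rowNr = 1; rowNr <= numKeys; rowNr++) {
            key = keys[rowNr]
            printf "%s", key
            for (colNr = 1; colNr <= fileNr; colNr++)
                printf "%s%s", OFS, vals[key, colNr]
            printf "%s", ORS
        }
    }' "$@"
}
```

It would be called as combine_cols file1 file2 file3 > combined_data.txt, with any number of file arguments.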



Answer 2:


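You can read an arbitrary number of files with a redirected getline over the ARGV list, doing all the work in BEGIN and calling exit so that awk's default file processing never runs:

awk 'BEGIN {
  for (i = 1; i < ARGC; ++i) {              # ARGV[1] .. ARGV[ARGC-1] are the files
    while ((getline < ARGV[i]) > 0) {       # > 0 stops at EOF and on read errors
      ...
    }
    close(ARGV[i])
  }
  <END-type code>
  exit}' $(find -type f ...)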
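
For the find part of the question, one possible sketch is below. It assumes a find, sort and xargs that support NUL-delimited names (-print0, -z, -0, as in the GNU and BSD tools), so that file names containing spaces survive. The helper name combine_found and its two parameters are hypothetical. Columns follow the sorted path order; rows come out in no particular order.

```shell
#!/bin/sh
# combine_found DIR PATTERN: hypothetical helper that merges column 2 of
# every file under DIR whose name matches PATTERN, keyed on column 1.
combine_found() {
    find "$1" -type f -name "$2" -print0 |
        sort -z |                          # fixed column order: sorted paths
        xargs -0 awk '
        BEGIN { OFS = "\t" }
        FNR == 1 { fileNr++ }              # portable stand-in for ARGIND
        { vals[$1, fileNr] = $2; keys[$1] }
        END {
            for (key in keys) {            # row order is unspecified
                printf "%s", key
                for (colNr = 1; colNr <= fileNr; colNr++)
                    printf "%s%s", OFS, vals[key, colNr]
                printf "%s", ORS
            }
        }'
}
```

A call such as combine_found . '*.dat' | sort > combined_data.txt would then cover all matching files in the subfolders. With a very large number of matches, xargs may split them across several awk runs, which this sketch does not handle.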



Answer 3:


Assuming the input files are named 1, 2, ...:

   gawk '{ 
        names[$1]=$1
        data[$1,ARGIND]=$2
      } 
      END {
        for (i in names) {
           printf("%s\t",i)
           for (x=1;x<=ARGIND;x++) {
             printf("%s\t", data[i,x])
             }
           print ""
           }
       }' [0-9]* > combined_data.txt

Results:

A   554 
B   13  
C   634 TRUE
D   84  
E   9   TRUE
F       FALSE



Answer 4:


Another solution using join, bash, awk and tr, provided file1, file2, file3, etc. are sorted:

multijoin.sh

#!/bin/bash
function __t { 
  join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" | 
  awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}'; 
}
CMD="cat '$1'"
for i in `seq 2 $#`; do
  CMD="$CMD | __t '${@:$i:1}'";
done
eval "$CMD | tr '_' '\t' | tr ' ' '\t'";

Or, a recursive version:

#!/bin/bash
function __t { 
  join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" | 
  awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}'; 
}
function __r { 
  if [[ "$#" -gt 1 ]]; then
    __t "$1" | __r "${@:2}"; 
  else
    __t "$1"; 
  fi
}
__r "${@:2}" < "$1" | tr '_' '\t' | tr ' ' '\t'

NOTE: the data cannot contain the character _, because it is used as a temporary separator between the merged columns.

Running it gives:

./multijoin.sh file1 file2
A   554
B   13
C   634 TRUE
D   84
E   9   TRUE
F       FALSE

For example, if file3 contains

A    111
D    222
E    333

then running

./multijoin.sh file1 file2 file3

gives:

A   554       111
B   13      
C   634 TRUE    
D   84        222
E   9   TRUE  333
F       FALSE   
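
Because join expects every input to be sorted on the join field, it can help to check that before running the script. A minimal sketch, assuming unique row IDs in column 1 and the same locale for sort and join; check_sorted is a hypothetical helper, not part of multijoin.sh:

```shell
#!/bin/sh
# check_sorted FILE...: succeed only if every file is sorted on column 1.
# Hypothetical pre-flight check for the join-based scripts.
check_sorted() {
    for f in "$@"; do
        # sort -c only checks the order; it writes no sorted output
        if ! sort -c -k1,1 "$f" 2>/dev/null; then
            echo "not sorted on column 1: $f" >&2
            return 1
        fi
    done
}
```

Used as check_sorted file1 file2 file3 && ./multijoin.sh file1 file2 file3, the join only runs when all inputs pass. A file that fails can be fixed with sort -k1,1 file > file.sorted.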


Source: https://stackoverflow.com/questions/33350432/awk-bash-combining-arbitrary-number-of-files
