Combine files consisting of two columns

梦谈多话 2021-01-14 10:26

I have multiple files that look like this:

file1:

rsRNA-2312-n    2
rsRNA-6508-n    2
rsRNA-6382-n    10
rsRNA-951-n 0
rsRNA-6330-n    4
rsRNA-6330-         


        
2 answers
  • 2021-01-14 10:29

    Another solution, using GNU awk (arrays of arrays require gawk 4.0 or later):

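    awk '
    # d[identifier][file] holds the count from that file (arrays of arrays need GNU awk 4+).
    { d[$1][FILENAME] = $2 }
    END {
        # Header row: one column per file named on the command line.
        line = "identifier"
        for (i = 1; i in ARGV; i++) line = line OFS ARGV[i]
        print line
        # One row per identifier; files that lack it get 0.
        for (i in d) {
            line = i
            for (j = 1; j in ARGV; j++) {
                if (ARGV[j] in d[i]) line = line OFS d[i][ARGV[j]]
                else line = line OFS 0
            }
            print line
        }
    }' file1 file2 file3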
    

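    This produces the following (row order may differ, because for (i in d) visits keys in an unspecified order):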

    identifier file1 file2 file3
    rsRNA-946-n 9 0 0
    rsRNA-4945-n 0 7 0
    rsRNA-1385-n 3 0 0
    rsRNA-2312-n 2 0 0
    rsRNA-6382-n 10 0 0
    rsRNA-951-n 0 0 0
    rsRNA-6490-n 0 2 0
    rsRNA-6330-n 11 0 0
    rsRNA-5301-n 0 7 0
    rsRNA-5058-n 0 0 0
    rsRNA-4946-n 0 0 1
    rsRNA-2445-n 0 9 0
    rsRNA-552-n 0 2 1
    rsRNA-6487-n 0 0 0
    rsRNA-4099-n 0 0 0
    rsRNA-849-n 0 0 2
    rsRNA-3302-n 0 0 2
    rsRNA-6508-n 2 0 0
    
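The script above relies on GNU awk arrays of arrays (d[$1][FILENAME]). Where only a POSIX awk is available, roughly the same table could be built with composite keys joined by SUBSEP. The following is a minimal sketch under that assumption, not the answerer's method; the sample data, the temporary directory, and the names files, ids, seen, and val are invented for illustration.

```shell
# Hypothetical sample data (not the asker's full files): two tab-separated columns.
workdir=$(mktemp -d)
printf 'rsRNA-2312-n\t2\nrsRNA-951-n\t0\n' > "$workdir/file1"
printf 'rsRNA-552-n\t2\nrsRNA-2312-n\t5\n' > "$workdir/file2"

merged=$(cd "$workdir" && awk '
    # Remember each file name once, in the order awk reads the files.
    FNR == 1 { files[++nfiles] = FILENAME }
    {
        # Remember each identifier once, in first-seen order.
        if (!($1 in seen)) { seen[$1] = 1; ids[++nids] = $1 }
        # The composite key (identifier SUBSEP file) emulates a 2-D array.
        val[$1, FILENAME] = $2
    }
    END {
        # Header row: one column per file that contained at least one line.
        line = "identifier"
        for (f = 1; f <= nfiles; f++) line = line OFS files[f]
        print line
        # One row per identifier; files that lack it get 0.
        for (i = 1; i <= nids; i++) {
            line = ids[i]
            for (f = 1; f <= nfiles; f++)
                line = line OFS (((ids[i], files[f]) in val) ? val[ids[i], files[f]] : 0)
            print line
        }
    }' file1 file2)

printf '%s\n' "$merged"
rm -rf "$workdir"
```

Unlike the gawk version, this sketch prints rows in first-seen order, and an empty input file gets no column because it never triggers FNR == 1.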
  • 2021-01-14 10:45

    Using a shell loop to manipulate text is always the wrong approach. Just use awk; text processing is what it was designed for. The script below uses GNU awk 4.* for true multidimensional arrays, ARGIND, and PROCINFO["sorted_in"]:

    $ cat tst.awk
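    {
        # Key each row on the numeric part of the identifier, e.g. 2312 in rsRNA-2312-n.
        split($1,a,/-/)
        key = a[2]
        key2name[key] = $1
        key2val[key][ARGIND] = $2    # ARGIND is the index in ARGV of the current file
    }
    END {
        # Header row: one column per input file.
        printf "identifier"
        for (fileNr=1;fileNr<=ARGIND;fileNr++) {
            printf "%s%s", OFS, ARGV[fileNr]
        }
        print ""

        # Visit the keys in ascending numeric order.
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (key in key2name) {
            printf "%s", key2name[key]
            for (fileNr=1;fileNr<=ARGIND;fileNr++) {
                # Files that lack this key get 0.
                printf "%s%s", OFS, (fileNr in key2val[key] ? key2val[key][fileNr] : 0)
            }
            print ""
        }
    }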
    

    Running it on the three files gives:

    $ awk -f tst.awk file1 file2 file3
    identifier file1 file2 file3
    rsRNA-552-n 0 2 1
    rsRNA-849-n 0 0 2
    rsRNA-946-n 9 0 0
    rsRNA-951-n 0 0 0
    rsRNA-1385-n 3 0 0
    rsRNA-2312-n 2 0 0
    rsRNA-2445-n 0 9 0
    rsRNA-3302-n 0 0 2
    rsRNA-4099-n 0 0 0
    rsRNA-4945-n 0 7 0
    rsRNA-4946-n 0 0 1
    rsRNA-5058-n 0 0 0
    rsRNA-5301-n 0 7 0
    rsRNA-6330-n 11 0 0
    rsRNA-6382-n 10 0 0
    rsRNA-6487-n 0 0 0
    rsRNA-6490-n 0 2 0
    rsRNA-6508-n 2 0 0
    

    I added the slight extra complexity of keying on the numeric part of the first field so that the output rows can be sorted numerically on that sub-field.
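If gawk's PROCINFO["sorted_in"] is not available, the same numeric ordering could be applied after the fact with sort(1). This is a minimal sketch, assuming the identifiers always have the form prefix-number-suffix; the table contents and the variable names table, header, and sorted are made up for illustration.

```shell
# Hypothetical merged table, shaped like the output of the scripts above.
table='identifier file1 file2
rsRNA-2312-n 2 5
rsRNA-951-n 0 0
rsRNA-552-n 0 2'

# Pass the header through unchanged, then sort the remaining rows numerically
# on the second dash-separated field (the number inside the identifier).
sorted=$(
    printf '%s\n' "$table" | {
        IFS= read -r header
        printf '%s\n' "$header"
        sort -t- -k2,2n
    }
)
printf '%s\n' "$sorted"
```

Sorting outside awk keeps the merge script portable, at the cost of an extra process and of handling the header line separately.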
