Calculate mean of each column ignoring missing data with awk

死守一世寂寞 2021-01-15 00:43

I have a large tab-separated data table with thousands of rows and dozens of columns, and it has missing data marked as "na". For example,

na  0.93    na  0         


        
2 Answers
  • 2021-01-15 01:18

    This is obscure, but it works for your example:

    awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
    

    EDIT: Here is how it works:

    awk '{for(i=1; i<=NF; i++){ #for each column
            sum[i] += $i;       #add the value to the "sum" array ("na" counts as 0)
            if($i != "na"){     #if value is not "na"
               count[i]+=1}     #increment the column "count" (endif)
            }                   #endfor
         }                      #end of the per-line block
        END {                    #at the end
         for(i=1; i<=NF; i++){  #for each column
            if(count[i]!=0){        #if the column count is not 0
                v = sum[i]/count[i] #then calculate the column mean (here represented with "v")
            }else{                  #else (if column count is 0)
                v = 0               #then let mean be 0 (note: you can set this to be "na")
            };                      #endif col count is not 0
            if(i<NF){               #if the column is before the last column
                printf "%f\t",v     #print mean + TAB
            }else{                  #else (if it is the last column)
                print v}            #print mean + NEWLINE (endif)
            };                      #endfor
         }' input.txt               #end of the END block (note: input.txt is the input file)
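
    If you would rather see "na" for a column with no data, and the same number format in every column, the script above could be adapted along these lines. This is only a sketch: `col_means` is a made-up helper name, the second sample row is invented for illustration, and it assumes tab-separated input with "na" as the only missing-data marker.

    ```shell
    # Hypothetical helper: per-column mean of tab-separated input, skipping "na".
    col_means() {
        awk -F'\t' '
            { if (NF > ncol) ncol = NF            # remember the widest row seen
              for (i = 1; i <= NF; i++)
                  if ($i != "na") { sum[i] += $i; count[i]++ } }
            END {
                for (i = 1; i <= ncol; i++) {
                    sep = (i < ncol) ? "\t" : "\n"   # TAB between columns, NEWLINE at the end
                    if (count[i]) printf "%f%s", sum[i] / count[i], sep
                    else          printf "na%s", sep # column held only "na"
                }
            }' "$@"
    }

    # Made-up sample: two rows, three columns, the third column is all "na".
    printf 'na\t0.93\tna\n1.5\t0.07\tna\n' | col_means
    ```

    It can also be called on a file, as in `col_means input.txt`. The sketch counts columns itself instead of reading NF inside END, so it does not depend on how a given awk treats NF after the last record, and it tolerates rows of different widths.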
    


  • 2021-01-15 01:31

    A possible solution:

    awk -F"\t" '{for(i=1; i <= NF; i++)
                    {if($i == $i+0){sum[i]+=$i; denom[i] += 1;}}}
                END{for(i=1; i<= NF; i++){line=line""sum[i]/(denom[i]?denom[i]:1)FS} 
                    print line}' inputFile
    

    The output for the given data:

    0.973333    0.9825  0   0.7425  0.01    0.7125
    

    Note that the third column contains only "na" and the output is 0. If you want the output to be na, then change the END{...}-block to:

    END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS} print line}'
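
    The `$i == $i+0` test is what lets this version skip missing data without naming "na": a field that looks like a number is compared numerically and equals itself plus zero, while any other field is compared as a string and fails the test. A small sketch to see this, where `classify` and the sample values are made up for illustration:

    ```shell
    # Hypothetical demo: show which field values pass the numeric test.
    classify() {
        printf '0.93\nna\n-2\n12abc\n' |
        awk -F'\t' '{ print $1, ($1 == $1+0 ? "numeric" : "skipped") }'
    }

    classify
    # 0.93 numeric
    # na skipped
    # -2 numeric
    # 12abc skipped
    ```

    This means any non-numeric marker is ignored, not only "na". Values such as "nan" or "inf" may be handled differently by different awk implementations, so it is worth checking them if your data contains any.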
