How to selectively remove columns and rows with bash or python

独厮守ぢ 2021-01-16 15:10

UPDATE

I suspect that the input and desired output data I initially put in wasn't exactly the same as what I have with respect to whitespace. I'

1 Answer
  • 2021-01-16 15:45

    You don't really want to load the input data into memory, because it's so large. Instead, a streaming approach will be faster, and for this awk is well suited:

    #!/usr/bin/awk -f
    
    BEGIN {
        FS = "\t";
        OFS = FS;   # note: the printf calls below separate fields with a space, not OFS
    }
    
    NR == 1 {
        # collect sample names
        for (i=1; i <= NF; i++) {
            sample[i] = $i
        }
    }
    
    NR == 2 {
        # first four columns are always the same
        cols[1] = 1
        cols[2] = 3
        cols[3] = 4
        cols[4] = 5
        ncols = 4   # explicit count; length(array) is not available in every awk
        printf "%s %s %s %s ", sample[1], $3, $4, $5
    
        # dynamic columns (in practice: 2,6,10,...)
        for (i=1; i <= NF; i++) {
            if ($i == "Beta_value") {
                cols[++ncols] = i
                printf "%s ", sample[i]
            }
        }
        printf "\n"
    }
    
    NR >= 3 {
        # print the selected columns of each data row
        for (i=1; i <= ncols; i++) {
            printf "%s ", $(cols[i])
        }
        printf "\n"
    }
    

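    This gives your desired output. If you want more speed, consider using awk only to print the column numbers (which requires reading just the two header rows), then cut to extract them. This should be faster because no interpreted code needs to run for each data row. Two caveats: cut always emits fields in input order, whatever order you give to -f, so the Beta_value columns stay interleaved with the fixed ones instead of following them; and tab is already cut's default delimiter, while -d '\t' passes a two-character string that cut typically rejects. For the sample data in the question, the cut command you need to print all the data rows is something like this:

    cut -f 1,2,3,4,5,6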
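
    Since the question also allows Python, the column selection can be sketched there as well. This is a minimal sketch, not a tested replacement for the awk script: the function names (select_columns, cut_field_list, field, filter_rows) and the demo rows are made up for illustration, the fixed columns 1,3,4,5 and the "Beta_value" label are copied from the awk script above, and the output is tab-separated rather than space-separated. Because cut emits fields in input order, cut_field_list sorts the numbers before joining them.

```python
def select_columns(labels_row, label="Beta_value"):
    """1-based column numbers, in the order the awk script prints them."""
    fixed = [1, 3, 4, 5]  # fixed columns, copied from the awk script
    dynamic = [i for i, v in enumerate(labels_row, start=1) if v == label]
    return fixed + dynamic


def cut_field_list(cols):
    """Format columns for `cut -f`; cut emits fields in input order, so sort."""
    return ",".join(str(c) for c in sorted(set(cols)))


def field(row, col):
    """Return the 1-based field, or "" when the row is too short (as awk does)."""
    return row[col - 1] if col <= len(row) else ""


def filter_rows(lines, label="Beta_value"):
    """Lazily yield the selected columns of each tab-separated line."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    names = next(rows)   # row 1: sample names
    labels = next(rows)  # row 2: per-column labels
    cols = select_columns(labels, label)
    # Header: first sample name, labels of columns 3-5, then the sample
    # name above each Beta_value column.
    header = [field(names, 1)]
    header += [field(labels, c) for c in cols[1:4]]
    header += [field(names, c) for c in cols[4:]]
    yield "\t".join(header)
    for row in rows:
        yield "\t".join(field(row, c) for c in cols)


# Made-up sample input, NOT the data from the question.
demo = [
    "Hybridization REF\tS1\tS1\tS1\tS1\tS2\n",
    "Composite Element REF\tBeta_value\tGene_Symbol\tChromosome\tGenomic_Coordinate\tBeta_value\n",
    "cg0001\t0.5\tGENE1\t1\t12345\t0.7\n",
]
print(cut_field_list(select_columns(demo[1].rstrip("\n").split("\t"))))
for out in filter_rows(demo):
    print(out)
```

    For a real file, pass an open file handle (or sys.stdin) to filter_rows in place of the demo list. It reads one line at a time, so the input is never held in memory as a whole.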
    