UPDATE
I suspect that the input and desired output data I initially put in wasn't exactly the same as what I have with respect to whitespace.
You don't really want to load the input data into memory, because it's so large. Instead, a streaming approach will be faster, and for this awk
is well suited:
#!/usr/bin/awk -f
BEGIN {
    FS = "\t";
    OFS = FS;
}
NR == 1 {
    # collect sample names
    for (i=1; i <= NF; i++) {
        sample[i] = $i
    }
}
NR == 2 {
    # first four columns are always the same
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    printf "%s %s %s %s ", sample[1], $3, $4, $5
    # dynamic columns (in practice: 2,6,10,...)
    for (i=1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[length(cols)+1] = i
            printf "%s ", sample[i]
        }
    }
    printf "\n"
}
NR >= 3 {
    # print cols from data row
    for (i=1; i <= length(cols); i++) {
        printf "%s ", $cols[i]
    }
    printf "\n"
}
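This gives your desired output. If you want more speed, you might consider using awk
simply to print the column numbers (which only requires reading the two header rows), then cut
to actually print them. This will be faster because no interpreted code needs to run for each data row. For the sample data in the question, the cut
command you need to print all the data rows is something like this:
cut -f 1,3,4,5,2,6
Tab is already cut's default delimiter, so no -d option is needed (with -d '\t', cut would receive the two characters \ and t rather than a tab). Note that cut prints the selected fields in input order regardless of the order of the list, so this yields columns 1 through 6 as they appear in the file. It also passes the two header rows through unless you strip them first.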
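The awk-then-cut idea could be wired together roughly as follows. This is only a sketch: the function names beta_columns and extract_data_rows are made up for illustration, and it assumes the same layout as the script above (fixed columns 1, 3, 4, 5 plus every column whose second-row header is Beta_value).

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated list of the columns to keep.
# awk reads only the two header rows of the file, then exits.
beta_columns() {
    awk -F '\t' '
        NR == 2 {
            list = "1,3,4,5"            # fixed columns, as in the awk script
            for (i = 1; i <= NF; i++)
                if ($i == "Beta_value")
                    list = list "," i   # dynamic Beta_value columns
            print list
            exit                        # skip all data rows
        }
    ' "$1"
}

# Hypothetical helper: print the selected columns of the data rows only.
# tail drops the two header rows; cut splits on tabs by default and
# emits the selected fields in input order, tab-separated.
extract_data_rows() {
    tail -n +3 "$1" | cut -f "$(beta_columns "$1")"
}
```

You would call it as extract_data_rows input.tsv, where input.tsv stands in for your own file. The header line is not produced here, and the columns come out in file order, so if you need the exact order or header of the awk script's output you would have to add that separately.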