An efficient way to transpose a file in Bash

时光说笑 2020-11-22 03:30

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it.

29 answers
  • 2020-11-22 03:51

    Here is a Bash one-liner that simply converts each line to a column and pastes the columns together:

    echo '' > tmp1;  \
    while read -r l ; \
                do    paste tmp1 <(echo $l | tr -s ' ' '\n') > tmp2; \
                      cp tmp2 tmp1; \
                done < m.txt; \
    cut -f2- tmp1
    

    m.txt:

    0 1 2
    4 5 6
    7 8 9
    10 11 12
    
    1. Creates tmp1 containing a single empty line, so it is not empty.

    2. Reads each line and turns it into a column using tr.

    3. Pastes the new column onto the tmp1 file.

    4. Copies the result back into tmp1.

    5. Prints tmp1 without its first column, which is the empty one left over from step 1.

    PS: I really wanted to use I/O descriptors instead of temporary files but couldn't get them to work.
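
    The steps above can also be sketched without temporary files by keeping the accumulated columns in a shell variable. This is an illustrative variant, not the file-descriptor approach mentioned in the PS; the function name transpose_paste is made up for the example, and the input is assumed to be separated by spaces or tabs with no empty fields.

```shell
#!/bin/bash
# Transpose whitespace-separated input by turning each row into a column
# and pasting it onto the columns collected so far (held in a variable).
transpose_paste() {
    local acc="" col line first=1
    while read -r line; do
        # One field per line; -s squeezes runs of separators.
        col=$(tr -s ' \t' '\n' <<< "$line")
        if ((first)); then
            acc=$col            # first row: nothing to paste onto yet
            first=0
        else
            # Join the existing columns and the new column side by side.
            acc=$(paste <(printf '%s\n' "$acc") <(printf '%s\n' "$col"))
        fi
    done < "$1"
    printf '%s\n' "$acc"
}
```

    Usage would be `transpose_paste m.txt`. Like the original, it calls paste once per input line, so it suits small files better than huge ones.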

  • 2020-11-22 03:52

    The transpose project on SourceForge is a coreutils-like C program for exactly that.

    gcc transpose.c -o transpose
    ./transpose -t input > output #works with stdin, too.
    
  • 2020-11-22 03:52

    Another bash variant

    $ cat file 
    XXXX    col1    col2    col3
    row1    0       1       2
    row2    3       4       5
    row3    6       7       8
    row4    9       10      11
    

    Script

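    #!/bin/bash
    
    I=0                                   # number of rows read so far
    while read -r line; do
        i=0
        # store each field of row I in its own array A<I>
        for item in $line; do printf -v "A$I[$i]" '%s' "$item"; ((i++)); done
        ((I++))
    done < file
    ncols=$i                              # field count of the last row
    
    for ((i = 0; i < ncols; i++)); do
        J=0
        while ((J<I)); do
            arr="A$J[$i]"                 # indirect reference to A<J>[i]
            printf '%s\t' "${!arr}"
            ((J++))
        done
        echo
    done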
    

    Output

    $ ./test 
    XXXX    row1    row2    row3    row4    
    col1    0       3       6       9   
    col2    1       4       7       10  
    col3    2       5       8       11
    
  • 2020-11-22 03:53
    awk '
    { 
        for (i=1; i<=NF; i++)  {
            a[NR,i] = $i
        }
    }
    NF>p { p = NF }
    END {    
        for(j=1; j<=p; j++) {
            str=a[1,j]
            for(i=2; i<=NR; i++){
                str=str" "a[i,j];
            }
            print str
        }
    }' file
    

    output

    $ more file
    0 1 2
    3 4 5
    6 7 8
    9 10 11
    
    $ ./shell.sh
    0 3 6 9
    1 4 7 10
    2 5 8 11
    

    Performance compared with Jonathan's Perl solution on a 10,000-line file:

    $ head -5 file
    1 0 1 2
    2 3 4 5
    3 6 7 8
    4 9 10 11
    1 0 1 2
    
    $  wc -l < file
    10000
    
    $ time perl test.pl file >/dev/null
    
    real    0m0.480s
    user    0m0.442s
    sys     0m0.026s
    
    $ time awk -f test.awk file >/dev/null
    
    real    0m0.382s
    user    0m0.367s
    sys     0m0.011s
    
    $ time perl test.pl file >/dev/null
    
    real    0m0.481s
    user    0m0.431s
    sys     0m0.022s
    
    $ time awk -f test.awk file >/dev/null
    
    real    0m0.390s
    user    0m0.370s
    sys     0m0.010s
    

    EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

    Maybe this version, with more explicit variable names, will help answer some of the questions below and clarify what the script is doing. It also uses tabs as the separator, which the OP originally asked for, so it handles empty fields, and it coincidentally tidies up the output for this particular case.

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    {
        for (rowNr=1;rowNr<=NF;rowNr++) {
            cell[rowNr,NR] = $rowNr
        }
        maxRows = (NF > maxRows ? NF : maxRows)
        maxCols = NR
    }
    END {
        for (rowNr=1;rowNr<=maxRows;rowNr++) {
            for (colNr=1;colNr<=maxCols;colNr++) {
                printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
            }
        }
    }
    
    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11
    

    The above solutions will work in any awk (except old, broken awk of course - there YMMV).

    The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    { printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
    ENDFILE {
        print ""
        if (ARGIND < NF) {
            ARGV[ARGC] = FILENAME
            ARGC++
        }
    }
    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11
    

    which uses almost no memory but reads the input file once per field on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line. It uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
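
    The same trade-off (one pass over the input per output row, with almost nothing held in memory) can be sketched with standard tools instead of awk. This is an illustrative variant, not the FNR==1/END awk script mentioned above; the function name transpose_multipass is made up for the example, and it assumes a regular tab-separated file with the same number of fields on every line.

```shell
#!/bin/bash
# Transpose a tab-separated file using one pass over the input per column.
# Memory use stays small, at the cost of reading the file once per field.
transpose_multipass() {
    local file=$1 ncols c
    # Field count of the first line; every line is assumed to match it.
    ncols=$(head -n 1 "$file" | awk -F'\t' '{ print NF }')
    for ((c = 1; c <= ${ncols:-0}; c++)); do
        # cut extracts column c; paste -s joins its lines with tabs.
        cut -f "$c" "$file" | paste -s -
    done
}
```

    Usage would be `transpose_multipass file > output`. Because the file is reopened for every column, it has to be a regular file rather than a pipe.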

  • 2020-11-22 03:55

    GNU datamash is well suited to this problem: it takes only one line of code and can potentially handle arbitrarily large files. The -W option treats runs of whitespace as the field separator.

    datamash -W transpose infile > outfile
    
  • 2020-11-22 03:55

    I was just looking for a similar Bash transpose but with support for padding. Here is the script I wrote, based on fgm's solution, that seems to work. If it can be of help...

    #!/bin/bash
    declare -a array=( )                      # all cells, flattened row by row
    declare -a ncols=( )                      # number of cells in each input row
    
    SEPARATOR="\t"
    PADDING=""
    MAXROWS=0                                 # widest input row = number of output rows
    index=0
    indexCol=0
    while read -r -a line; do
        ncols[indexCol]=${#line[@]}
        ((indexCol++))
        if [ "${#line[@]}" -gt "$MAXROWS" ]; then
            MAXROWS=${#line[@]}
        fi
        for (( COUNTER = 0; COUNTER < ${#line[@]}; COUNTER++ )); do
            array[index]=${line[COUNTER]}
            ((index++))
        done
    done < "$1"
    
    for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
        COUNTER=$ROW                          # position of cell ROW in the first input row
        for (( indexCol = 0; indexCol < ${#ncols[@]}; indexCol++ )); do
            if [ "$ROW" -ge "${ncols[indexCol]}" ]; then
                printf '%s' "$PADDING"        # this input row is too short: pad
            else
                printf '%s' "${array[COUNTER]}"
            fi
            if [ $((indexCol + 1)) -lt "${#ncols[@]}" ]; then
                printf '%b' "$SEPARATOR"      # %b expands the \t escape
            fi
            COUNTER=$(( COUNTER + ncols[indexCol] ))
        done
        printf '\n'
    done
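
    To see what padding does on ragged input, here is a minimal awk sketch of the same idea (it is not the Bash script above). The function name transpose_padded and its second argument, the filler string, are assumptions made for illustration, and the input is taken to be tab-separated.

```shell
#!/bin/bash
# Transpose a tab-separated file whose rows may differ in length,
# filling the missing cells with a filler string (empty by default).
transpose_padded() {
    awk -v pad="${2:-}" '
    BEGIN { FS = OFS = "\t" }
    {
        # Remember every cell and track the widest row seen so far.
        for (i = 1; i <= NF; i++) cell[NR, i] = $i
        if (NF > maxnf) maxnf = NF
    }
    END {
        # Each input column becomes one output row.
        for (i = 1; i <= maxnf; i++)
            for (r = 1; r <= NR; r++)
                printf "%s%s", ((r, i) in cell ? cell[r, i] : pad), (r < NR ? OFS : ORS)
    }' "$1"
}
```

    For a file with the two rows `a b c` and `d` (tab-separated), `transpose_padded file -` prints the rows `a d`, `b -` and `c -`, with `-` standing in for the cells the short row lacks.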
    