An efficient way to transpose a file in Bash

时光说笑 2020-11-22 03:30

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only Bash commands, so that it ends up looking like this:

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

29 Answers
  • 2020-11-22 03:37

    A hackish Perl solution can look like this. It's nice because it doesn't load the whole file into memory: it writes each input row out as an intermediate temp file (one value per line) and then stitches them together with the all-wonderful paste.

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    my $counter;
    open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
    while (my $line = <INPUT>) {
        chomp $line;
        my @array = split ("\t",$line);
        # row $. of the input becomes column file "temp$." ($. is the current line number)
        open OUTPUT, ">temp$." or die ("unable to open output file!");
        print OUTPUT join ("\n",@array);
        close OUTPUT;
        $counter=$.;
    }
    close INPUT;
    
    # paste files together
    my $execute = "paste ";
    foreach (1..$counter) {
        $execute .= "temp$_ ";    # append each per-row temp file, in row order
    }
    $execute.="> $ARGV[1]";
    system $execute;
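
    A hypothetical invocation (the script and file names here are my own, not the author's). Note that the script leaves its tempN files behind, so you may want to clean them up afterwards:

      perl transpose.pl input.tsv transposed.tsv
      rm -f temp[0-9]*    # remove the per-row temp files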
    
  • 2020-11-22 03:39

    Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

    #!/bin/perl -w
    #
    # SO 1729824
    
    use strict;
    
    my(%data);          # main storage
    my($maxcol) = 0;
    my($rownum) = 0;
    while (<>)
    {
        my(@row) = split /\s+/;
        my($colnum) = 0;
        foreach my $val (@row)
        {
            $data{$rownum}{$colnum++} = $val;
        }
        $rownum++;
        $maxcol = $colnum if $colnum > $maxcol;
    }
    
    my $maxrow = $rownum;
    for (my $col = 0; $col < $maxcol; $col++)
    {
        for (my $row = 0; $row < $maxrow; $row++)
        {
            printf "%s%s", ($row == 0) ? "" : "\t",
                    defined $data{$row}{$col} ? $data{$row}{$col} : "";
        }
        print "\n";
    }
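
    Since the script reads input via the <> operator, it accepts a file argument or stdin; a possible invocation (the output file name is assumed):

      perl transpose.pl xxx > transposed.txt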
    

    With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.


    Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:

    Osiris JL: time gawk -f tr.awk xxx  > /dev/null
    
    real    0m0.367s
    user    0m0.279s
    sys 0m0.085s
    Osiris JL: time perl -f transpose.pl xxx > /dev/null
    
    real    0m0.138s
    user    0m0.128s
    sys 0m0.008s
    Osiris JL: time awk -f tr.awk xxx  > /dev/null
    
    real    0m1.891s
    user    0m0.924s
    sys 0m0.961s
    Osiris-2 JL: 
    

    Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.

  • 2020-11-22 03:40

    Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this one. These one-liners are for those who like *nix software tools, for whatever reason. In rare cases (e.g. scarce I/O and memory), these snippets can actually be faster than some of the top answers.

    Call the input file foo.

    1. If we know foo has four columns:

      for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
      
    2. If we don't know how many columns foo has:

      n=$(head -n 1 foo | wc -w)
      for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
      

      xargs has a size limit and would therefore do incomplete work on a long file. The size limit is system dependent, e.g.:

      { timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
      

      Maximum length of command we could actually use: 2088944

    3. tr & echo:

      for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
      

      ...or, if the number of columns is unknown (a reusable function combining this with item 2 follows the list):

      n=$(head -n 1 foo | wc -w)
      for f in $(seq 1 $n); do 
          cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
      done
      
    4. Using set, which, like xargs, has similar command-line size limitations:

      for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done
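
    Putting items 2 and 3 together, a sketch of a reusable shell function (the name transpose and the space-delimited input are my assumptions):

      transpose() {
          local n f
          n=$(head -n 1 "$1" | wc -w)               # column count from the first row
          for f in $(seq 1 "$n"); do
              cut -d ' ' -f "$f" "$1" | tr '\n' ' '  # emit column f as one row
              echo                                   # newline to terminate the row
          done
      }
      transpose foo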
      
  • 2020-11-22 03:41

    Another awk solution; the input size is limited only by the amount of memory you have.

    awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
        END{ for (i in RtoC) print RtoC[i] }' infile
    

    This concatenates the fields at the same position in every row, so that in END the input's first column has become the output's first row, the second column the second row, and so on. It will output:

    X row1 row2 row3 row4
    column1 0 3 6 9
    column2 1 4 7 10
    column3 2 5 8 11
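
    Note that the for (i in RtoC) traversal order is unspecified in POSIX awk, so in principle the rows could come out shuffled. With GNU awk the order can be pinned down via PROCINFO (a gawk-specific sketch):

    gawk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
        END{ PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk only: numeric index order
             for (i in RtoC) print RtoC[i] }' infile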
    
  • 2020-11-22 03:43

    A Python solution:

    python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output
    

    The above is based on the following:

    import sys
    
    for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
        print(' '.join(c))
    

    This code does assume that every line has the same number of columns (no padding is performed).
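
    If lines can have differing numbers of fields, a padded variant using itertools.zip_longest (the empty-string fill value is my choice):

    python -c "import sys, itertools; print('\n'.join(' '.join(c) for c in itertools.zip_longest(*(l.split() for l in sys.stdin.readlines() if l.strip()), fillvalue='')))" < input > output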

  • 2020-11-22 03:45

    I normally use this little awk snippet for this requirement:

      awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
            max=(max<NF?NF:max)}
            END {for (i=1; i<=max; i++)
                  {for (j=1; j<=NR; j++) 
                      printf "%s%s", a[i,j], (j==NR?RS:FS)
                  }
            }' file
    

    This just loads all the data into a two-dimensional array a[column,line] and then prints each stored column back as a line, so that it transposes the given input.

    It needs to keep track of the maximum number of columns in the input file, since that becomes the number of lines to print back.
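
    Run against the sample file from the question, it prints:

      X row1 row2 row3 row4
      column1 0 3 6 9
      column2 1 4 7 10
      column3 2 5 8 11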
