An efficient way to transpose a file in Bash

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like t

29 answers

深忆病人

2020-11-22 03:37

A hackish Perl solution could look like this. It is nice because it does not load the whole file into memory: it writes each input row to an intermediate temp file (one field per line) and then uses the all-wonderful paste to join them.

#!/usr/bin/perl
use warnings;
use strict;

my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
    chomp $line;
    my @array = split ("\t",$line);
    open OUTPUT, ">temp$." or die ("unable to open output file!");
    print OUTPUT join ("\n",@array);
    close OUTPUT;
    $counter=$.;
}
close INPUT;

# paste files together
my $execute = "paste ";
foreach (1..$counter) {
    # $_ is the loop index; "temp$counter" would name the last file every time
    $execute.="temp$_ ";
}
$execute.="> $ARGV[1]";
system $execute;
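The same two-pass idea can be sketched in Python without shelling out to paste. This is only an illustration under stated assumptions: the function name `transpose_via_temp_files` and the `row<N>` temp file names are hypothetical, every row is assumed to have the same number of fields, and the number of input rows must stay below the operating system's limit on open files.

```python
import os
import tempfile
from contextlib import ExitStack


def transpose_via_temp_files(src_path, dst_path, sep="\t"):
    """Transpose src_path into dst_path, reading one input row at a time."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Pass 1: write each input row to its own temp file, one field per line.
        row_files = []
        with open(src_path) as src:
            for rownum, line in enumerate(src, start=1):
                path = os.path.join(tmpdir, f"row{rownum}")
                with open(path, "w") as out:
                    out.write("\n".join(line.rstrip("\n").split(sep)) + "\n")
                row_files.append(path)

        # Pass 2: read one line from every temp file in lockstep, like paste(1).
        # Assumption: all rows have equal length (zip stops at the shortest file).
        with ExitStack() as stack, open(dst_path, "w") as dst:
            handles = [stack.enter_context(open(p)) for p in row_files]
            for fields in zip(*handles):
                dst.write(sep.join(f.rstrip("\n") for f in fields) + "\n")
```

Calling `transpose_via_temp_files("in.tsv", "out.tsv")` writes the transposed table to `out.tsv`; the temp directory is removed automatically, which the Perl script above does not do for its `temp*` files.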

独厮守ぢ

2020-11-22 03:39

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

#!/bin/perl -w
#
# SO 1729824

use strict;

my(%data);          # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}

my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
                defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}

With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.

Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:

Osiris JL: time gawk -f tr.awk xxx  > /dev/null

real    0m0.367s
user    0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null

real    0m0.138s
user    0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx  > /dev/null

real    0m1.891s
user    0m0.924s
sys 0m0.961s
Osiris-2 JL:

Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.

半阙折子戏

2020-11-22 03:40
Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. a faster one), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reason. In rare cases (e.g. scarce IO and memory), these snippets can actually be faster than some of the top answers.

Call the input file foo.
1. If we know foo has four columns:
```
for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
```
2. If we don't know how many columns foo has:
```
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
```
  xargs has a command-line size limit, so with a long file it would split one output row across several echo calls. The limit is system dependent, e.g.:
```
{ timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
```
  Maximum length of command we could actually use: 2088944
3. tr & echo:
```
for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
```
  ...or if the number of columns is unknown:
```
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n); do 
    cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
done
```
4. Using set, which, like xargs, is subject to command-line size limits:
```
for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done
```
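The one-pass-per-column idea behind these cut loops can be sketched in Python as follows. This is a rough illustration, not the answer's method: the name `transpose_by_column_passes` is hypothetical, the column count is taken from the first line (as with `head -n 1 foo | wc -w`), and the default separator is a single space to match the `cut -d ' '` calls.

```python
def transpose_by_column_passes(path, sep=" "):
    """Yield transposed rows, re-reading the file once per column."""
    # Determine the column count from the first line only (an assumption).
    with open(path) as f:
        first = f.readline()
    ncols = len(first.rstrip("\n").split(sep)) if first else 0

    for col in range(ncols):
        with open(path) as f:
            # Pad short rows so a missing field becomes an empty string.
            yield sep.join(
                (line.rstrip("\n").split(sep) + [""] * ncols)[col]
                for line in f
            )
```

Only one output row is held in memory at a time, at the cost of reading the input once per column, which is the same trade-off the shell loops make.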

死守一世寂寞

2020-11-22 03:41
Another awk solution; the input size is limited by the amount of memory you have.
```
awk '{ for (i=1; i<=NF; i++) RtoC[i] = ((i in RtoC) ? RtoC[i] FS $i : $i)
       if (NF > n) n = NF }   # track the widest row; "for (i in RtoC)" has no guaranteed order
    END{ for (i=1; i<=n; i++) print RtoC[i] }' infile
```
This appends the field at each position to a per-column string, and in END prints the result: the first column becomes the first row, the second column becomes the second row, and so on. It will output:
```
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
```

挽巷

2020-11-22 03:43

A Python solution:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))

This code assumes that every line has the same number of columns. No padding is performed, and zip silently stops at the shortest row.
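If rows may have different lengths, one possible variation is to pad short rows with `itertools.zip_longest` instead of `zip`. The sketch below is an assumption-laden illustration rather than part of the original answer: the function name `transpose_ragged` and the empty-string fill value are choices made here, and output is joined with tabs so that empty padded cells stay visible.

```python
import sys
from itertools import zip_longest


def transpose_ragged(lines, sep=None, fill=""):
    """Transpose whitespace-separated lines, padding short rows with `fill`."""
    # Skip blank lines, as the original one-liner does.
    rows = [line.split(sep) for line in lines if line.strip()]
    # zip_longest keeps going until the longest row is exhausted.
    return [list(col) for col in zip_longest(*rows, fillvalue=fill)]


def main():
    # Usage: python transpose_ragged.py < input > output
    for col in transpose_ragged(sys.stdin):
        print("\t".join(col))
```

Like the original, this reads the whole input into memory before printing anything.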

终归单人心

2020-11-22 03:45
I normally use this little awk snippet for this requirement:
```
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
        max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++) 
                  printf "%s%s", a[i,j], (j==NR?RS:FS)
              }
        }' file
```
This just loads all the data into a two-dimensional array a[column,line] and then prints it back with the column index in the outer loop and the line index in the inner loop, so that each input column becomes an output line.

This needs to keep track of the maximum number of columns in the input file, which is then used as the number of rows to print.
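For comparison, the same bookkeeping can be sketched in Python with a dictionary keyed by `(column, line)`. The name `transpose_with_index` is hypothetical, and filling missing cells with an empty string mirrors what awk prints for an unset array element; treat it as a sketch, not a tuned implementation.

```python
def transpose_with_index(lines, sep=None):
    """Transpose lines using a dict keyed by (column, line), 1-based like awk."""
    cells = {}
    max_cols = 0
    nrows = 0
    for nrows, line in enumerate(lines, start=1):
        fields = line.split(sep)
        for col, value in enumerate(fields, start=1):
            cells[col, nrows] = value
        # Remember the widest row; it becomes the number of output rows.
        max_cols = max(max_cols, len(fields))

    # Outer loop over columns, inner loop over lines: this is the transpose.
    return [
        [cells.get((col, row), "") for row in range(1, nrows + 1)]
        for col in range(1, max_cols + 1)
    ]
```

As with the awk version, everything is held in memory, so this suits inputs that fit comfortably in RAM.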