Command line pivot

Question

I've been hunting around the past few days for a set of command-line tools, or a Perl or awk script, that would let me very quickly transpose the following data:

Row|Col|Val
1|A|foo
1|B|bar
1|C|I have a real
2|A|bad
2|C|hangover

into this:

A|B|C
foo|bar|I have a real
bad||hangover

Note that there is only one value in the dataset for each "cell" (i.e., as with a spreadsheet, there aren't any duplicates of Row "1" Col "A").

I've tried various awk shell implementations for transposing data - but can't seem to get them working. One idea I had was to cut each "Col" value into a separate file, then use the "join" command line to put them back together by "Row" -- but there MUST be an easier way. I'm sure this is just incredibly simple to do - but I'm struggling a bit.

My input files have Cols A through G (mostly including variable length strings), and 10,000 Rows. If I can avoid loading everything into memory that would be a huge plus.

Beer-by-mail for anyone who's got the answer!

As always - many thanks in advance for your help.

Cheers,

Josh

p.s. - I'm a bit surprised that there isn't an out-of-the-box command line util for doing this very basic type of pivot/transposition operation. I looked at http://code.google.com/p/openpivot/ and at http://code.google.com/p/crush-tools/ both of which seem to require aggregate calcs.

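Answer 1:

I can do this in gawk, but not nawk, because the script relies on gawk's arrays of arrays (values[$1][$2]), which need gawk 4.0 or later.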

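#!/usr/local/bin/gawk -f

BEGIN {
  FS="|";
}

# Remember every row key and column key, and store each cell value.
{
  rows[$1]=1; cols[$2]=1; values[$1][$2]=$3;
}

END {
  # Header line: column names joined by "|".
  for (col in cols) {
    output=output sprintf("|%s", col);
  }
  print substr(output, 2);   # drop the leading "|"
  # One output line per row; cells with no value come out empty.
  for (row in rows) {
    output="";
    for (col in cols) {
      output=output sprintf("|%s", values[row][col]);
    }
    print substr(output, 2);
  }
}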

And it even works:

ghoti@pc $ cat data
1|A|foo
1|B|bar
1|C|I have a real
2|A|bad
2|C|hangover
ghoti@pc $ ./doit.gawk data
A|B|C
foo|bar|I have a real
bad||hangover
ghoti@pc $

I'm not sure how well this will work with 10000 rows, but I suspect if you've got the memory for it, you'll be fine. I can't see how you can avoid loading things into memory except by storing things in separate files which you'd later join, which is pretty much a manual implementation of virtual memory.
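
One case where the whole table does not have to sit in memory: if the input is already grouped by Row (as in the sample) and the column names are known up front (the question mentions A through G), each output line can be printed as soon as the Row value changes. What follows is only a minimal sketch under those two assumptions. pivot_stream is a hypothetical helper name, and the column list is passed as its first argument.

```shell
# Hypothetical helper: pivot_stream 'A|B|C' [file...]
# Assumes all lines for one Row are adjacent and the column names are known.
pivot_stream() {
  pivot_cols=$1
  shift
  awk -F'|' -v cols="$pivot_cols" '
    # Print the buffered row in column order, then empty the buffer.
    function emit(   i, line) {
      line = cell[name[1]]
      for (i = 2; i <= n; i++) line = line "|" cell[name[i]]
      print line
      split("", cell)                     # portable way to clear an array
    }
    BEGIN { n = split(cols, name, "|"); print cols }
    NR == 1 && $1 == "Row" { next }       # skip the header line if present
    seen && $1 != current { emit() }      # Row changed: flush the previous row
    { current = $1; seen = 1; cell[$2] = $3 }
    END { if (seen) emit() }
  ' "$@"
}
```

For the sample data, pivot_stream 'A|B|C' data prints the A|B|C header followed by one line per row, holding only the current row in memory. If the same Row value shows up in non-adjacent lines, it will come out as separate output lines, so unsorted input would need a sort on the first field beforehand.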

UPDATE:

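Per comments, here is a version that uses classic comma subscripts (values[$1,$2]) instead of arrays of arrays, and also prints the Row label at the start of each line: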

#!/usr/local/bin/gawk -f

BEGIN {
  FS="|";
}

{
  rows[$1]=1; cols[$2]=1; values[$1,$2]=$3;
}

END {
  for (col in cols) {
    output=output sprintf("|%s", col);
  }
  print output;
  for (row in rows) {
    output="";
    for (col in cols) {
      output=output "|" values[row,col];
    }
    print row output;
  }
}

And the output:

ghoti@pc $ ./doit.awk data
|A|B|C
1|foo|bar|I have a real
2|bad||hangover
ghoti@pc $
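
One caveat with the scripts above: awk does not define the order in which "for (key in array)" visits keys, so rows and columns may come out in a different order with other data or another awk implementation. A minimal sketch that records rows and columns in first-seen order instead, using only features found in any POSIX awk. pivot_ordered is a hypothetical helper name, and the header-skipping rule is an assumption about the input.

```shell
# Hypothetical helper: pivot_ordered [file...]
# Keeps rows and columns in the order they first appear in the input.
pivot_ordered() {
  awk -F'|' '
    NR == 1 && $1 == "Row" { next }       # skip the header line if present
    !($1 in rowseen) { rowseen[$1] = 1; rows[++nr] = $1 }
    !($2 in colseen) { colseen[$2] = 1; cols[++nc] = $2 }
    { values[$1, $2] = $3 }
    END {
      # Header line, with an empty first cell above the row labels.
      line = ""
      for (c = 1; c <= nc; c++) line = line "|" cols[c]
      print line
      # One line per row: row label, then each cell (empty if missing).
      for (r = 1; r <= nr; r++) {
        line = rows[r]
        for (c = 1; c <= nc; c++) line = line "|" values[rows[r], cols[c]]
        print line
      }
    }
  ' "$@"
}
```

First-seen order is not the same as sorted order: if the first row had no A value, column A would appear after the columns that were seen earlier. This version still holds the whole table in memory.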

Answer 2:

Just use a hash. If you don't want to load the data into memory, you may need a module like DBM::Deep and a DBM backend.

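use strict;
use warnings;

my %table;

my $maxcol = 'A';   # highest column letter seen so far
my $maxrow = 0;     # highest row number seen so far

<>;                 # discard the "Row|Col|Val" header line

while (<>) {
    chomp;
    my ($row, $col, $val) = split /\|/;
    $table{$row}{$col} = $val;

    $maxrow = $row if $row > $maxrow;
    $maxcol = $col if $col gt $maxcol;
}

# Assumes columns are single letters starting at 'A'
# and rows are integers starting at 1.
my @cols = ('A' .. $maxcol);

print join('|', @cols), "\n";

for my $row (1 .. $maxrow) {
    # missing cells print as empty strings rather than undef
    print join('|', map { $table{$row}{$_} // '' } @cols), "\n";
}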

Answer 3:

If you know Awk, I'd recommend you look at Perl. Perl is just much more powerful than Awk. The advantage is that if you know BASH/Bourne shell and Awk, much of the syntax in Perl will be familiar.

Another nice thing about Perl is the entire CPAN repository which allows you to download already written Perl modules to use in your program. A quick search in CPAN brought up Data::Pivot which looks like (at a very quick glance) it might do what you want.

If not, take a look at the pivot function in Acme::Tools. Or try one of the many others.

Others have already provided a few solutions, but I recommend you look at what the CPAN Perl archive has. It's a very powerful tool for things like this.

Source: https://stackoverflow.com/questions/9475806/command-line-pivot

Tags

perl

bash

awk

pivot-table

gawk