I would like to ask for some hints on how to merge rows that share the same ID into a comma-separated table. Any hints in Perl, sed or awk are greatly appreciated.
This i
Using a Perl hash of arrays...
#!/usr/bin/perl
use warnings;
use strict;

my %data;      # protein_id => [ go_id, ... ]
my $header;
while (<DATA>) {
    chomp;
    if ($. == 1) {          # first line is the header
        $header = $_;
        next;
    }
    my ($id, $go) = split;  # split the row on whitespace
    push @{ $data{$id} }, $go;
}
print "$header\n";
for my $k (sort { $a <=> $b } keys %data) {   # numeric sort on protein_id
    print "$k\t";
    print join(', ', @{ $data{$k} });
    print "\n";
}
__DATA__
protein_id go_id
4102 GO:0003676
4125 GO:0003676
4125 GO:0008270
4139 GO:0008270
$ cat data.txt
protein_id go_id
4102 GO:0003676
4125 GO:0003676
4125 GO:0008270
4139 GO:0008270
$ perl -anE'sub a{say"$a\t",join", ",@a if$a;@a=($F[1]);$a=$F[0]}$F[0]eq$a?push@a,$F[1]:a()}{a()' data.txt
protein_id go_id
4102 GO:0003676
4125 GO:0003676, GO:0008270
4139 GO:0008270
Using awk
Input
$ cat file
protein_id go_id
4102 GO:0003676
4125 GO:0003676
4125 GO:0008270
4139 GO:0008270
Output (if order doesn't matter)
$ awk 'FNR==1{print;next}{A[$1]=$1 in A ? A[$1]", "$2:$2}END{for(i in A)print i,A[i]}' file
protein_id go_id
4139 GO:0008270
4102 GO:0003676
4125 GO:0003676, GO:0008270
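More readable version
awk '
FNR == 1 {                # print the header line unchanged
    print
    next
}
{                         # append the GO id to the list for this protein id
    A[$1] = ($1 in A) ? A[$1] ", " $2 : $2
}
END {                     # traversal order of "for (i in A)" is unspecified
    for (i in A)
        print i, A[i]
}
' file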
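If you want the rows sorted by ID (as the Perl script above does) rather than in whatever order `for (i in A)` happens to return, one option is to pipe the merged rows through sort. This is only a minimal sketch, assuming a POSIX shell, numeric IDs in the first column and a single header line; `merge_sorted` is a name made up for illustration.

```shell
# Sketch: merge GO ids per protein id, then sort the body numerically.
# merge_sorted is a hypothetical helper; it reads the table on stdin.
merge_sorted() {
    # Pass the header line through untouched.
    IFS= read -r header || return 0
    printf '%s\n' "$header"
    # Merge the remaining rows, then sort on the numeric id in column 1.
    awk '
    {
        if ($1 in A) A[$1] = A[$1] ", " $2
        else         A[$1] = $2
    }
    END { for (i in A) print i, A[i] }
    ' | sort -n
}

# Usage: merge_sorted < file
```

The header is read by the shell before awk starts, so it never takes part in the sort.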
Output (if order is important)
$ awk 'FNR==1{print;next}$1 in A{A[$1]=A[$1]", "$2;next}{A[O[++c]=$1]=$2}END{for(i=1; i in O; i++)print O[i],A[O[i]]}' file
protein_id go_id
4102 GO:0003676
4125 GO:0003676, GO:0008270
4139 GO:0008270
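More readable version
awk '
FNR == 1 {                # print the header line unchanged
    print
    next
}
$1 in A {                 # id seen before: append to its list
    A[$1] = A[$1] ", " $2
    next
}
{                         # new id: record its position in O, start its list
    A[O[++c] = $1] = $2
}
END {                     # walk O to print ids in first-seen order
    for (i = 1; i in O; i++)
        print O[i], A[O[i]]
}
' file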
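If rows with the same ID are always next to each other, as they are in the sample input, you don't need to hold everything in arrays. A rough sketch of a streaming variant follows; it assumes the input is already grouped by ID and has one header line, and `merge_adjacent` is a made-up name, not something from the answers above.

```shell
# Sketch: streaming merge, assuming rows with the same id are adjacent.
# merge_adjacent is a hypothetical helper; it reads the table on stdin.
merge_adjacent() {
    awk '
    NR == 1 { print; next }              # pass the header through
    NR > 2 && ($1 "") == prev {          # same id as the previous row
        line = line ", " $2              # (compared as strings)
        next
    }
    {
        if (NR > 2) print line           # id changed: flush the finished row
        prev = $1 ""
        line = $1 " " $2
    }
    END { if (NR > 1) print line }       # flush the last row
    '
}

# Usage: merge_adjacent < file
```

This keeps only one output row in memory at a time, but unlike the array versions it will emit an ID twice if its rows are not contiguous.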