How to merge rows that share an ID into a comma-separated table

温柔的废话 2021-01-29 09:13

I would like some hints on how to merge rows that share the same ID into a comma-separated table. Any hints in Perl, sed, or awk are greatly appreciated.

This i

3 Answers
  • 2021-01-29 09:34

    Using a Perl hash of arrays...

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    my %data;
    my $header;
    
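    while (<DATA>) {
        chomp;

        # The first line is the header; keep it for the output.
        if ($. == 1) {
            $header = $_;
            next;
        }

        # Group each go_id under its protein_id (hash of arrays).
        my ($id, $go) = split;
        push @{ $data{$id} }, $go;
    }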
    
    print "$header\n";
    
    # Print one line per ID, in numeric order, with its values comma-separated.
    for my $k (sort { $a <=> $b } keys %data) {
        print "$k\t", join(', ', @{ $data{$k} }), "\n";
    }
    
    __DATA__
    protein_id go_id
    4102    GO:0003676
    4125    GO:0003676
    4125    GO:0008270
    4139    GO:0008270
    
  • 2021-01-29 09:47
    $ cat data.txt 
    protein_id go_id
    4102    GO:0003676
    4125    GO:0003676
    4125    GO:0008270
    4139    GO:0008270
    $ perl -aE'sub a{say"$a\t",join", ",@a if$a;@a=($F[1]);$a=$F[0]}$F[0]eq$a?push@a,$F[1]:a()}{a()' data.txt
    protein_id      go_id
    4102    GO:0003676
    4125    GO:0003676, GO:0008270
    4139    GO:0008270
    
  • 2021-01-29 09:54

    Using awk

    Input

    $ cat file
    protein_id go_id
    4102    GO:0003676
    4125    GO:0003676
    4125    GO:0008270
    4139    GO:0008270
    

    Output (if order doesn't matter)

    $ awk 'FNR==1{print;next}{A[$1]=$1 in A ? A[$1]", "$2:$2}END{for(i in A)print i,A[i]}' file
    protein_id go_id
    4139 GO:0008270
    4102 GO:0003676
    4125 GO:0003676, GO:0008270
    

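    More readable version

    awk '
        FNR == 1 {                 # pass the header through unchanged
            print
            next
        }
        {                          # append this go_id to the list for its ID
            A[$1] = ($1 in A) ? A[$1] ", " $2 : $2
        }
        END {                      # iteration order of "in" is unspecified
            for (i in A)
                print i, A[i]
        }
    ' file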
    

    Output (if order is important)

    $ awk 'FNR==1{print;next}$1 in A{A[$1]=A[$1]", "$2;next}{A[O[++c]=$1]=$2}END{for(i=1; i in O; i++)print O[i],A[O[i]]}' file
    protein_id go_id
    4102 GO:0003676
    4125 GO:0003676, GO:0008270
    4139 GO:0008270
    

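    More readable version

    awk '
        FNR == 1 {                 # pass the header through unchanged
            print
            next
        }
        $1 in A {                  # ID seen before: append to its list
            A[$1] = A[$1] ", " $2
            next
        }
        {                          # new ID: record first-seen order in O
            A[O[++c] = $1] = $2
        }
        END {                      # print IDs in first-seen order
            for (i = 1; i in O; i++)
                print O[i], A[O[i]]
        }
    ' file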
    