Join two files using awk

南笙 2020-12-17 02:24

I have two tab-delimited files like the ones shown below:

file A

chr1   123 aa b c d
chr1   234 a  b c d
chr1   345 aa b c d
chr1   456 a  b c d
...         


        
2 Answers
  • 2020-12-17 02:33

    You can use join, but the pipeline gets so complicated it might be easier to switch to a more powerful language like Perl.

    join -11 -21 -o1.1,1.2,1.3,1.4,1.5,2.4,2.5 \
         <(sed 's/ \+/:/' fileA | sort) \
         <(sed 's/ \+/:/' fileB | sort) \
     | join -11 -22 -a1 -o1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.5,2.6 \
         - <(sed 's/ \+\([^ ]\+\) \+\([^ ]\+\)/ \1:\2/' fileC | sort -k2) \
     | sed 's/:/ /'
    

    Perl solution, using a hash to remember all the information:

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    #             key_start  key_end  keep_from  output
    my %files = (A => [0,      1,      2,       [0 .. 3]],
                 B => [0,      1,      2,       [-2, -1]],
                 C => [1,      2,      3,       [-2, -1]],
                );
    
    my %hash;
    
    for my $file (keys %files) {
        open my $FH, '<', "file$file" or die "file$file: $!";
        while (<$FH>) {
            my @fields = split;
            $hash{"@fields[$files{$file}[0], $files{$file}[1]]"}{$file}
                = [ @fields[$files{$file}[2] .. $#fields] ];
        }
    }
    
    for my $key (sort keys %hash) {
        print $key, join(' ', q(),
                         grep defined, map {
                             @{ $hash{$key}{$_} }[@{ $files{$_}[-1] }]
                         } sort keys %files), "\n";
    }
    
  • 2020-12-17 02:45

    Here is one approach using awk:

    $ awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' OFS='\t' fileb filea
    chr1    123     a    b    c     xxxx    abcd
    chr1    234     a    b    c 
    chr1    345     a    b    c     yyyy    defg
    chr1    456     a    b    c 
    

    Explanation:

    NR==FNR             # total record number equals the per-file record number, i.e. still reading the first file (fileb)
    a[$3,$4]=$1OFS$2    # Create entry in array with fields 3 and 4 as the key
    next                # Grab the next line (don't process the next block)
    $6=a[$1,$2]         # Assign the looked up value to field 6 (+rebuild records)  
    print               # Print the current line & the matching entry from fileb ($6)
    
    OFS='\t'            # Separate each field with a single TAB on output
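
    To try the one-liner end to end, here is a minimal self-contained sketch. The sample rows are hypothetical: the question does not show fileb, so its layout (two value columns followed by chr and position) is only inferred from the awk command above. The function name join_ab and the temporary directory are likewise just for illustration.

```shell
#!/bin/sh
# Hypothetical sample data; the real fileb layout is not shown in the question.
tmp=$(mktemp -d)
printf 'chr1\t123\taa\tb\tc\td\nchr1\t234\ta\tb\tc\td\n' > "$tmp/filea"
printf 'xxxx\tabcd\tchr1\t123\n' > "$tmp/fileb"

# join_ab LOOKUP_FILE MAIN_FILE (illustrative helper, not part of the answer)
join_ab() {
  awk -F'\t' -v OFS='\t' '
    # First file only: index the two value columns by columns 3 and 4.
    NR == FNR { a[$3, $4] = $1 OFS $2; next }
    # Second file: overwrite field 6 with the looked-up value (empty if no match).
    { $6 = a[$1, $2]; print }
  ' "$1" "$2"
}

join_ab "$tmp/fileb" "$tmp/filea"
```

    Note that a row without a match still gets its sixth field replaced, so it is printed with five fields and a trailing TAB.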
    

    Edit:

    For the three-field problem you simply add the extra field to the key:

    $ awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' OFS='\t' fileb filea
    chr1    123    aa     b      c     xxxx     abcd
    chr1    234    a      b      c  
    chr1    345    aa     b      c     yyyy     defg
    chr1    456    a      b      c 
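
    As a variation on the three-field key, the sketch below uses awk's `in` operator to test for a match before reading the array, and fills unmatched rows with a placeholder. The sample rows, the helper name join_ab3, and the "NA" placeholder are all assumptions made for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Hypothetical sample data: two filea rows share chr and position but differ in column 3.
tmp3=$(mktemp -d)
printf 'chr1\t123\taa\tb\tc\td\nchr1\t123\ta\tb\tc\td\n' > "$tmp3/filea"
printf 'xxxx\tabcd\tchr1\t123\taa\n' > "$tmp3/fileb"

# join_ab3 LOOKUP_FILE MAIN_FILE (illustrative helper)
join_ab3() {
  awk -F'\t' -v OFS='\t' '
    # First file only: key on columns 3, 4 and 5.
    NR == FNR { a[$3, $4, $5] = $1 OFS $2; next }
    {
      # "in" checks membership without creating an empty array entry.
      if (($1, $2, $3) in a) {
        $6 = a[$1, $2, $3]
      } else {
        $6 = "NA" OFS "NA"   # placeholder chosen for this sketch
      }
      print
    }
  ' "$1" "$2"
}

join_ab3 "$tmp3/fileb" "$tmp3/filea"
```

    With the two-field key both rows would match; with the third field in the key only the `aa` row does.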
    