Join two files using awk

前端未结


I have two tab-delimited files, as shown below:

file A

chr1   123 aa b c d
chr1   234 a  b c d
chr1   345 aa b c d
chr1   456 a  b c d
...


                      
2 Answers

心在旅途
2020-12-17 02:33

You can use join, but the pipeline gets so complicated it might be easier to switch to a more powerful language like Perl.

join -11 -21 -o1.1,1.2,1.3,1.4,1.5,2.4,2.5 \
     <(sed 's/ \+/:/' fileA | sort) \
     <(sed 's/ \+/:/' fileB | sort) \
 | join -11 -22 -a1 -o1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.5,2.6 \
     - <(sed 's/ \+\([^ ]\+\) \+\([^ ]\+\)/ \1:\2/' fileC | sort -k2) \
 | sed 's/:/ /'
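
The core trick in the pipeline above is fusing the two key columns into a single field, because join matches on one field only. Below is a minimal sketch of that step alone. It assumes two small space-separated files whose names and contents are hypothetical (they are not the files from the question), and it uses temporary files instead of process substitution so that it runs in plain sh:

```shell
#!/bin/sh
# Hypothetical sample data, not taken from the question.
tmp=$(mktemp -d)
printf 'chr1 345 bb\nchr1 123 aa\n'     > "$tmp/a"
printf 'chr1 123 xxxx\nchr1 345 yyyy\n' > "$tmp/b"

# Fuse the first two columns into one key (chr1:123), then sort on it.
# LC_ALL=C keeps sort and join agreeing on the collation order.
sed 's/ /:/' "$tmp/a" | LC_ALL=C sort > "$tmp/a.key"
sed 's/ /:/' "$tmp/b" | LC_ALL=C sort > "$tmp/b.key"

# Join on the fused key, then split the key back into two columns.
result=$(LC_ALL=C join "$tmp/a.key" "$tmp/b.key" | sed 's/:/ /')

printf '%s\n' "$result"
rm -rf "$tmp"
```

Each extra file needs another fuse, sort and join stage, which is why the full pipeline grows so quickly.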


Perl solution, using a hash to remember all the information:

#!/usr/bin/perl
use warnings;
use strict;

#             key_start  key_end  keep_from  output
my %files = (A => [0,      1,      2,       [0 .. 3]],
             B => [0,      1,      2,       [-2, -1]],
             C => [1,      2,      3,       [-2, -1]],
            );

my %hash;

for my $file (keys %files) {
    open my $FH, '<', "file$file" or die "file$file: $!";
    while (<$FH>) {
        my @fields = split;
        $hash{"@fields[$files{$file}[0], $files{$file}[1]]"}{$file}
            = [ @fields[$files{$file}[2] .. $#fields] ];
    }
}

for my $key (sort keys %hash) {
    print $key, join(' ', q(),
                     grep defined, map {
                         @{ $hash{$key}{$_} }[@{ $files{$_}[-1] }]
                     } sort keys %files), "\n";
}

                                                                        
                                                        
            
            
              
                
走了就别回头了
2020-12-17 02:45

Here is one approach using awk:

$ awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' OFS='\t' fileb filea
chr1    123     a    b    c     xxxx    abcd
chr1    234     a    b    c 
chr1    345     a    b    c     yyyy    defg
chr1    456     a    b    c 


Explanation:

NR==FNR             # True only while reading the first file (fileb): the overall record number equals the per-file record number
a[$3,$4]=$1OFS$2    # Store fields 1 and 2 of fileb in an array keyed by fields 3 and 4
next                # Skip to the next line (don't run the second block)
$6=a[$1,$2]         # For filea, assign the looked-up value to field 6 (this also rebuilds the record using OFS)
print               # Print the current line, now including the matching entry from fileb ($6)

OFS='\t'            # Separate each field with a single TAB on output
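
To make the two passes concrete, here is a small self-contained sketch. The file contents are hypothetical: in particular, the layout of fileb (two value columns followed by the key columns) is an assumption inferred from the awk program, since fileb itself is not shown above.

```shell
#!/bin/sh
# Hypothetical sample data; fileb's column layout is an assumption.
tmp=$(mktemp -d)
printf 'chr1\t123\taa\tb\tc\n'   >  "$tmp/filea"
printf 'chr1\t234\ta\tb\tc\n'    >> "$tmp/filea"
printf 'xxxx\tabcd\tchr1\t123\n' >  "$tmp/fileb"

# Pass 1 (NR==FNR): index fileb by its 3rd and 4th fields.
# Pass 2: look up each filea row by its 1st and 2nd fields.
result=$(awk '
    NR == FNR { a[$3, $4] = $1 OFS $2; next }
    { $6 = a[$1, $2]; print }
' OFS='\t' "$tmp/fileb" "$tmp/filea")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The row with key chr1/123 gains the two fileb columns. The row with key chr1/234 has no match, so it is printed with an empty sixth field, which shows up as a trailing TAB.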


Edit:

To match on three fields instead of two, simply add the extra field to the key:

$ awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' OFS='\t' fileb filea
chr1    123    aa     b      c     xxxx     abcd
chr1    234    a      b      c  
chr1    345    aa     b      c     yyyy     defg
chr1    456    a      b      c 
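
The effect of the third key field is easiest to see when two rows share the first two fields. The sketch below uses hypothetical data (again assuming that fileb holds two value columns followed by the key columns), in which only the row whose third field is aa should match:

```shell
#!/bin/sh
# Hypothetical sample data; both filea rows share chr1/123 but differ in field 3.
tmp=$(mktemp -d)
printf 'chr1\t123\taa\tb\tc\n'       >  "$tmp/filea"
printf 'chr1\t123\ta\tb\tc\n'        >> "$tmp/filea"
printf 'xxxx\tabcd\tchr1\t123\taa\n' >  "$tmp/fileb"

# Key on three fields: fileb fields 3-5 against filea fields 1-3.
result=$(awk '
    NR == FNR { a[$3, $4, $5] = $1 OFS $2; next }
    { $6 = a[$1, $2, $3]; print }
' OFS='\t' "$tmp/fileb" "$tmp/filea")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With a two-field key, both rows would have picked up xxxx and abcd. With the three-field key, only the aa row does.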

                                                                        
                                                        
            
            
              
                