I have two tab-delimited files like the ones shown below:
file A
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
...
You can use join, but the pipeline gets so complicated that it might be easier to switch to a more powerful language like Perl.
join -11 -21 -o1.1,1.2,1.3,1.4,1.5,2.4,2.5 \
<(sed 's/ \+/:/' fileA | sort) \
<(sed 's/ \+/:/' fileB | sort) \
| join -11 -22 -a1 -o1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.5,2.6 \
- <(sed 's/ \+\([^ ]\+\) \+\([^ ]\+\)/ \1:\2/' fileC | sort -k2) \
| sed 's/:/ /'
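To see the trick in isolation, here is a minimal sketch that joins just two files on a two-column key. The file names `left` and `right` and their rows are made up for illustration, and the data is space-separated. It glues the first two columns together with a colon so that join sees a single key field, then splits them again at the end. It uses temporary files instead of `<(...)` so it also runs in a plain POSIX shell.

```shell
#!/bin/sh
# Hypothetical sample data, not taken from the question.
export LC_ALL=C            # keep sort and join collation consistent
tmp=$(mktemp -d)
printf 'chr1 123 a b\nchr1 234 c d\n' > "$tmp/left"
printf 'chr1 123 x y\nchr1 345 z w\n' > "$tmp/right"

# Merge the first two columns into one key ("chr1:123"), then sort on it.
sed 's/  */:/' "$tmp/left"  | sort > "$tmp/left.keyed"
sed 's/  */:/' "$tmp/right" | sort > "$tmp/right.keyed"

# Join on the merged key (field 1 by default), then split the key again.
joined=$(join "$tmp/left.keyed" "$tmp/right.keyed" | sed 's/:/ /')
printf '%s\n' "$joined"
rm -rf "$tmp"
```

Only `chr1 123` appears in both files, so only that row is printed. Adding `-a1` to join would also keep the unmatched rows of the first file, as in the pipeline above.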
Perl solution, using a hash to remember all the information:
#!/usr/bin/perl
use warnings;
use strict;
# key_start key_end keep_from output
my %files = (A => [0, 1, 2, [0 .. 3]],
             B => [0, 1, 2, [-2, -1]],
             C => [1, 2, 3, [-2, -1]],
            );
my %hash;
for my $file (keys %files) {
    open my $FH, '<', "file$file" or die "file$file: $!";
    while (<$FH>) {
        my @fields = split;
        $hash{"@fields[$files{$file}[0], $files{$file}[1]]"}{$file}
            = [ @fields[$files{$file}[2] .. $#fields] ];
    }
}
for my $key (sort keys %hash) {
    print $key, join(' ', q(),
                     grep defined, map {
                         @{ $hash{$key}{$_} }[@{ $files{$_}[-1] }]
                     } sort keys %files), "\n";
}
Here is one approach using awk:
$ awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' OFS='\t' fileb filea
chr1 123 a b c xxxx abcd
chr1 234 a b c
chr1 345 a b c yyyy defg
chr1 456 a b c
Explanation:
NR==FNR # True only while reading the first file (fileb): the overall record number equals the per-file record number
a[$3,$4]=$1OFS$2 # Create entry in array with fields 3 and 4 as the key
next # Grab the next line (don't process the next block)
$6=a[$1,$2] # Assign the looked up value to field 6 (+rebuild records)
print # Print the current line & the matching entry from fileb ($6)
OFS='\t' # Separate each field with a single TAB on output
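If you want to try the lookup end to end, here is a minimal sketch. It assumes fileb holds two value columns followed by the two key columns; the sample rows are made up for illustration and are not taken from the question.

```shell
#!/bin/sh
# Hypothetical sample data; the layout of fileb is an assumption.
tmp=$(mktemp -d)
printf 'chr1\t123\ta\tb\tc\td\nchr1\t234\ta\tb\tc\td\n' > "$tmp/filea"
printf 'xxxx\tabcd\tchr1\t123\n' > "$tmp/fileb"

# First pass (fileb): remember columns 1-2 under the key built from columns 3-4.
# Second pass (filea): overwrite field 6 with the remembered value, if any.
result=$(awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' \
    OFS='\t' "$tmp/fileb" "$tmp/filea")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that a row without a match still gets its sixth field assigned, just with an empty string, so that line ends in a trailing TAB.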
Edit:
For the three-field problem, you simply add the extra field to the key:
$ awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' OFS='\t' fileb filea
chr1 123 aa b c xxxx abcd
chr1 234 a b c
chr1 345 aa b c yyyy defg
chr1 456 a b c
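To show why the third key field matters, here is a small sketch comparing both keys on the same input. The sample rows are made up, and it again assumes fileb carries the key columns after the two value columns. Two rows in filea share `chr1 123` but differ in the third column.

```shell
#!/bin/sh
# Hypothetical sample data; the layout of fileb is an assumption.
tmp=$(mktemp -d)
printf 'chr1\t123\taa\tb\tc\td\nchr1\t123\ta\tb\tc\td\n' > "$tmp/filea"
printf 'xxxx\tabcd\tchr1\t123\taa\n' > "$tmp/fileb"

# Two-field key: both filea rows match the single fileb row.
two=$(awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' \
    OFS='\t' "$tmp/fileb" "$tmp/filea")

# Three-field key: only the row whose third column is "aa" matches.
three=$(awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' \
    OFS='\t' "$tmp/fileb" "$tmp/filea")

printf 'two-field key:\n%s\nthree-field key:\n%s\n' "$two" "$three"
rm -rf "$tmp"
```

The comma in `a[$3,$4,$5]` makes awk join the subscripts with SUBSEP into one string key, so extending the key is just a matter of listing one more field on both sides.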