I want to compare 3 files to see how much of the information in them is the same. The file format is something like this:
Chr11 447 . A C 74 . DP=22;AF1=1;CI95=1,1;DP4=0,0,9,8;MQ=15;FQ=-78 GT:PL:GQ 1/1:107,51,0:99
Chr10 449 . G C 35 . DP=26;AF1=0.5;CI95=0.5,0.5;DP4=5,0,7,8;MQ=20;FQ=11.3;PV4=0.055,0.0083,0.028,1 GT:PL:GQ 0/1:65,0,38:41
Chr12 517 . G A 222 . DP=122;AF1=1;CI95=1,1;DP4=0,0,77,40;MQ=23;FQ=-282 GT:PL:GQ 1/1:255,255,0:99
Chr10 761 . G A 41 . DP=93;AF1=0.5;CI95=0.5,0.5;DP4=11,34,6,35;MQ=19;FQ=44;PV4=0.29,1.8e-35,1,1 GT:PL:GQ 0/1:71,0,116:74
I'm only interested in the first two columns (if the first two columns are the same, I consider the lines equal). This is the command I use for comparing two files:
awk 'FILENAME==ARGV[1] {pair[$1 " " $2]; next} ($1 " " $2 in pair)' file1 file2 | wc -l
I would like to stick with awk, since my files are really big and awk handles them really well, but I couldn't figure out how to use it for 3 files!
If the goal is simply to print the pairs (column 1 + column 2) that are common to all 3 files, and you make use of the fact that a pair is unique within a file, you could do it this way:
awk '{print $1" "$2}' a b c | sort | uniq -c | awk '{if ($1==3){print $2" "$3}}'
This works for an arbitrary number of files as long as you adjust the count in the last command.
Here's what it does:
- prints the first 2 columns of all files and sorts them: awk '{print $1" "$2}' a b c | sort
- counts the number of duplicate entries: uniq -c
- if the duplicate count equals the number of files, we found a match, so print it: awk '{if ($1==3){print $2" "$3}}'
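To see the mechanics, suppose the pair Chr10 449 occurred in all three files and Chr11 447 in only two of them (hypothetical counts, just to illustrate). The intermediate uniq -c output would then contain lines like:

3 Chr10 449
2 Chr11 447

and the final awk keeps only the count==3 line, printing Chr10 449.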
If you're doing this often, you can express it as a bash function (and drop it in your .bashrc) which parameterizes the file count:
function common_pairs {
    # print col1+col2 from every file given, count duplicates, and keep
    # the pairs whose count equals the number of files passed in ($#)
    awk '{print $1" "$2}' "$@" | sort | uniq -c | awk -v numf=$# '$1 == numf {print $2" "$3}'
}
Call it with any number of files you want: common_pairs file1 file2 file3 fileN
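If you'd rather stay in a single awk pass with no sort (closer to the two-file command in the question), it extends to three files like this; a minimal sketch, assuming each pair is unique within a file (the array names one and two are arbitrary):

awk 'FNR==NR {one[$1" "$2]; next}                                # file1: remember its pairs
     FILENAME==ARGV[2] {if ($1" "$2 in one) two[$1" "$2]; next}  # file2: keep pairs also seen in file1
     ($1" "$2 in two)' file1 file2 file3 | wc -l                 # file3: count pairs common to all three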
For this I'd use the commands cut, sort and comm.
Use cut to cut away the fields that aren't needed.
Sort the result, since comm expects sorted input.
Use comm to get the lines that are in both file1 and file2.
Use comm again to get the lines that are also in file3.
A script could look like this:
for i in 1 2 3
do
    # keep only the key columns; the cut options (here: the first
    # 15 characters) may have to be adjusted for your input files
    cut -c1-15 file$i | sort > tmp.$i
done
# keys present in both tmp.1 and tmp.2
comm -12 tmp.1 tmp.2 > tmp.1+2
# of those, keys also present in tmp.3
comm -12 tmp.3 tmp.1+2 > tmp.1+2+3
(Of course one could use extended shell syntax to avoid the temporary files, but I don't want to hide the idea behind complex syntax expressions; see the sketch below.)
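For the curious, a sketch of that temp-file-free variant using bash process substitution (comm can read one side from stdin via -; common_keys is just an arbitrary output name):

comm -12 <(cut -c1-15 file1 | sort) <(cut -c1-15 file2 | sort) \
    | comm -12 - <(cut -c1-15 file3 | sort) > common_keys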
The file tmp.1+2+3 should now contain the keys present in all three files. If you're interested in the whole lines, you can use that key list in combination with any of the three input files, e.g. with the command join on a sorted copy (one workable sketch follows).
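One caveat: join matches on a single whitespace-delimited field, and the key here spans two columns, so a simpler sketch swaps in grep -F to pull the full matching lines out of one of the inputs (it may over-match if a key string also appears elsewhere in a line):

grep -F -f tmp.1+2+3 file1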
Just read your last comment: you want the files joined, but duplicates removed?
sort file1 file2 file3 | uniq > newfile
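Equivalently, sort can drop the duplicates itself with its -u flag:

sort -u file1 file2 file3 > newfile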
Not intending to start an editor war, but I'm familiar with vi, and vimdiff and its variants show the comparison between multiple files in a parallel view, which I find very handy. You can simply call it with
$ vimdiff <filelist>
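For three files that would be, for example:

$ vimdiff file1 file2 file3

(If I remember correctly, Vim's diff mode handles up to eight buffers at once.)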
Source: https://stackoverflow.com/questions/7964807/how-can-i-compare-3-files-together-to-see-what-is-in-common-between-them