Question
I am looking for something similar to the bash command comm that I can use to select entries both unique to my two files and common to them. comm worked great when I had just one column per file, e.g.
comm -23 FILE1.txt FILE2.txt > Entries_only_in_file1.txt
But now I have multiple columns of info I wish to keep. I want to use column 2 as the key for splitting rows into entries unique to each file and entries common to both. If the entry in column 2 appears in both files, I also want to record the info in columns 3, 4, and 5 (if possible; this is not as important). Here is an example of input and desired output.
FILE1.txt
NM_023928 AACS 2 2 1
NM_182662 AADAT 2 2 1
NM_153698 AAED1 1 5 3
NM_001271 AAGAB 2 2 1
FILE2.txt
NM_153698 AAED1 2 5 3
NM_001271 AAGAB 2 2 1
NM_001605 AARS 3 40 37
NM_212533 ABCA2 3 4 2
Output wanted:
COMMON.txt
NM_153698 AAED1 1 5 3 2 5 3
NM_001271 AAGAB 2 2 1 2 2 1
UNIQUE_TO_1.txt
NM_023928 AACS 2 2 1
NM_182662 AADAT 2 2 1
UNIQUE_TO_2.txt
NM_001605 AARS 3 40 37
NM_212533 ABCA2 3 4 2
I know there have been similar questions before, but I can't quite find what I'm looking for. Any ideas greatly appreciated, thank you.
Answer 1:
join has the following options, which are useful for your task:

-j FIELD   : join on field FIELD
-o FORMAT  : specify the output format, as a comma-separated list of FILENUM.FIELD
-v FILENUM : output only the lines found in FILENUM alone
Common to both files:
$ join -j2 -o 1.1,1.2,1.3,1.4,1.5,2.3,2.4,2.5 FILE1.txt FILE2.txt
NM_153698 AAED1 1 5 3 2 5 3
NM_001271 AAGAB 2 2 1 2 2 1
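Note that join requires both inputs to be sorted on the join field. The sample files happen to be sorted on column 2 already; for unsorted data a sort pass is needed first, along these lines (the .sorted.txt names are just placeholders):

$ sort -k2,2 FILE1.txt > FILE1.sorted.txt
$ sort -k2,2 FILE2.txt > FILE2.sorted.txt
$ join -j2 -o 1.1,1.2,1.3,1.4,1.5,2.3,2.4,2.5 FILE1.sorted.txt FILE2.sorted.txt > COMMON.txt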
Unique to FILE1:
$ join -j2 -v1 FILE1.txt FILE2.txt
AACS NM_023928 2 2 1
AADAT NM_182662 2 2 1
Unique to FILE2:
$ join -j2 -v2 FILE1.txt FILE2.txt
AARS NM_001605 3 40 37
ABCA2 NM_212533 3 4 2
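Note that with -v the join field is printed first, so the column order differs from the wanted output. With GNU join, an explicit -o list should restore the original order (a sketch, not verified against every join implementation):

$ join -j2 -v1 -o 1.1,1.2,1.3,1.4,1.5 FILE1.txt FILE2.txt > UNIQUE_TO_1.txt
$ join -j2 -v2 -o 2.1,2.2,2.3,2.4,2.5 FILE1.txt FILE2.txt > UNIQUE_TO_2.txt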
Answer 2:
You can achieve that with GNU awk; here is a script:
script.awk
function unique(filename, line,    tmp) {
    # split the stored line back into its fields and append them to filename
    split( line, tmp, FS )
    print tmp[1], tmp[2], tmp[3], tmp[4], tmp[5] >> filename
}
NR == FNR {   # while reading the first file: store each line under its column-2 key
    file1[ $2 ] = $0
    next
}
{
    if( $2 in file1 ) {   # the key from file2 was also in file1: common entry
        split( file1[ $2 ], tmp, FS )
        print $1, $2, tmp[3], tmp[4], tmp[5], $3, $4, $5 >> "COMMON.txt"
        # remove the common key, so the keys left over are the ones unique to file1
        delete file1[ $2 ]
    }
    else {   # key unique to file2
        unique("UNIQUE_TO_2.txt", $0)
    }
}
END {
    # the remaining keys never matched, so they are unique to file1
    for( k in file1 ) {
        unique("UNIQUE_TO_1.txt", file1[ k ])
    }
}
Use it like this:
# erase the output files if present
rm -f COMMON.txt UNIQUE_TO_1.txt UNIQUE_TO_2.txt
# run script, create the file
awk -f script.awk FILE1.txt FILE2.txt
# output the files
for f in COMMON.txt UNIQUE_TO_1.txt UNIQUE_TO_2.txt; do echo "$f"; cat "$f"; done
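With the sample files, the run should print something like this (the order of the two UNIQUE_TO_1.txt lines may differ, since for( k in file1 ) iterates in an unspecified order):

COMMON.txt
NM_153698 AAED1 1 5 3 2 5 3
NM_001271 AAGAB 2 2 1 2 2 1
UNIQUE_TO_1.txt
NM_023928 AACS 2 2 1
NM_182662 AADAT 2 2 1
UNIQUE_TO_2.txt
NM_001605 AARS 3 40 37
NM_212533 ABCA2 3 4 2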
The print ... >> filename appends the text to filename. This is why the output files have to be removed with rm before running the script a second time.
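One way to avoid the rm step: in awk, the > redirection truncates an output file only the first time it is opened during a run and appends from then on, so the helper could redirect with > instead of >>. A minimal sketch of the adjusted function (the print in the main block would need the same change):

function unique(filename, line,    tmp) {
    split( line, tmp, FS )
    # ">" truncates filename once per awk run, then keeps appending
    print tmp[1], tmp[2], tmp[3], tmp[4], tmp[5] > filename
}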
Source: https://stackoverflow.com/questions/37170326/bash-comm-command-but-for-multiple-columns