问题
I'm trying to compare column 1 from file1 and column 3 from file 2, if they match then print the first column from file1 and the two first columns from file2.
here's a sample from each file:
file1
Cre01.g000100
Cre01.g000500
Cre01.g000650
file2
chromosome_1 71569 |655|Cre01.g000500|protein_coding|CODING|PAC:26902937|1|1)
chromosome_1 93952 |765|Cre01.g000650|protein_coding|CODING|PAC:26903448|11|1)
chromosome_1 99034 |1027|Cre01.g000100 |protein_coding|CODING|PAC:26903318|9|1)
desired output
Cre01.g000100 chromosome_1 99034
Cre01.g000500 chromosome_1 71569
Cre01.g000650 chromosome_1 93952
I've been looking at various threads that are somewhat similar, but I can't seem to get it to print the columns from both files. Here are some links that are somewhat related:
awk compare 2 files, 2 fields different order in the file, print or merge match and non match lines
Obtain patterns from a file, compare to a column of another file, print matching lines, using awk
awk compare columns from two files, impute values of another column
Obtain patterns in one file from another using ack or awk or better way than grep?
Awk - combine the data from 2 files and print to 3rd file if keys matched
I feel like I should have been able to figure it out based on these threads, but it's been two days that I've been trying different variations of the codes and I haven't gotten anywhere. Here is some code that I've tried using on my files:
awk 'FNR==NR{a[$3]=$1;next;}{print $0 ($3 in a ? a[$3]:"NA")}' file1 file2
awk 'NR==FNR{ a[$1]; next} ($3 in a) {print $1 $2 a[$1]}' file1 file2
awk 'FNR==NR{a[$1]=$0; next}{print a[$1] $0}' file1 file2
I know i have to create a temp matrix that contains the first column of file1 (or the 3rd column of file2) then compare it to the other file. If there is a match, then print first column from file1 and column 1 and column 2 from file 2.
Thanks for the help!
回答1:
You can use this awk
:
awk -F '[| ]+' -v OFS='\t' 'NR==FNR{a[$4]=$1 OFS $2; next}
$1 in a{print $1, a[$1]}' file2 file1
Cre01.g000100 chromosome_1 99034
Cre01.g000500 chromosome_1 71569
Cre01.g000650 chromosome_1 93952
回答2:
Your middle attempt of the three is closest, but:
- You haven't specified the field delimiter is
|
. - You don't assign to
a[$1]
. Your sample output is inconsistent with your desired output (the sample output shows column 1 from file 1 and column 1 from file 2; the desired output is reputedly column 1 from file 1 and columns 1 and 2 from file 2, though this interpretation depends on the interpretation of
$3
in file 2 being the name between two pipe symbols).Citing the question at the time this answer was created:
… compare column 1 from file1 and column 3 from file 2, if they match then print the first column from file1 and the two first columns from file2.
desired output Cre01.g000100 chromosome_1 99034 Cre01.g000500 chromosome_1 71569 Cre01.g000650 chromosome_1 93952
We can observe that if
$3
in file 2 is equal to a value from file 1, then it is as easy to print$3
as a saved value.
So, fixing this up:
awk -F'|' 'NR==FNR { a[$1]=1; next } ($3 in a) { print $3, $1 }' file1 file2
The key change is the assignment to a[$1]
(and the -F'|'
); the rest is cosmetic and can be tweaked to suit your requirements (since the question is self-inconsistent, it is hard to give a better answer).
来源:https://stackoverflow.com/questions/31168521/awk-search-column-from-one-file-if-match-print-columns-from-both-files