I am having problems with grep and awk. I think it's because my input file contains text that looks like code.
The input file contains ID names and looks like this:
$ fgrep -f source.file reference.file
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
fgrep is equivalent to grep -F:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
The -f option is for taking PATTERN from a file:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
As noted in the comments, this can produce false positives if an ID in reference.file contains an ID in source.file as a substring. You can construct a stricter pattern for grep on the fly with sed:
grep -f <( sed 's/.*/ &$/' source.file) reference.file
But this way the patterns are interpreted as regular expressions rather than fixed strings, so an ID containing a regex metacharacter (such as . or *) could match more than intended (although this may be fine if the IDs contain only alphanumeric characters). The better way, though (thanks to @sidharthcnadhan), is to use the -w option:
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
So the final answer to your question is:
grep -Fwf source.file reference.file
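To see what -w changes, here is a small self-contained sketch. The file contents are made up for illustration (the ID MIR43 and the gene ID ENSG00000000001 are hypothetical, chosen so that one ID is a prefix of another); only the grep options come from the answer above.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Hypothetical ID list: MIR43 is a prefix of MIR432.
printf '%s\n' 'MIR43' 'RNU6-2' > "$tmp/source.file"

# Hypothetical reference data in the same two-column layout.
printf '%s\n' \
  'ENSG00000207793 MIR432' \
  'ENSG00000207447 RNU6-2' \
  'ENSG00000000001 MIR43' > "$tmp/reference.file"

# Fixed strings only: MIR43 also matches inside MIR432 (a false positive).
loose=$(grep -Ff "$tmp/source.file" "$tmp/reference.file")

# Fixed strings plus whole-word matching: the MIR432 line is no longer selected.
strict=$(grep -Fwf "$tmp/source.file" "$tmp/reference.file")

printf 'without -w:\n%s\n\nwith -w:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

One caveat: -w treats the hyphen as a non-word character, so an ID such as RNU6 would still select a line ending in RNU6-2. If your IDs can be hyphenated prefixes of each other, comparing whole fields (as in the awk answer) is safer.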
This was a nice bash-ish try. The problem was that you overwrite the result file on every iteration. Use >> instead of >, or move the > after done:
grep -w "$line" reference.file >> outputfile
or
done > outputfile
But I would prefer Lev's solution as it starts an external process only once.
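The original loop is not reproduced here, so the following is only a guess at its shape, with the redirection moved after done. The file names source.file and reference.file are borrowed from the grep answer, and the sample contents are invented.

```shell
# Scratch directory with made-up sample data.
tmp=$(mktemp -d)
printf '%s\n' 'MIR432' 'RNU6-2' > "$tmp/source.file"
printf '%s\n' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000207447 RNU6-2' > "$tmp/reference.file"

# One grep per ID. -F keeps the ID literal, -w requires a whole-word match,
# and -- stops an ID that starts with '-' from being read as an option.
while IFS= read -r line; do
  grep -Fw -- "$line" "$tmp/reference.file"
done < "$tmp/source.file" > "$tmp/outputfile"   # output file is opened once for the whole loop

result=$(cat "$tmp/outputfile")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the redirection belongs to the whole loop, the output file is truncated once at the start and every grep then appends to the same open file.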
If you want to solve it in pure bash, you could try this:
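# Read the whitespace-separated IDs into an array.
ID=($(<IDfile))
while read -r; do
  for ((i = 0; i < ${#ID[@]}; ++i)); do
    # Quoting the ID makes it match literally, not as a regex.
    [[ $REPLY =~ [[:space:]]"${ID[i]}"$ ]] && echo "$REPLY" && break
  done
done <RefFile >outputfile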
cat outputfile
Output:
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
Newer versions of bash (4.0 and later) support associative arrays, which can simplify and speed up the key lookup:
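declare -A ID
# Store each ID as a key so that lookup needs no inner loop.
for i in $(<IDfile); do ID[$i]=1; done
while read -r v; do
  # Capture the last whitespace-delimited field and test it against the keys.
  [[ $v =~ [[:space:]]([^[:space:]]+)$ && ${ID[${BASH_REMATCH[1]}]} = 1 ]] && echo "$v"
done <RefFile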
This will do the trick:
$ awk 'NR==FNR{a[$0];next}$NF in a{print}' input reference
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
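For readers unfamiliar with the NR==FNR idiom, the one-liner can be written out with comments. This is the same logic spelled out as a sketch; the array name ids is my own choice and the sample files are invented and created in a temporary directory.

```shell
# Scratch directory with made-up sample data.
tmp=$(mktemp -d)
printf '%s\n' 'SNORD115-40' 'MIR432' > "$tmp/input"
printf '%s\n' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000207447 RNU6-2' > "$tmp/reference"

matches=$(awk '
  # NR == FNR holds only while the first file (the ID list) is being read.
  NR == FNR { ids[$0]; next }   # remember each ID as an array key
  # For the second file, print lines whose last field is a known ID.
  $NF in ids
' "$tmp/input" "$tmp/reference")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

Because the whole last field is compared against the whole ID, there are no substring false positives. Note that if the ID file is empty, NR == FNR is also true for the second file, so nothing is printed.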