Matching text using grep or awk

轮回少年 2021-01-06 10:45

I am having problems with grep and awk. I think it's because my input file contains text that looks like code.

The input file contains ID names and looks like this:

3 Answers
  • 2021-01-06 11:16
    $ fgrep -f source.file reference.file 
    ENSG00000199537 SNORD115-40
    ENSG00000207793 MIR432
    ENSG00000207447 RNU6-2
    

    fgrep is equivalent to grep -F (recent GNU grep releases warn that fgrep is obsolescent, so grep -F is the safer spelling):

       -F, --fixed-strings
              Interpret  PATTERN  as  a  list  of  fixed strings, separated by
              newlines, any of which is to be matched.  (-F  is  specified  by
              POSIX.)
    

    The -f option is for taking PATTERN from a file:

       -f FILE, --file=FILE
              Obtain  patterns  from  FILE,  one  per  line.   The  empty file
              contains zero patterns, and therefore matches nothing.   (-f  is
              specified by POSIX.)
    

    As noted in the comments, this can produce false positives if an ID in reference.file contains an ID from source.file as a substring. You can build stricter patterns for grep on the fly with sed, anchoring each ID between a space and the end of the line:

    grep -f <( sed 's/.*/ &$/' source.file) reference.file
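
    To see the substring problem concretely, here is a minimal sketch on made-up sample data. The three IDs come from the output above; the MIR4321 line, its ENSG_EXAMPLE accession, the temporary directory, and the patterns.file name are assumptions added for illustration. It writes the sed-generated patterns to a file instead of using process substitution, so it should also run in a plain POSIX shell.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# IDs to look up (taken from the output shown above).
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file

# Reference table; the last line is an invented entry whose ID merely contains "MIR432".
printf '%s\n' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000207447 RNU6-2' \
  'ENSG_EXAMPLE MIR4321' > reference.file

# Plain fixed-string matching also reports the MIR4321 line (a false positive).
loose=$(grep -F -f source.file reference.file)

# Turn each ID into a pattern anchored between a space and end of line, e.g. " MIR432$".
sed 's/.*/ &$/' source.file > patterns.file
strict=$(grep -f patterns.file reference.file)

printf 'fixed strings:\n%s\n\nanchored patterns:\n%s\n' "$loose" "$strict"
```

    On this sample the fixed-string run reports four lines and the anchored run reports three, because " MIR432$" cannot match a line ending in MIR4321.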
    

    However, the patterns are now interpreted as regular expressions rather than fixed strings, so an ID containing regex metacharacters such as . or * could match unintended lines (this is fine if the IDs contain only letters, digits, and hyphens). A better approach (thanks to @sidharthcnadhan) is the -w option:

       -w, --word-regexp
              Select  only  those  lines  containing  matches  that form whole
              words.  The test is that the matching substring must  either  be
              at  the  beginning  of  the  line,  or  preceded  by  a non-word
              constituent character.  Similarly, it must be either at the  end
              of  the  line  or  followed by a non-word constituent character.
              Word-constituent  characters  are  letters,  digits,   and   the
              underscore.
    

    So the final answer to your question is:

    grep -Fwf source.file reference.file
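
    A quick check of this command on made-up sample data may help. The three real-looking lines come from the output above; the MIR4321 line and its ENSG_EXAMPLE accession are invented for illustration, as is the temporary directory.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file
printf '%s\n' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000207447 RNU6-2' \
  'ENSG_EXAMPLE MIR4321' > reference.file

# -F: treat each ID literally; -w: require word boundaries; -f: read IDs from a file.
matches=$(grep -Fwf source.file reference.file)
printf '%s\n' "$matches"
```

    The MIR4321 line is skipped because the character after MIR432 is a digit. Note that -w treats - as a non-word character, so an ID such as RNU6-2 would still match a line ending in RNU6-2-like (a made-up name); if that matters, comparing the whole last field, as the awk answer below does, is stricter.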
    
  • 2021-01-06 11:19

    This was a nice bash-style attempt. The problem is that you overwrite the result file on every iteration. Use >> instead of >, or move the > after done:

    grep -w "$line" reference.file >> outputfile
    

    or

    done  > outputfile
    

    But I would prefer Lev's solution as it starts an external process only once.
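
    The loop from the question is not reproduced here, so the following is only a sketch of what the corrected version might look like, assuming it read IDs line by line from source.file. The file names, the sample data, and the added -F flag (so that IDs are matched literally) are assumptions. Redirecting once after done opens the output file a single time, whereas > inside the loop truncates it on every iteration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file
printf '%s\n' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000207447 RNU6-2' > reference.file

# One grep per ID; the single redirection after "done" collects all results.
while IFS= read -r line; do
  grep -Fw -- "$line" reference.file
done < source.file > outputfile

cat outputfile
```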

    If you want to solve it in pure bash, you could try this:

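    # Read the whitespace-separated IDs into an array.
    ID=($(<IDfile))
    
    while read -r; do
       for ((i = 0; i < ${#ID[@]}; ++i)); do
           # Quoting the ID makes it match literally rather than as a regex.
           [[ $REPLY =~ [[:space:]]"${ID[i]}"$ ]] && echo "$REPLY" && break
       done
    done <RefFile >outputfile
    
    cat outputfile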
    

    Output:

    ENSG00000199537 SNORD115-40
    ENSG00000207793 MIR432
    ENSG00000207447 RNU6-2
    

    Bash 4 and later supports associative arrays, which simplify and speed up the lookup of a key:

    declare -A ID
    for i in $(<IDfile); do ID[$i]=1; done
    
    while read -r v; do
       # Capture the last whitespace-delimited field and look it up in ID.
       [[ $v =~ [[:space:]]([^[:space:]]+)$ && ${ID[${BASH_REMATCH[1]}]} = 1 ]] && echo "$v"
    done <RefFile
    
  • 2021-01-06 11:28

    This will do the trick:

    $ awk 'NR==FNR{a[$0];next}$NF in a{print}' input reference
    ENSG00000199537 SNORD115-40
    ENSG00000207793 MIR432
    ENSG00000207447 RNU6-2
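
    For readers unfamiliar with the two-file idiom, here is the same one-liner spelled out with comments and run on made-up sample data. The file contents are assumptions based on the output above, plus an invented MIR4321 line with an ENSG_EXAMPLE accession; the array name ids and the temporary directory are also illustrative. NR==FNR holds only while awk reads the first file, so the first rule collects IDs and the second prints reference lines whose last field is one of them.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > input
printf '%s\n' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000207447 RNU6-2' \
  'ENSG_EXAMPLE MIR4321' > reference

result=$(awk '
  # First file: NR (global line count) equals FNR (per-file line count).
  NR == FNR { ids[$0]; next }   # remember the ID, then skip to the next line
  # Second file: print the line if its last field is a remembered ID.
  $NF in ids
' input reference)

printf '%s\n' "$result"
```

    Because the last field is compared as a whole string, the MIR4321 line is not printed. One caveat of the idiom: if the first file is empty, NR==FNR is also true for the second file, so nothing is printed.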
    