grep (bash) multi-line pattern

感情败类 2021-01-15 01:14

In bash (4.3.46(1)) I have some multi-line, so-called FASTA records, where each record starts with one line containing >name and the following lines contain DNA sequence ([AGCTNacgtn]),

6 answers
  • 2021-01-15 01:52

    With GNU grep:

    grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file | grep .
    

    Output:

    >chr2
    TTGNACACCC
    TGGGGGAGTA
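
As a rough breakdown of why the command above works: -P enables Perl-compatible regexes so that \n can appear in the pattern, -z makes grep treat the input as NUL-separated (so the whole file is one "line"), and -o prints only the matched text. With -z the output is NUL-terminated as well, which is why a clean-up step follows the first grep. The sketch below wraps the same pattern in a function and swaps `grep .` for `tr -d '\0'` as the clean-up step. The function name fasta_grep is hypothetical, and the sketch assumes a GNU grep built with PCRE support and a record name that contains no regex metacharacters.

```shell
# fasta_grep NAME FILE: print the record headed by ">NAME" using GNU grep.
# Assumption: GNU grep with -P support; fasta_grep is a hypothetical helper name.
fasta_grep() {
    # -P: Perl-compatible regex, so \n can be used in the pattern
    # -z: input and output are NUL-separated, so the file is matched as one unit
    # -o: print only the matched text
    # tr -d '\0' removes the NUL that -z appends after the match
    grep -Pzo ">$1\n[AGCTNacgtn\n]+" "$2" | tr -d '\0'
}
```

Usage would be `fasta_grep chr2 file`. Because the pattern requires a newline directly after the name, a header such as >chr20 is not matched.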
    
  • 2021-01-15 01:54

    Try this (it assumes every record has exactly two sequence lines):

    grep 'chr2' -A 2 file
    >chr2
    TTGNACACCC
    TGGGGGAGTA
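
The -A 2 form relies on each record having exactly two sequence lines, and the pattern chr2 would also match a header such as >chr20. A minimal sketch that avoids both assumptions with plain awk is shown below. The function name fasta_record is hypothetical, and the sketch assumes the header line is exactly ">" followed by the name, with no trailing description.

```shell
# fasta_record NAME FILE: print the record whose header line is exactly ">NAME".
# Assumption: headers carry only the name; fasta_record is a hypothetical helper name.
fasta_record() {
    awk -v name="$1" '
        /^>/ { keep = ($0 == ">" name) }   # at each header, decide whether to print
        keep                               # print every line while inside the wanted record
    ' "$2"
}
```

Usage would be `fasta_record chr2 file`. Since the decision is made at every header line, the record may contain any number of sequence lines.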
    
  • 2021-01-15 01:58

    You could install the FAST Perl package. It contains many utilities for working with FASTA files that can be used directly from the shell, such as fashead and fastail (and many more).

    After installing it, this is as simple as:

    fashead -n2 fastafile | fastail -n1
    

    Output:

    >chr2
    TTGNA.....
    

    Or, even simpler:

    fasgrep chr2 fastafile
    

    This gives the same output.

  • 2021-01-15 02:02

    I created sedgrep, a mixed sed/grep utility, to support this in a generic way. The sedgrep shell command is available at https://github.com/iamdvr/sedgrep-shell-util

    Direct link: https://github.com/iamdvr/sedgrep-shell-util/blob/main/sedgrep

    For your case, the direct command is:

    cat <FILE_NAME> | sed -nr ':main; /^>.*chr2/ { :loop; p; n; /^>/ b main; b loop} '
    

    sedgrep usage is as follows:

    Default NEW_LINE_PATTERN is ^\[
    Usage : 
        cat {INPUT_FILE_NAME}  | sedgrep  {NEW_LINE_PATTERN} {THREAD_OR_SEARCH_PATTERN} 
        cat {INPUT_FILE_NAME}  | sedgrep  {THREAD_OR_SEARCH_PATTERN} 
        sedgrep {NEW_LINE_PATTERN} {THREAD_OR_SEARCH_PATTERN} {INPUT_FILE_NAME}
        sedgrep {THREAD_OR_SEARCH_PATTERN} {INPUT_FILE_NAME}
    Example : 
        cat sampleInput.log | sedgrep 2016-05-23 DB_CONN
        cat sampleInput.log | sedgrep DB_CONN
        sedgrep 2016-05-23 DB_CONN sampleInput.log
        sedgrep DB_CONN sampleInput.log
    
  • 2021-01-15 02:07

    You can use awk with a custom RS (a regular-expression RS needs an awk that supports it, such as GNU awk):

    awk -v n=2 -v RS='(^|\n)>' 'NR==n+1{print ">" $0}' file    
    >chr2
    TTGNACACCC
    TGGGGGAGTA
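
The n+1 in the command above is there because the input starts with the separator, so the (empty) text before the first ">" is counted as a record of its own. A small sketch that makes the numbering visible is shown below. It uses the simpler single-character RS='>' rather than the regex from the answer, and the helper name list_records and the shortened sample sequences are hypothetical.

```shell
# list_records: read FASTA text on stdin and print "record number: [name]".
# Assumption: list_records is a hypothetical helper used only to show numbering.
list_records() {
    # RS='>' splits the input at every ">"; $1 is the first word of each record
    awk -v RS='>' '{ print NR ": [" $1 "]" }'
}

# Shortened sample input with three records
printf '>chr1\nAGCT\n>chr2\nTTGN\n>chr3\nTGAC\n' | list_records
```

With GNU awk this should report an empty record 1, followed by chr1 as record 2 and chr2 as record 3, which is why the second FASTA record is selected with NR==n+1.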
    
  • 2021-01-15 02:07

    The best tool for working with multi-line records is awk.

    In your case:

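    awk 'BEGIN{RS=">"} NR==3 {printf "%s%s", RS, $0}' input.txt
    

    input.txt

    >chr1
    AGCTACTTTT
    AGGGNGGTNN
    >chr2
    TTGNACACCC
    TGGGGGAGTA
    >chr3
    TGACGTGGGT
    TCGGGTTTTT
    

    Explanation:

    BEGIN{RS=">"} sets the record separator to ">" before any input is read.

    NR==3 selects the chr2 record. Because the file starts with ">", the empty text before the first ">" counts as record #1, so chr1 is record #2 and chr2 is record #3.

    {printf "%s%s", RS, $0} prints the record with the ">" that was consumed as the separator put back in front. printf is used rather than print because the record already ends with a newline.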
