In bash (4.3.46(1)) I have some multi-line, so-called FASTA records, where each record starts with one line containing >name and the following lines contain the DNA sequence ([AGCTNacgtn]).
With GNU grep:
grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file | grep .
Output:
>chr2
TTGNACACCC
TGGGGGAGTA
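Try this, which prints the matching line plus the two lines after it (so it relies on the record having exactly two sequence lines):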
grep 'chr2' -A 2 file
>chr2
TTGNACACCC
TGGGGGAGTA
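When the number of sequence lines per record is not fixed, a line-oriented awk filter with a flag can print from the wanted header up to the next header instead. This is only a minimal sketch: the function name fasta_by_name is made up for illustration, and it assumes the header line is exactly ">" followed by the name, with no description after it.

```shell
# fasta_by_name NAME FILE
# Print the record whose header line is exactly ">NAME".
# (Hypothetical helper, not part of any package.)
fasta_by_name() {
    awk -v name="$1" '
        /^>/ { keep = ($0 == ">" name) }   # on every header line, set or clear the flag
        keep                               # print lines while the flag is set
    ' "$2"
}
```

Called as fasta_by_name chr2 file, it should print the >chr2 header and every sequence line up to, but not including, the next header.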
You should install the FAST Perl package. It contains many utilities that can be used directly from the shell for working with FASTA files, such as fashead and fastail (and many more).
After installing it, this is as simple as:
fashead -n2 fastafile | fastail -n1
Output:
>chr2
TTGNA.....
Or, even simpler:
fasgrep chr2 fastafile
with the same output.
I created sedgrep, a mix of sed and grep, to handle this kind of search in a generic way. The sedgrep shell command is available at https://github.com/iamdvr/sedgrep-shell-util
Direct link: https://github.com/iamdvr/sedgrep-shell-util/blob/main/sedgrep
For your case, the direct command is:
cat <FILE_NAME> | sed -nr ':main; /^>.*chr2/ { :loop; p; n; /^>/ b main; b loop} '
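The branching in that one-liner is easier to follow when each sed command sits on its own line. The sketch below spells out the same logic; the wrapper name sed_record and its arguments are assumptions made for illustration, it was written with GNU sed in mind, and the pattern is pasted into the script unescaped. As in the one-liner, the pattern may match anywhere in the header, so chr2 would also select a record named chr20.

```shell
# sed_record PATTERN FILE
# Print every record whose ">" header line matches PATTERN.
# (Hypothetical wrapper around the same sed logic; PATTERN is not escaped.)
sed_record() {
    sed -n "
        :main
        /^>.*$1/ {
            :loop
            # print the current line of the wanted record
            p
            # fetch the next input line (sed stops here at end of input)
            n
            # a new header: go back and test it against the pattern
            /^>/ b main
            # otherwise keep printing the same record
            b loop
        }
    " "$2"
}
```

For example, sed_record chr2 file should print the same three lines as the direct command.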
sedgrep usage is as follows:
Default NEW_LINE_PATTERN is ^\[
Usage :
cat {INPUT_FILE_NAME} | sedgrep {NEW_LINE_PATTERN} {THREAD_OR_SEARCH_PATTERN}
cat {INPUT_FILE_NAME} | sedgrep {THREAD_OR_SEARCH_PATTERN}
sedgrep {NEW_LINE_PATTERN} {THREAD_OR_SEARCH_PATTERN} {INPUT_FILE_NAME}
sedgrep {THREAD_OR_SEARCH_PATTERN} {INPUT_FILE_NAME}
Example :
cat sampleInput.log | sedgrep 2016-05-23 DB_CONN
cat sampleInput.log | sedgrep DB_CONN
sedgrep 2016-05-23 DB_CONN sampleInput.log
sedgrep DB_CONN sampleInput.log
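You can use awk with a custom RS (gawk and mawk treat a multi-character RS as a regular expression; POSIX leaves that case unspecified):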
awk -v n=2 -v RS='(^|\n)>' 'NR==n+1{print ">" $0}' file
>chr2
TTGNACACCC
TGGGGGAGTA
The best tool for working with multi-line records is awk.
In your case:
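awk 'BEGIN{RS=">"; ORS=""} NR==3 {print RS$0}' input.txt
With input.txt containing:
>chr1
AGCTACTTTT
AGGGNGGTNN
>chr2
TTGNACACCC
TGGGGGAGTA
>chr3
TGACGTGGGT
TCGGGTTTTT
BEGIN{RS=">"; ORS=""}
Initially set the record separator to ">" and the output record separator to the empty string (each record already ends with its own newline, so print would otherwise add a blank line).
NR==3
Filter for record #3 only. Because the file starts with ">", record #1 is the empty string before the first ">", record #2 is chr1, and record #3 is chr2.
{print RS$0}
Print the selected record with the ">" that was consumed as the record separator put back.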
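The command above selects a record by its position. To select by name with the same record-separator idea, the newline can be made the field separator so that $1 holds the header text. A minimal sketch, assuming the header is just the name with no description after it; the function name fasta_record is made up for illustration:

```shell
# fasta_record NAME FILE
# (Hypothetical helper.) ">" separates records and a newline separates
# fields, so the first field of each record is its header text.
fasta_record() {
    awk -v name="$1" '
        BEGIN { RS = ">"; FS = "\n"; ORS = "" }
        $1 == name { print ">" $0 }   # put back the ">" consumed as separator
    ' "$2"
}
```

For example, fasta_record chr2 input.txt should print the >chr2 record. The empty record before the first ">" never matches, so it needs no special handling.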