How to extract FASTA sequence using sequence ID (shell script)

Question

I have the following sequences in FASTA format, each with a sequence header followed by its sequence. How can I compare two files (Kcompare.pep and clade1i.txt) and extract the sequences whose headers match an ID listed in clade1i.txt? Can anyone help me?

Kcompare.pep

>ztr:MYCGRDRAFT_45998
MAAPLHAEGPIRTPYTGVELLNTPYLNKGTAFPADERRVLGLTALLPTSVHTLDQQLQRA
WHQYQSRDNDLARNTFLTSLKEQNEVLYYRLVLDHLSEVFSIIYTPTEGEAIQRYSSLFR
>kal:KALB_5042
MTAEVAVVSDGSAIPGASPPATLPLLQDYAELVREHAGLSAVPLAVDSARLAAELCALPK
RFRAVFLTHTDPERAFQVQRAVAKAGGPLVITDQDTTAISLTASTLTTLARRGRSPSDSR

clade1i.txt

cpo:COPRO5265_0583
ble:BleG1_3845
kal:KALB_5042

Expected output

>kal:KALB_5042
MTAEVAVVSDGSAIPGASPPATLPLLQDYAELVREHAGLSAVPLAVDSARLAAELCALPK
RFRAVFLTHTDPERAFQVQRAVAKAGGPLVITDQDTTAISLTASTLTTLARRGRSPSDSR

I tried to run the following, but it produced neither an error nor any output.

for i in K*
do
    echo $i

    awk -F ' ' '{print $1}' $i/$i.pep > Kcompare.pep
    mv Kcompare.pep $i

    awk -F '_' '{print $2":"$3"_"$4}' $i/firstClade.txt > $i/clade1i.txt
    awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' $i/Kcompare.pep | awk -F"\t"  'BEGIN{while((getline k <"$i/clade1i.txt")>0)i[k]=1}{gsub("^>","",$0);if(i[$1]){print ">"$1"\n"$2}}' > $i/firsti.pep
done

Answer 1:

Using awk:

awk 'NR==FNR{a[">"$0];next}/^>/{f=0;}($0 in a)||f{print;f=1}' clade1i.txt Kcompare.pep

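The first block (NR==FNR) reads clade1i.txt and stores each ID, prefixed with '>', as an array key. While reading Kcompare.pep, every line beginning with '>' resets the flag f. If the line is one of the stored headers, it is printed and f is set, so the sequence lines that follow keep printing until the next line beginning with '>' is encountered.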
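The one-liner compares whole header lines, so a header that carries a description after the ID, or a file with Windows line endings, would not match. Below is a rough sketch of the same flag idea that compares only the first word of each header. It assumes the ID is the first whitespace-delimited word after '>'; the function name extract_fasta and the variable names are made up for illustration and are not part of the original answer.

```shell
# extract_fasta is a hypothetical helper name used only for this sketch.
# Usage: extract_fasta IDS_FILE FASTA_FILE
# Note: like the one-liner, this relies on NR==FNR, so IDS_FILE must not be empty.
extract_fasta() {
  awk '
    { sub(/\r$/, "") }                          # drop a trailing CR (Windows line endings)
    NR == FNR { if ($1 != "") want[$1]; next }  # first file: remember each non-empty ID
    /^>/ {                                      # header line: decide whether to keep the record
      id = substr($1, 2)                        # first word without the leading ">"
      keep = (id in want)
    }
    keep                                        # print header and sequence lines of kept records
  ' "$1" "$2"
}
```

In a per-directory loop like the one in the question, this could be called as extract_fasta "$i/clade1i.txt" "$i/Kcompare.pep" > "$i/firsti.pep". Passing the file names as arguments matters: in the question's script, "$i/clade1i.txt" sits inside the single-quoted awk program, where the shell does not expand $i, so awk looks for a file literally named $i/clade1i.txt, getline fails silently, and nothing is printed.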

Answer 2:

Use this:

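while IFS= read -r l; do
  sed -n '/^>'"$l"'$/,/^>/{/^>'"$l"'$/p;/^>/!p;}' Kcompare.pep
done <clade1i.txt

The while loop goes through the clade1i.txt file line by line.
sed -n suppresses automatic printing.
/regex/,/regex/ selects every line from the first match through the next match; here that is from the wanted header through the next header line (or the end of the file).
Inside the braces, p prints the wanted header and every non-header line, so the header of the following record is left out.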
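Because the loop splices each ID into a sed regular expression, an ID containing characters such as '.' or '*' could match more headers than intended. The following is a minimal sketch of one way to guard against that by escaping the ID first. The function name fasta_record is hypothetical, and the sketch assumes each header line consists of '>' followed by exactly the ID.

```shell
# fasta_record is a hypothetical helper name used only for this sketch.
# Usage: fasta_record ID FASTA_FILE
fasta_record() {
  # Escape basic-regex metacharacters and "/" so the ID is matched literally.
  esc=$(printf '%s\n' "$1" | sed 's|[][\\/.*^$]|\\&|g')
  # Select from the wanted header through the next header line, then print
  # the wanted header and the non-header lines only.
  sed -n '/^>'"$esc"'$/,/^>/{
    /^>'"$esc"'$/p
    /^>/!p
  }' "$2"
}

# Possible use with the ID list, one sed pass per ID:
#   while IFS= read -r l; do fasta_record "$l" Kcompare.pep; done <clade1i.txt
```

Running sed once per ID rereads the FASTA file each time, which is fine for short ID lists; for long lists a single-pass approach is likely to be faster.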

Source: https://stackoverflow.com/questions/34670874/how-to-extract-fasta-sequence-using-sequence-id-shell-script

Tags

shell

fasta