sed: convert a multiline block to a single line (e.g. fasta to phylip format)

大兔子大兔子 submitted on 2019-12-11 08:19:53


In short:

How can I convert from fasta to a "phylip"-like format (without the sequence and residue counts at the top of the file) using sed?

A fasta file looks like this:

>sequence1
AATCG
GG-AT
>sequence2
AGTCG
GGGAT

The number of lines of a sequence may vary.

I want to convert it to this:

sequence1 AATCG GG-AT
sequence2 AGTCG GGGAT

My question seems simple, but I am lacking a real understanding of the advanced commands in sed, the multiline commands and the commands using the hold buffer.

Here is the implementation idea I had: fill the pattern space with sequence, and only print it when a new sequence label is encountered. To do this, I would:

  1. Look for lines matching ^>. If one is found:
    • print the previous pattern space
    • append the line to the pattern space
  2. If ^> is not found:
    • append the line to the pattern space
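The steps above can be sketched using only the pattern space and the multiline commands N, P and D. This is a minimal sketch, assuming GNU sed; the function name fasta_to_lines is hypothetical and only there to make the example reusable.

```shell
# Hypothetical helper (name is an assumption); assumes GNU sed.
fasta_to_lines() {
    sed -n '
        :a
        $!{
            N;                       # append the next input line to the pattern space
            /\n>/!{ s/\n/ /; ba; }   # not a new label yet: join with a space and keep reading
        }
        s/^>//;                      # drop the leading ">" of the finished record
        P;                           # print up to the first newline, i.e. the finished record
        D;                           # delete what was printed and rerun on the next label, if any
    ' "$@"
}

printf '>sequence1\nAATCG\nGG-AT\n>sequence2\nAGTCG\nGGGAT\n' | fasta_to_lines
# prints:
# sequence1 AATCG GG-AT
# sequence2 AGTCG GGGAT
```

The `$!` guard keeps N from being called on the last line, so the final record goes through the same P and D path as the others.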

I read this great manual, but I am still unsure about a few things, mostly the difference between the uppercase and lowercase commands:

  • when you use P instead of p: does it print the first line of the pattern space (in file order)? I am confused by the use of "up to the next newline".
  • do I have to use a loop to read lines until the next sequence name, or are the multiline commands sufficient?
  • do I have to use the hold space in this example?
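A small experiment may help with the uppercase versus lowercase question. The sketch below assumes GNU sed; demo is a hypothetical helper that feeds the two lines "one" and "two" to a sed script, and N first joins them into a single pattern space ("one", a newline, "two").

```shell
# Hypothetical helper: run a sed script (with -n) on the two lines "one" and "two".
demo() { printf 'one\ntwo\n' | sed -n "$1"; }

demo 'N;p'          # p prints the whole pattern space: both lines
demo 'N;P'          # P prints only up to the first newline: "one"
demo '1{N;d;}; p'   # d deletes the whole pattern space and starts a new cycle: nothing is printed
demo '1{N;D;}; p'   # D deletes only up to the first newline and reruns the script on "two", which p prints
```

So "up to the first newline" only matters once N (or G) has put more than one line into the pattern space. Because the number of lines per sequence varies, some form of loop or repeated N is still needed; the hold space is one option for that, not a requirement.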

I know python, perl and awk and I think they would be more "human-friendly" tools to achieve this, but I want to learn some advanced sed.

Nothing I have tried has worked so far, but here are some pieces:

This script uses line numbers rather than pattern matching. It shows what I want to do, and now I need to automate it using match addresses:

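#!/bin/sed -nf
1h;                            # line 1 (first label): start the hold space
2,3H;                          # lines 2-3 (first sequence): append to the hold space
4{x; s/^>//; s/\n/ /g; p}      # line 4 (second label): swap, join the lines, print the first record
5H;                            # line 5: append to the second record
6{H;x; s/^>//; s/\n/ /g; p}    # line 6 (last line): append, swap, join the lines, print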

sed -nf fa2phy.sed my.fasta returns the expected output.
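One way to replace the line numbers with match addresses is to keep the current record in the hold space and print it whenever the next label, or the end of input, is reached. This is a minimal sketch, assuming GNU sed; the function name fa2phy_hold is hypothetical.

```shell
# Hypothetical helper (name is an assumption); assumes GNU sed.
fa2phy_hold() {
    sed -n '
        /^>/!H;                            # sequence line: append it to the record in the hold space
        /^>/{                              # label line: a new record starts here
            x;                             # hold the new label, fetch the finished record
            /./{ s/^>//; s/\n/ /g; p; };   # print it on one line (empty before the first label)
        }
        ${ x; s/^>//; s/\n/ /g; p; };      # end of input: fetch and print the last record
    ' "$@"
}

printf '>sequence1\nAATCG\nGG-AT\n>sequence2\nAGTCG\nGGGAT\n' | fa2phy_hold
# prints:
# sequence1 AATCG GG-AT
# sequence2 AGTCG GGGAT
```

No explicit loop is needed here, because each input line is handled in its own cycle and the hold space carries the record across cycles.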


With sed (the leading > of each label is kept in the output):

sed '/>/N;:A;/\n>/!{s/\n/ /;N;bA};h;s/\(.*\)\n.*/\1/p;x;s/.*\n//;bA' infile


The following simple awk commands could help you do the same.

Solution 1:

awk '/^>/{sub(/>/,"");if(val){print val, val2};val=$0;val2="";next} {val2=val2?val2 FS $0:$0} END{print val, val2}'  Input_file

Solution 2:

awk -v RS=">" -v FS="\n" '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":" ")}}'   Input_file

Solution 3:

awk -v RS=">" '{gsub(/\n/," ");} NF'   Input_file


Alright, I believe I managed to answer my own question.

Here is the script I made, fa2phy.sed:

#!/bin/sed -nf

:readseq
${H;b out}              # if last line, append to hold, and go to 'out'
1{h;n;b readseq}        # if first line, overwrite hold, read next line, and start again at 'readseq'
/^>/!{H; n; b readseq}  # if not a sequence label, append to hold, read next line, start again at 'readseq'; otherwise fall through to 'out'

:out
x;         # exchange hold content with pattern content
s/^>//;    # remove the leading '>'
s/\n/  /g; # substitute each newline with 2 spaces
p;         # print pattern buffer

Although it works, if someone has a shorter or clearer solution, enlighten me! :)

