问题
In short:
how to convert from fasta to "phylip"-like format (without the sequence and residu counts at the top of the file) using sed
?
A fasta format is like this:
>sequence1
AATCG
GG-AT
>sequence2
AGTCG
GGGAT
The number of lines of a sequence may vary.
I want to convert it to this:
sequence1 AATCG GG-AT
sequence2 AGTCG GGGAT
My question seems simple, but I am lacking a real understanding of the advanced commands in sed
, the multiline commands and the commands using the hold buffer.
Here is the implementation idea I had: fill the pattern space with sequence, and only print it when a new sequence label is encountered. To do this, I would:
- Search lines matching
^>
. If found:- print the previous pattern space
- append line to pattern space
- if
^>
not found:- append line to pattern space
I read this great manual, but I am still unsure about a few things, mostly the difference between the capitalized and little letters:
- when you use P instead of p: does it print the first line of the pattern space (in file order)? I am confused by the use of "up to the next newline".
- do I have to use a loop to read lines until the next sequence name, or are the multiline commands sufficient?
- do I have to use the hold space in this example?
I know python, perl and awk and I think they would be more "human-friendly" tools to achieve this, but I want to learn some advanced sed.
Nothing I tried worked now, but here are some pieces:
This script uses the line numbers, not trying to do pattern matching. It shoes what I want to do, and now I need to automate it using match addresses:
#!/bin/sed -nf
1h
2,3H
4{x; s/\n/ /g; p}
5H
6{H;x; s/\n/ /g; p}
sed -nf fa2phy.sed my.fasta
returns the expected output.
回答1:
With sed
sed '/>/N;:A;/\n>/!{s/\n/ /;N;bA};h;s/\(.*\)\n.*/\1/p;x;s/.*\n//;bA' infile
回答2:
Following simple awk could help you in same.
Solution 1st:
awk '/^>/{sub(/>/,"");if(val){print val, val2};val=$0;val2="";next} {val2=val2?val2 FS $0:$0} END{print val, val2}' Input_file
Solution 2nd:
awk -v RS=">" -v FS="\n" '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":" ")}}' Input_file
Solution 3rd:
awk -v RS=">" '{gsub(/\n/," ");} NF' Input_file
回答3:
Alright, I believe I managed to answer my own question.
Here is the script I made: fa2phy.sed
:
#!/bin/sed -nf
:readseq
${H;b out} # if last line, append to hold, and goto 'out'
1{h;n;b readseq} # if first, overwrite hold, and start again at 'readseq'
/^>/!{H; n; b readseq} # if not a sequence label, append to hold, read next line, start again at 'readseq'. Else, it continues to 'out'
:out
x; # exchange hold content with pattern content
s/^>//; # substitute the starting '>'
s/\n/ /g; # substitute each newline with 2 spaces
p; # print pattern buffer
Although it works, if someone has a shorter or clearer solution, enlighten me! :)
来源:https://stackoverflow.com/questions/46908372/sed-convert-multiline-bloc-to-single-line-ex-fasta-to-phylip-format