Question
I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:
>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA
I'd like to convert it into this:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
I found a potential solution on this site, which looks like this:
cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta
However, this places an extra line break between each entry, so the file looks like this:
>accession1
ATGGCCCATGGGATCCTAGC
{empty line}
>accession2
GATATCCATGAAACGGCTTA
I'm an awk noob, but I took a shot at modifying the command. My guess was that if (p){print "\n";} was the culprit: potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline. This is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:
awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta
However, this adds an empty line at the beginning of the file because it's always printing a new line before it prints the first accession number:
{empty line}
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Anyone have a solution to get my file in the correct format? Thanks!
Answer 1:
This awk program:
% awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta
Will yield:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Explanation:
On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.
On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.
End with a newline, if required.
Note:
By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.
--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
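This default is why the first header line comes out without a leading blank line: n has never been assigned when it is first printed. A minimal sketch to check this, assuming any POSIX-compatible awk (the shell variable names out and first are only for illustration):

```shell
# n is never assigned: it is "" in string context and 0 in numeric context.
out=$(awk 'BEGIN { printf "[%s] [%d]\n", n, n + 0 }')

# Concatenating the unset n in front of a header therefore adds nothing.
first=$(awk 'BEGIN { print n ">accession1" }')

printf '%s\n%s\n' "$out" "$first"
```

The first line printed should show empty brackets followed by a zero, and the second should be the header with nothing in front of it.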
Answer 2:
The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:
awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file
Explanation:
For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.
Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:
awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
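To see what the output record separator is doing here, a small sketch may help, assuming any POSIX-compatible awk (the shell variable names lines and joined are only for illustration). print always appends ORS, which is a newline by default, so print "\n" writes two newlines. With ORS set to the empty string, the default print triggered by the pattern 1 writes records back to back:

```shell
# print appends ORS (a newline by default), so print "\n" emits two newlines.
# The second awk counts the resulting lines.
lines=$(printf 'x\n' | awk '{ print "\n" }' | awk 'END { print NR }')

# With ORS empty, the default print triggered by the pattern 1 joins records.
joined=$(printf 'AAA\nCCC\n' | awk -v ORS= '1')

printf 'lines=%s joined=%s\n' "$lines" "$joined"
```

This is also why the command in the question produced blank lines: replacing print "\n" with printf "\n" writes exactly one newline.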
Answer 3:
Here is another awk one-liner that should work for your case:
awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
Answer 4:
I would use sed for this. Using GNU sed:
sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file
Results:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Explanation:
Create a label, a. If the line is not the last line in the file, append the next line to pattern space. If pattern space doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/. If the substitution was successful since the last input line was read, branch to label a. Print up to the first embedded newline of the current pattern space. If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart the cycle with the resultant pattern space, without reading a new line of input.
Answer 5:
Another variation :-)
awk '!/>/{printf( "%s", $0);next}
NR>1{printf( "\n")}
END {printf"\n"}
7' YourFile
Answer 6:
You might be interested in bioawk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '{ gsub(/\n/,"",seq); print ">"$name; print $seq }' file.fasta
Note: bioawk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
Source: https://stackoverflow.com/questions/15857088/remove-line-breaks-in-a-fasta-file