I have a FASTA file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:
>accession1
ATGGCC
You might be interested in bioawk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '{ gsub(/\n/,"",$seq); print ">"$name; print $seq }' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is compatible with POSIX.
The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:
awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file
Explanation:
For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.
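To see the effect on a concrete input, here is a minimal sketch that pipes a small wrapped FASTA sample through the same awk program. The sample records (accession1, accession2 and their bases) and the shell variable name unwrapped are made up for illustration.

```shell
# Hypothetical sample: two records, each wrapped over several lines.
unwrapped=$(
  printf '>accession1\nATGGCC\nCATTGT\nAATGGG\n>accession2\nGATATT\nACC\n' |
  awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }'
)

# Each record should now be one header line followed by one sequence line.
printf '%s\n' "$unwrapped"
```

This should print four lines: each header followed by its sequence joined onto a single line (ATGGCCCATTGTAATGGG and GATATTACC).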
Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:
awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
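As a quick check that the two forms behave the same, the sketch below runs both on one input and compares the results. The sample data and the variable names (sample, explicit, terse) are assumptions for illustration. Note that $(...) strips trailing newlines, so the comparison ignores any difference there.

```shell
# Hypothetical wrapped input shared by both commands.
sample='>a
AC
GT
>b
TT
GG'

# Explicit version: print headers, concatenate sequence lines.
explicit=$(printf '%s\n' "$sample" |
  awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }')

# Terse version: empty ORS, headers rewritten to carry their own newlines.
terse=$(printf '%s\n' "$sample" |
  awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1')

if [ "$explicit" = "$terse" ]; then
  echo "identical output"
else
  echo "outputs differ"
fi
```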
Use this Perl one-liner, which does the common reformatting needed in this and similar cases: it removes newlines and other whitespace from the sequence (which also unwraps the sequence), but does not change the sequence header lines. Note that unlike some of the other answers, this properly handles leading and trailing whitespace/newlines in the file:
# Create the input for testing:
cat > test_unwrap_in.fa <<EOF
>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT
ACGT
ACGT
>seq3 without blanks or newlines
ACGTACGTACGT
EOF
# Reformat with Perl:
perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' test_unwrap_in.fa > test_unwrap_out.fa
Output:
>seq1 with blanks
ACGTACGTACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT
The Perl one-liner uses these command line flags:
-e: Tells Perl to look for code in-line, instead of in a file.
-n: Loop over the input one line at a time, assigning it to $_ by default.
chomp: Remove the input line separator (\n on *NIX).
if ( /^>/ ): Test if the current line is a sequence header line.
$n: This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; }: Print the final newline after the last sequence.
s/\s+//g; print;: If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.