I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:
>accession1
ATGGCC
Do not reinvent the wheel. If the goal is simply to remove the newlines from a multi-line fasta file (that is, to unwrap it), use one of the specialized bioinformatics tools, for example seqtk, like so:
seqtk seq -l 0 input_file
Example:
# Create the input for testing:
cat > test_unwrap_in.fa <<EOF
>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT
ACGT
ACGT
>seq3 without blanks or newlines
ACGTACGTACGT
EOF
# Unwrap lines:
seqtk seq -l 0 test_unwrap_in.fa > test_unwrap_out.fa
cat test_unwrap_out.fa
Output:
>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT
To install seqtk, you can use, for example, conda install seqtk.
SEE ALSO:
seqtk
Usage of seqtk seq:
Usage: seqtk seq [options] <in.fq>|<in.fa>
Options: ...
-l INT number of residues per line; 0 for 2^32-1 [0]
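To make the meaning of the -l option concrete, here is a small Python sketch of line wrapping with a "0 means no wrapping" convention. It is only an illustration of the behaviour described in the usage text above, not seqtk's implementation; the function name format_record and its parameters are made up for this example.

```python
def format_record(header, sequence, width=0):
    """Return one FASTA record as text.

    width > 0 wraps the sequence at that many residues per line;
    width = 0 writes the whole sequence on a single line
    (the effect of the unwrapping shown above).
    """
    if width <= 0:
        lines = [sequence]
    else:
        # Cut the sequence into chunks of at most `width` residues.
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Wrapped at 4 residues per line, then unwrapped.
print(format_record("seq2", "ACGTACGTACGT", width=4), end="")
print(format_record("seq2", "ACGTACGTACGT", width=0), end="")
```

The first call prints the sequence on three lines of four residues, the second prints it on one line.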
This awk program:
% awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta
Will yield:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.
On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.
End with a newline, if required.
By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.
--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
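For readers who find the deferred-newline trick easier to follow outside awk, the same three rules can be sketched in Python. This is a rough equivalent, not a line-by-line translation; the function name unwrap, the variable pending (standing in for awk's n) and the demo records are invented for illustration.

```python
import io
import sys


def unwrap(lines, out=sys.stdout):
    """Write FASTA lines to `out` with each sequence on a single line."""
    pending = ""  # plays the role of awk's n: a newline we still owe
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header: first close the previous sequence line (if any).
            out.write(pending + line + "\n")
            pending = ""
        else:
            # Sequence: print without a line break, remember the newline.
            out.write(line)
            pending = "\n"
    # End with a newline only if the last line was sequence data.
    out.write(pending)


# Made-up input, wrapped at 6 residues per line.
demo = [">accession1\n", "ATGGCC\n", "CATGGG\n",
        ">accession2\n", "GATATC\n", "CATGAA\n"]
buf = io.StringIO()
unwrap(demo, buf)
print(buf.getvalue(), end="")
```

Because pending starts out empty, nothing extra is printed before the first header, which is the same reason the awk version needs no explicit initialization of n.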
I would use sed for this. Using GNU sed:
sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file
Results:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Explanation:
Create a label, a. If the current line is not the last line in the file, append the next line to pattern space. If pattern space doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/, which removes a newline that is followed by anything other than >. If the substitution was successful since the last input line was read, then branch to label a. Print up to the first embedded newline of the current pattern space. If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.
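The core idea of the substitution (drop a newline when the text before it is not a header and the character after it is not >) can also be written as a single regular expression in Python. This is only a rough analogue of the sed script, not a translation of it: it works on the whole file contents held in memory rather than on a pattern space, and the names JOIN and unwrap_text are invented for this sketch.

```python
import re

# A line that does not start with ">" (negative lookahead), its trailing
# newline, and then a lookahead requiring that the next character exists
# and is not ">". Only the newline is dropped; the line itself is kept.
JOIN = re.compile(r"^(?!>)(.*)\n(?=[^>])", flags=re.MULTILINE)


def unwrap_text(text):
    """Return FASTA text with each record's sequence lines joined."""
    return JOIN.sub(r"\1", text)


wrapped = ">a\nAT\nGC\n>b\nGG\nCC\nTT\n"
print(unwrap_text(wrapped), end="")
```

The final newline of the file is preserved because the lookahead needs a following character, much as the sed substitution needs a character after the newline to match.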
There have been great responses so far.
Here is an efficient way to do this in Python:
def read_fasta(fasta):
    with open(fasta, 'r') as fast:
        headers, sequences = [], []
        for line in fast:
            if line.startswith('>'):
                # New record: store the header and start an empty sequence.
                head = line.replace('>','').strip()
                headers.append(head)
                sequences.append('')
            else:
                # Sequence line: append it to the current record, skipping blank lines.
                seq = line.strip()
                if len(seq) > 0:
                    sequences[-1] += seq
    return (headers, sequences)

def write_fasta(headers, sequences, fasta):
    with open(fasta, 'w') as fast:
        for i in range(len(headers)):
            fast.write('>' + headers[i] + '\n' + sequences[i] + '\n')
You can use the above functions to read the headers and sequences from a fasta file with the line breaks removed, manipulate them, and write them back to a fasta file.
headers, sequences = read_fasta('input.fasta')
new_headers = do_something(headers)
new_sequences = do_something(sequences)
write_fasta(new_headers, new_sequences, 'input.fasta')
Here is another awk one-liner that should work for your case:
awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
Another variation :-)
awk '!/>/{printf( "%s", $0);next}
NR>1{printf( "\n")}
END {printf"\n"}
7' YourFile