Remove line breaks in a FASTA file

予麋鹿 2020-12-05 01:26

I have a FASTA file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

>accession1
ATGGCC         


        
9 answers
  • 2020-12-05 01:54

    You might be interested in bioawk, an adapted version of awk that is tuned to process FASTA files:

    bioawk -c fastx '{ gsub(/\n/,"",$seq); print ">"$name; print $seq }' file.fasta
    

    Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is POSIX-compatible.

  • 2020-12-05 02:01

    The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:

     awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file
    

    Explanation:

    For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.

    Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:

    awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
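
For readers who prefer to see the same logic outside awk, here is a minimal Python sketch of the idea behind the commands above: header lines pass through unchanged, and the lines between two headers are joined into one. The function name `unwrap_fasta` and the sample records are hypothetical, made up for illustration. Unlike the awk versions, this sketch drops empty lines and emits nothing for a header that has no sequence.

```python
def unwrap_fasta(lines):
    """Yield header lines unchanged and each record's sequence as a single line."""
    sequence_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the line terminator only
        if line.startswith(">"):
            # A new header closes the previous record, if it had any sequence.
            if sequence_parts:
                yield "".join(sequence_parts)
                sequence_parts = []
            yield line
        elif line:
            # Non-empty sequence line: collect it without a trailing newline.
            sequence_parts.append(line)
    # The last record has no following header, so flush it here
    # (this plays the role of the END block in the awk command).
    if sequence_parts:
        yield "".join(sequence_parts)


# Hypothetical wrapped input, not taken from the question.
wrapped = [
    ">accession1\n",
    "ATGGCC\n",
    "ATGCCC\n",
    ">accession2\n",
    "GGGTTT\n",
    "AAA\n",
]

for out_line in unwrap_fasta(wrapped):
    print(out_line)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list, so large files are processed one line at a time.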
    
  • 2020-12-05 02:06

    Use this Perl one-liner, which does all of the common reformatting that is necessary in this and similar cases: removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. Note that unlike some of the other answers, this properly handles leading and trailing whitespace/newlines in the file:

    # Create the input for testing:
    
    cat > test_unwrap_in.fa <<EOF
    
    >seq1 with blanks
    ACGT ACGT ACGT
    >seq2 with newlines
    ACGT
    
    ACGT
    
    ACGT
    
    >seq3 without blanks or newlines
    ACGTACGTACGT
    
    EOF
    
    # Reformat with Perl:
    
    perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' test_unwrap_in.fa > test_unwrap_out.fa
    

    Output:

    >seq1 with blanks
    ACGTACGTACGT
    >seq2 with newlines
    ACGTACGTACGT
    >seq3 without blanks or newlines
    ACGTACGTACGT
    

    The Perl one-liner uses these command line flags:
    -e : Tells Perl to look for code in-line, instead of in a file.
    -n : Loop over the input one line at a time, assigning it to $_ by default.

    chomp : Remove the input line separator (\n on *NIX).
    if ( /^>/ ) : Test if the current line is a sequence header line.
    $n : This variable is undefined (false) at the beginning and true after the first sequence header has been seen. When it is true, we print an extra newline before the header, which terminates the preceding sequence.
    s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
    END { print "\n"; } : Print the final newline after the last sequence.
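
The steps listed above can also be sketched in Python, this time collecting whole records instead of printing as it goes. This is only an illustration under stated assumptions: the function name `read_records` is hypothetical, and the sample text is the test input from this answer. One difference from the Perl one-liner is that any text before the first header is ignored instead of printed.

```python
def read_records(text):
    """Return (header, sequence) pairs with all whitespace removed from sequences."""
    records = []
    header = None
    parts = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the record collected so far.
            if header is not None:
                records.append((header, "".join(parts)))
            header = line
            parts = []
        elif header is not None:
            # Same effect as s/\s+//g: drop blanks inside the sequence line.
            parts.append("".join(line.split()))
    # Flush the final record, which has no header after it.
    if header is not None:
        records.append((header, "".join(parts)))
    return records


# The test input from the answer above, including the blank lines.
sample = """
>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT

ACGT

ACGT

>seq3 without blanks or newlines
ACGTACGTACGT

"""

for header, sequence in read_records(sample):
    print(header)
    print(sequence)
```

Keeping the records as pairs makes follow-up steps, such as filtering by sequence length, straightforward, at the cost of holding the whole file in memory.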
