Remove line breaks in a FASTA file

予麋鹿 2020-12-05 01:26

I have a FASTA file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

>accession1
ATGGCC         


        
9 Answers
  • 2020-12-05 01:40

    Do not reinvent the wheel. If the goal is simply to remove newlines from a multi-line FASTA file (that is, to unwrap it), use one of the specialized bioinformatics tools, for example seqtk:

    seqtk seq -l 0 input_file
    

    Example:

    # Create the input for testing:
    
    cat > test_unwrap_in.fa <<EOF
    
    >seq1 with blanks
    ACGT ACGT ACGT
    >seq2 with newlines
    ACGT
    
    ACGT
    
    ACGT
    
    >seq3 without blanks or newlines
    ACGTACGTACGT
    
    EOF
    
    # Unwrap lines:
    
    seqtk seq -l 0 test_unwrap_in.fa > test_unwrap_out.fa
    
    cat test_unwrap_out.fa
    

    Output:

    >seq1 with blanks
    ACGT ACGT ACGT
    >seq2 with newlines
    ACGTACGTACGT
    >seq3 without blanks or newlines
    ACGTACGTACGT
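
To confirm that a file really is unwrapped after running a tool like this, a small check can help. The following is a minimal Python sketch, not part of seqtk; the function name `is_unwrapped` is hypothetical, and it assumes that a header is any line starting with `>` and that blank lines do not count as sequence lines.

```python
def is_unwrapped(lines):
    """Return True if every record has at most one non-blank sequence line."""
    seq_lines = 0  # sequence lines seen since the most recent header
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_lines = 0  # a new record starts here
        elif line.strip():
            seq_lines += 1
            if seq_lines > 1:
                return False  # a second sequence line means the record is wrapped
    return True


# Sample records modelled on the test file above.
wrapped = [">seq2 with newlines", "ACGT", "", "ACGT", "", "ACGT"]
unwrapped = [">seq2 with newlines", "ACGTACGTACGT"]
print(is_unwrapped(wrapped), is_unwrapped(unwrapped))
```

With a real file, the function can be fed an open file handle, since it only iterates over lines.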
    

    To install seqtk, you can use, for example, conda install seqtk.

    SEE ALSO:

    seqtk usage:

    seqtk seq
    
    Usage:   seqtk seq [options] <in.fq>|<in.fa>
    
    Options: ...
             -l INT    number of residues per line; 0 for 2^32-1 [0]
    
  • 2020-12-05 01:41

    This awk program:

    % awk '!/^>/ { printf "%s", $0; n = "\n" } 
    /^>/ { print n $0; n = "" }
    END { printf "%s", n }
    ' input.fasta
    

    Will yield:

    >accession1
    ATGGCCCATGGGATCCTAGC
    >accession2
    GATATCCATGAAACGGCTTA
    

    Explanation:

    On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

    On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.

    End with a newline, if required.
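
The same line-by-line logic can be sketched in Python for readers who prefer it to awk. This is an illustrative equivalent, not a translation of the awk program: the name `unwrap_fasta` is hypothetical, and unlike the awk version it also strips surrounding whitespace and skips blank lines. The wrapped input in the usage lines is an assumed line layout for the two example records.

```python
def unwrap_fasta(lines):
    """Yield headers unchanged and each record's sequence as a single line."""
    pending = []  # sequence chunks collected since the last header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if pending:
                # Flush the previous record's sequence before the new header.
                yield "".join(pending)
                pending = []
            yield line
        elif line:
            pending.append(line)
    if pending:
        # Flush the final record, which has no header after it.
        yield "".join(pending)


wrapped = [
    ">accession1", "ATGGCC", "CATGGGATCC", "TAGC",
    ">accession2", "GATATCCATG", "AAACGGCTTA",
]
for out_line in unwrap_fasta(wrapped):
    print(out_line)
```

Because it is a generator, the sketch holds only one record's sequence in memory at a time, which mirrors the streaming behaviour of the awk program.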

    Note:

    By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.

    --6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide

  • 2020-12-05 01:41

    I would use sed for this. Using GNU sed:

    sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file
    

    Results:

    >accession1
    ATGGCCCATGGGATCCTAGC
    >accession2
    GATATCCATGAAACGGCTTA
    

    Explanation:

    Create a label, a. If the line is not the last line in the file, append it to pattern space. If the line doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/. If the substitution was successful since the last input line was read, then branch to label a. Print up to the first embedded newline of the current pattern space. If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.
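
The underlying idea (drop a newline only when the text on both sides of it is sequence, never a header) can also be sketched as a single regular expression in Python. This is not how sed works internally: it assumes the whole file fits in memory, uses `\n` line endings, and leaves blank lines in place. The names `JOIN_SEQUENCE_LINES` and `unwrap_text` are hypothetical.

```python
import re

# Match a non-empty, non-header line plus its trailing newline, but only when
# the next line also starts with a character that is neither '>' nor a newline.
JOIN_SEQUENCE_LINES = re.compile(r"^([^>\n][^\n]*)\n(?=[^>\n])", re.MULTILINE)


def unwrap_text(text):
    """Return text with newlines between consecutive sequence lines removed."""
    # Keep the matched line (group 1) and drop the newline that followed it.
    return JOIN_SEQUENCE_LINES.sub(r"\1", text)


sample = ">a1\nATGG\nCCCA\nTG\n>a2\nGATA\nTC\n"
print(unwrap_text(sample), end="")
```

The lookahead does not consume the next line, so a run of three or more sequence lines is joined in one pass.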

  • 2020-12-05 01:52

    There have been great responses so far.

    Here is an efficient way to do this in Python:

    def read_fasta(fasta):
        with open(fasta, 'r') as fast:
            headers, sequences = [], []
            for line in fast:
                if line.startswith('>'):
                    head = line.replace('>','').strip()
                    headers.append(head)
                    sequences.append('')
                else:
                    seq = line.strip()
                    if len(seq) > 0:
                        sequences[-1] += seq
        return (headers, sequences)
    
    
    def write_fasta(headers, sequences, fasta):
        with open(fasta, 'w') as fast:
            for i in range(len(headers)):
                fast.write('>' + headers[i] + '\n' + sequences[i] + '\n')
    

    You can use the above functions to retrieve sequences/headers from a fasta file without line breaks, manipulate them, and write back to a fasta file.

    headers, sequences = read_fasta('input.fasta')
    new_headers = do_something(headers)
    new_sequences = do_something(sequences)
    write_fasta(new_headers, new_sequences, 'input.fasta')
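
Because the usage above writes back to the same path it read from, an interrupted write could leave a truncated file. One cautious variant, sketched here as an assumption rather than as part of the original answer, writes to a temporary file in the same directory and then swaps it into place. The name `write_fasta_atomic` is hypothetical; it takes the same arguments as `write_fasta`.

```python
import os
import tempfile


def write_fasta_atomic(headers, sequences, fasta):
    """Write records to a temporary file, then replace the target with it."""
    # The temporary file lives beside the target so the final rename stays on
    # one filesystem.
    directory = os.path.dirname(os.path.abspath(fasta))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for head, seq in zip(headers, sequences):
                tmp.write(">" + head + "\n" + seq + "\n")
        # The target is only replaced once the whole file has been written.
        os.replace(tmp_path, fasta)
    except BaseException:
        # Clean up the partial temporary file and re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Called as `write_fasta_atomic(new_headers, new_sequences, 'input.fasta')`, it leaves the original file untouched if anything fails before the final replace.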
    
  • 2020-12-05 01:53

    Here is another awk one-liner that should work for your case.

    awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
    
  • 2020-12-05 01:53

    Another variation :-)

    awk '!/>/{printf( "%s", $0);next}
         NR>1{printf( "\n")} 
         END {printf"\n"}
         7' YourFile
    