Remove line breaks in a FASTA file

前端未结

关注

 9  1240

I have a fasta file where the sequences are broken up with newlines. I\'d like to remove the newlines. Here\'s an example of my file:

>accession1
ATGGCC


                      
              相关标签:


      
      
        
          9条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  难免孤独        
                
              
                            
                2020-12-05 01:54
              
            
            
                                                                       
You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files

bioawk -c fastx '{ gsub(/\n/,"",seq); print ">"$name; print $seq }' file.fasta


Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language",
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
. I'm not sure if this version is compatible with POSIX.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  既然无缘        
                
              
                            
                2020-12-05 02:01
              
            
            
                                                                       
The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:

 awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file


Explanation:

For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.

Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:

awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  隐瞒了意图╮        
                
              
                            
                2020-12-05 02:06
              
            
            
                                                                       
Use this Perl one-liner, which does all of the common reformatting that is necessary in this and similar cases: removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. Note that unlike some of the other answers, this properly handles leading and trailing whitespace/newlines in the file:
# Create the input for testing:

cat > test_unwrap_in.fa <<EOF

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT

ACGT

ACGT

>seq3 without blanks or newlines
ACGTACGTACGT

EOF

# Reformat with Perl:

perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' test_unwrap_in.fa > test_unwrap_out.fa

Output:
>seq1 with blanks
ACGTACGTACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT

The Perl one-liner uses these command line flags:

-e : Tells Perl to look for code in-line, instead of in a file.

-n : Loop over the input one line at a time, assigning it to $_ by default.
chomp : Remove the input line separator (\n on *NIX).

if ( /^>/ ) : Test if the current line is a sequence header line.

$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.

END { print "\n"; } : Print the final newline after the last sequence.

s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复