Extract sequences from a FASTA file based on entries in a separate file

I have two files.

File 1: a FASTA file with gene sequences, formatted like this example:

>PITG_00002 | Phytophthora infestans T30-4 conserved hypo


                      
1 answer

后悔当初, 2021-01-01 08:28

Try this:    

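# accessionids.txt is File 2 (one Accession ID per line), fasta.txt is File 1.
# The with statement closes all three files, even if an error occurs.
with open('accessionids.txt', 'r') as f2, \
     open('fasta.txt', 'r') as f1, \
     open('fasta_parsed.txt', 'w') as f3:

    # Lookup table of the wanted Accession IDs (the value itself is unused).
    AI_DICT = {}
    for line in f2:
        # strip() also copes with a last line that has no trailing newline
        AI_DICT[line.strip()] = 1

    skip = 1  # write nothing until a wanted header has been seen
    for line in f1:
        if line[0] == '>':
            # Header line, e.g. ">PITG_00002 | Phytophthora infestans ..."
            _splitline = line.split('|')
            accessorIDWithArrow = _splitline[0]
            # Drop the leading '>' and the trailing space before the '|'
            accessorID = accessorIDWithArrow[1:-1]
            if accessorID in AI_DICT:
                f3.write(line)
                skip = 0
            else:
                skip = 1
        else:
            # Sequence line: keep it only if its header was wanted
            if not skip:
                f3.write(line)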


To briefly explain what's going on here... accessionids.txt is your File 2, whereas fasta.txt is your File 1. Obviously you'll need to replace these filenames with your actual filenames within the code.

First, we create a dictionary (sometimes referred to as a hash or associative array) and for every Accession ID in File 2 we create an entry where the key is the Accession ID and the value is set to 1 (not that the value really matters in this case).
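
Since the stored value is never read, a set expresses the same lookup a little more directly. Here is a minimal sketch, assuming File 2 holds one Accession ID per line; the helper name load_ids is made up for illustration and does not appear in the code above:

```python
def load_ids(path):
    """Return the Accession IDs listed in `path`, one per line, as a set."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            accession = line.strip()  # drops the newline and stray spaces
            if accession:             # ignore blank lines
                ids.add(accession)
    return ids
```

Membership tests such as `'PITG_00002' in load_ids('accessionids.txt')` then behave just like the dictionary lookup.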

Next we look in File 1 and again look at each line in that file. If the line in the file starts with > then we know that it contains an Accession ID. We take that line and split it along the | since every line with an Accession ID will have a | in the string. Next, take the first part of the split as specified by _splitline[0]. We use accessorIDWithArrow[1:-1] to chop off the first and last characters in the string which are the > symbol in the front and a blank space in the rear. 

At this point, accessorID now contains the Accession ID in the format that we expect from File 2.
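
Note that the [1:-1] slice relies on there being exactly one character (the space) between the ID and the |. A slightly more forgiving sketch of the same step follows, assuming the ID is everything between > and the first |; parse_accession is a hypothetical helper name:

```python
def parse_accession(header):
    """Extract the Accession ID from a FASTA header line."""
    if not header.startswith('>'):
        raise ValueError('not a FASTA header: %r' % header)
    # Text before the first '|', minus the leading '>' and any padding.
    return header[1:].split('|', 1)[0].strip()
```

With this version, headers with no space before the |, or with no | at all, still produce a clean ID.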

Next, we check if the dictionary we created and populated earlier has this Accession ID defined as a key. If it does, we immediately write the line with the Accession ID to a new file, fasta_parsed.txt, and set/reset the skip 'flag' variable to 0. The else statement containing the if not skip segment will then allow subsequent lines associated with the Accession ID that we found to be printed to the fasta_parsed.txt file. 

For an Accession ID in File 1 that is not found in the dictionary (that is, not listed in File 2), we don't write the line to fasta_parsed.txt and we set the skip flag to 1. Thus, until another Accession ID that exists in File 2 is found in File 1, all subsequent lines are skipped.
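
The skip-flag logic can also be written as a small generator, which makes it easy to try on a few lines without touching any files. This is only a sketch under the same assumptions about the header format; filter_fasta, its parameters, and the sample records are illustrative, not taken from the question:

```python
def filter_fasta(lines, wanted_ids):
    """Yield the lines of every FASTA record whose Accession ID is wanted."""
    skip = True  # nothing is kept until a wanted header is seen
    for line in lines:
        if line.startswith('>'):
            # Header line: decide whether this whole record is kept.
            accession = line[1:].split('|', 1)[0].strip()
            skip = accession not in wanted_ids
        if not skip:
            yield line


# Made-up sample data: two records, only the first one is wanted.
records = [
    '>PITG_00002 | example header A\n',
    'ATGCATGC\n',
    '>PITG_00003 | example header B\n',
    'GGGTTTAA\n',
]
kept = list(filter_fasta(records, {'PITG_00002'}))
```

To filter real files, pass the open FASTA file as lines and hand the result to the output file's writelines method.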
                                                                        
                                                        
            
            
              
                