Extract sequences from a FASTA file based on entries in a separate file

轻奢々 2021-01-01 08:13

I have two files.

File 1: a FASTA file with gene sequences, formatted like this example:

>PITG_00002 | Phytophthora infestans T30-4 conserved hypo         


        
1 answer
  • 2021-01-01 08:28

    Try this:

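    f2 = open('accessionids.txt','r')      # File 2: one Accession ID per line
    f1 = open('fasta.txt','r')             # File 1: the FASTA file
    f3 = open('fasta_parsed.txt','w')      # output: matching records only

    # Build a lookup of the wanted Accession IDs (the value is unused).
    AI_DICT = {}
    for line in f2:
        # strip() drops the newline without cutting a character off a
        # final line that has no trailing newline
        AI_DICT[line.strip()] = 1

    skip = 0
    for line in f1:
        if line[0] == '>':
            # Header line, e.g. '>PITG_00002 | description'
            _splitline = line.split('|')
            accessorIDWithArrow = _splitline[0]
            # Drop the leading '>' and the space before the '|'
            accessorID = accessorIDWithArrow[1:-1]
            if accessorID in AI_DICT:
                f3.write(line)
                skip = 0
            else:
                skip = 1
        else:
            # Sequence line: keep it only if its header matched
            if not skip:
                f3.write(line)

    f1.close()
    f2.close()
    f3.close()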
    

    To briefly explain what's going on here: accessionids.txt is your File 2 and fasta.txt is your File 1. You'll need to replace these filenames in the code with your actual filenames.

    First, we create a dictionary (sometimes referred to as a hash or associative array) and for every Accession ID in File 2 we create an entry where the key is the Accession ID and the value is set to 1 (not that the value really matters in this case).
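    Since the value stored in the dictionary is never used, a set gives the same fast membership test with less ceremony. The sketch below is one way to do it, not part of the original answer: the helper name load_accession_ids is made up for illustration, and every ID other than PITG_00002 is a hypothetical placeholder. It also skips blank lines and copes with a last line that has no trailing newline.

```python
import io

def load_accession_ids(handle):
    """Return the set of non-empty, whitespace-stripped lines in handle."""
    return {line.strip() for line in handle if line.strip()}

# Hypothetical File 2 contents, held in memory instead of on disk.
sample_ids = io.StringIO("PITG_00002\nPITG_00010\n\nPITG_00023")
wanted = load_accession_ids(sample_ids)

print("PITG_00010" in wanted)  # True
print("PITG_99999" in wanted)  # False
```

    With a real file, the same helper can be called as load_accession_ids(open('accessionids.txt')), ideally inside a with block.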

    Next we go through File 1 line by line. If a line starts with > then we know that it contains an Accession ID. We split that line on the | character, since every line with an Accession ID has a | in it, and take the first part of the split, _splitline[0]. We then use accessorIDWithArrow[1:-1] to chop off the first and last characters of that string, which are the > symbol at the front and a blank space at the end. Note that this assumes there is exactly one space between the Accession ID and the |.
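    The header parsing can be made a little more forgiving by stripping whitespace instead of slicing off a fixed last character. This is a sketch under that assumption, with accession_from_header as a made-up helper name. It still expects the ID to be everything between the > and the first |, but it no longer depends on how many spaces surround the |.

```python
def accession_from_header(header):
    """Extract the accession ID from a FASTA header such as '>ID | description'."""
    # Drop the leading '>', keep the text before the first '|', trim whitespace.
    return header[1:].split('|', 1)[0].strip()

header = ">PITG_00002 | Phytophthora infestans T30-4 conserved hypo"
print(accession_from_header(header))  # PITG_00002
```

    A header with no space before the | (for example '>PITG_00002|...') would lose the last character of its ID under the [1:-1] slice, whereas strip() leaves it intact.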

    At this point, accessorID now contains the Accession ID in the format that we expect from File 2.

    Next, we check if the dictionary we created and populated earlier has this Accession ID defined as a key. If it does, we immediately write the line with the Accession ID to a new file, fasta_parsed.txt, and set/reset the skip 'flag' variable to 0. The else statement containing the if not skip segment will then allow subsequent lines associated with the Accession ID that we found to be printed to the fasta_parsed.txt file.

    For an Accession ID in File 1 that is not found in the dictionary (that is, not in File 2), we don't write the line to fasta_parsed.txt and we set the skip flag to 1. Thus, until another Accession ID is found in File 1 that also exists in File 2, all subsequent lines will be skipped.
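    The flag logic can also be written as a small generator, which makes it easy to try out on in-memory data before pointing it at real files. This is only a sketch: filter_fasta is a made-up name, the flag is inverted (keep rather than skip), and the two records below are hypothetical sample data.

```python
import io

def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every record whose accession ID is in wanted_ids."""
    keep = False  # becomes True while we are inside a wanted record
    for line in fasta_lines:
        if line.startswith('>'):
            # Header line: decide whether this record is wanted.
            accession = line[1:].split('|', 1)[0].strip()
            keep = accession in wanted_ids
        if keep:
            yield line

# Hypothetical File 1 contents with two records.
fasta = io.StringIO(
    ">PITG_00002 | hypothetical record A\n"
    "ATGCATGC\n"
    "GGCC\n"
    ">PITG_00003 | hypothetical record B\n"
    "TTTTAAAA\n"
)
result = "".join(filter_fasta(fasta, {"PITG_00002"}))
print(result, end="")
```

    Only the PITG_00002 header and its two sequence lines are printed. To write the result to a file, the generator can be passed to writelines() on an output file opened in a with block.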
