I have two files.
File 1: a FASTA file with gene sequences, formatted like this example:
>PITG_00002 | Phytophthora infestans T30-4 conserved hypo
Try this:
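# File 2: one accession ID per line
f2 = open('accessionids.txt', 'r')
# File 1: the FASTA file to filter
f1 = open('fasta.txt', 'r')
# Output: only the records whose ID appears in File 2
f3 = open('fasta_parsed.txt', 'w')

AI_DICT = {}
for line in f2:
    # strip() drops the newline without cutting into an ID
    # on a final line that has no trailing newline
    AI_DICT[line.strip()] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        # drop the leading '>' and the trailing space before '|'
        accessorID = accessorIDWithArrow[1:-1]
        # print(accessorID)
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()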
To briefly explain what's going on here: accessionids.txt is your File 2, whereas fasta.txt is your File 1. You'll need to replace these filenames in the code with your actual filenames.
First, we create a dictionary (sometimes referred to as a hash or associative array) and for every Accession ID in File 2 we create an entry where the key is the Accession ID and the value is set to 1 (not that the value really matters in this case).
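Since the value never matters, a set would do the same job as the dictionary. Here is a minimal sketch of that lookup step, assuming File 2 holds one Accession ID per line. The helper name load_accession_ids and every ID other than PITG_00002 are made up for illustration, and an in-memory StringIO stands in for the real file.

```python
import io

def load_accession_ids(handle):
    """Collect one accession ID per line, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}

# Hypothetical stand-in for File 2; note the blank line and the
# missing newline after the last ID.
ids = load_accession_ids(io.StringIO("PITG_00002\nPITG_00004\n\nPITG_00010"))
print(sorted(ids))
```

Membership tests (`"PITG_00002" in ids`) are just as fast as with the dictionary, and there is no dummy value to carry around.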
Next, we read File 1 line by line. If a line starts with > then we know that it is a header line containing an Accession ID. We split that line on the | character, since every line with an Accession ID will have a | in it, and take the first part of the split, _splitline[0]. We then use accessorIDWithArrow[1:-1] to chop off the first and last characters of that string, which are the > symbol at the front and a blank space at the end. At this point, accessorID contains the Accession ID in the format that we expect from File 2.
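The [1:-1] slice relies on there being exactly one space before the | character. If your headers might not always look like that, a slightly more forgiving version of the same step could be sketched as follows. The function name accession_id is hypothetical, and the second header is a made-up variant with no spaces.

```python
def accession_id(header):
    """Return the accession ID from a FASTA header line."""
    # Drop the leading '>', keep the text before the first '|',
    # and trim whatever whitespace surrounds it.
    return header[1:].split('|', 1)[0].strip()

header = ">PITG_00002 | Phytophthora infestans T30-4 conserved hypo\n"
print(accession_id(header))            # PITG_00002
print(accession_id(">PITG_00002|x"))   # PITG_00002
```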
Next, we check whether the dictionary we populated earlier has this Accession ID as a key. If it does, we immediately write the header line to a new file, fasta_parsed.txt, and reset the skip 'flag' variable to 0. The else branch containing the if not skip check then allows the subsequent sequence lines belonging to that Accession ID to be written to fasta_parsed.txt.
For an Accession ID from File 1 that is not found in the dictionary (that is, not in File 2), we don't write the line to fasta_parsed.txt and we set the skip flag to 1. Thus, until another Accession ID is found in File 1 that also exists in File 2, all subsequent lines will be skipped.
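If you'd rather have the same skip-flag logic in a reusable form, it could be sketched as a small generator like the one below. This is only one way to arrange it, not a replacement for the script above: the name filter_fasta is hypothetical, the PITG_00003 record and both sequences are made-up test data, and StringIO stands in for the real FASTA file.

```python
import io

def filter_fasta(fasta_lines, wanted_ids):
    """Yield only the lines of records whose accession ID is wanted."""
    keep = False  # nothing is kept until a wanted header is seen
    for line in fasta_lines:
        if line.startswith('>'):
            record_id = line[1:].split('|', 1)[0].strip()
            keep = record_id in wanted_ids
        if keep:
            yield line

# Hypothetical stand-in for File 1.
fasta = io.StringIO(
    ">PITG_00002 | Phytophthora infestans T30-4 conserved hypo\n"
    "ATGCATGC\n"
    "GGCC\n"
    ">PITG_00003 | made-up record\n"
    "TTTT\n"
)
kept = list(filter_fasta(fasta, {"PITG_00002"}))
print("".join(kept), end="")
```

With real files you would pass the open FASTA file handle as fasta_lines and write each yielded line to the output file.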