I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi
I think the code below is simple and convenient.
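
    in_pattern = ""   # sequence to search for (e.g. one 25-mer)
    in_genome = ""    # genome sequence to search in
    in_mistake = 0    # d: maximum number of mismatches allowed
    kmer = len(in_pattern)

    def FindMistake(v):
        # True if v differs from in_pattern in at most in_mistake positions
        mistake = 0
        for i in range(kmer):
            if v[i] != in_pattern[i]:
                mistake += 1
                if mistake > in_mistake:
                    return False  # too many mismatches, stop early
        return True

    # Slide a window of length kmer along the genome and
    # collect the start positions of all acceptable matches
    out_result = ""
    for i in range(len(in_genome) - kmer + 1):
        v = in_genome[i:i + kmer]
        if FindMistake(v):
            out_result += str(i) + " "

    print(out_result)
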
You can easily plug in the genome and the sequences you want to check, and set the maximum number of mismatches allowed.
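To check many sequences against the same genome, the same sliding-window idea can be wrapped in a function. This is only a minimal sketch: the name `approximate_matches`, its parameters, and the toy genome are made up for illustration, and it counts substitutions only (no insertions or deletions).

```python
def approximate_matches(pattern, genome, max_mismatches):
    """Return 0-based start positions where pattern occurs in genome
    with at most max_mismatches substitutions."""
    k = len(pattern)
    positions = []
    for start in range(len(genome) - k + 1):
        mismatches = 0
        for a, b in zip(pattern, genome[start:start + k]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, skip this window
        else:
            # the inner loop finished without break: window is acceptable
            positions.append(start)
    return positions


# Toy data for illustration only (not a real genome)
toy_genome = "ACGTACGTTACGAACGT"
print(approximate_matches("ACGT", toy_genome, 0))  # [0, 4, 13]
print(approximate_matches("ACGT", toy_genome, 1))  # [0, 4, 9, 13]
```

For a list of sequences you would call the function once per sequence. Each call scans the whole genome, so with a very long list this will probably be slow.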