I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi
You can use the TRE regex matching library, which supports "approximate matching". It also has bindings for Python, Perl and Haskell.
import tre
pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)
data = """
In addition to fundamental contributions in several branches of
theoretical computer science, Donnald Erwin Kuth is the creator of
the TeX computer typesetting system, the related METAFONT font
definition language and rendering system, and the Computer Modern
family of typefaces.
"""
fz = tre.Fuzzyness(maxerr = 3)
print fz
m = pt.search(data, fz)
if m:
    print m.groups()
    print m[0]
which will output
tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
((95, 113), (99, 108), (102, 108))
Donnald Erwin Kuth
http://en.wikipedia.org/wiki/TRE_%28computing%29
http://laurikari.net/tre/
This is quite old, but perhaps this simple solution could work. Loop through the sequence taking 25-character slices. Convert each slice to a numpy array. Compare it to the 25-character string (also as a numpy array). Sum the result of the comparison, and if the sum is 24, print out the position in the loop and the mismatch.
The next few lines show it working:
import numpy as np
a = ['A','B','C']
b = np.array(a)
b
array(['A', 'B', 'C'], dtype='<U1')
c = ['A','D','C']
d = np.array(c)
b==d
array([ True, False, True])
sum(b==d)
2
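A minimal sketch of the full loop described above, written for Python 3 and assuming the genome is already one unwrapped ASCII string. The function name `find_near_matches` and the `max_mismatches` parameter are made up for illustration. With a 25-character query and `max_mismatches=1`, a window is reported when at least 24 positions agree.

```python
import numpy as np

def find_near_matches(genome, pattern, max_mismatches=1):
    """Return (start, mismatch_count) for every window close to pattern."""
    # View both strings as arrays of byte values so == compares elementwise.
    g = np.frombuffer(genome.encode("ascii"), dtype=np.uint8)
    p = np.frombuffer(pattern.encode("ascii"), dtype=np.uint8)
    n = len(p)
    hits = []
    for start in range(len(g) - n + 1):
        # Count positions where the slice and the pattern agree.
        agree = int(np.count_nonzero(g[start:start + n] == p))
        if n - agree <= max_mismatches:
            hits.append((start, n - agree))
    return hits

# Tiny example: one exact hit and one hit with a single mismatch.
hits = find_near_matches("GGATCATCGTT", "ATCAT", max_mismatches=1)
print(hits)
```

The Python-level loop over every start position will be slow on a whole genome; the sketch is meant to show the comparison step, not to be fast.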
I guess this may come a bit late, but there is a tool named PatMaN that does exactly what you want: http://bioinf.eva.mpg.de/patman/
You could use Python's built-in capability to do the search with regular expression matching.
The re module in Python: http://docs.python.org/library/re.html
Regular expression primer: http://www.regular-expressions.info/
I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close, a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158 MB in size. I used the following pyparsing expression to extract the gene sequences; it took just under 2 minutes:
from pyparsing import (Word, nums, printables, Group, SkipTo, LineEnd,
                       Combine, OneOrMore, Token)

fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
fastasrc = open(fname).read() # yes! just read the whole dang 158Mb!
"""
Sample header:
>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
"""
integer = Word(nums).setParseAction(lambda t:int(t[0]))
genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") +
                "length=" + integer("genelen") + LineEnd() +
                Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))
# read gene data from .fasta file - takes just under a couple of minutes
genedata = OneOrMore(genebit).parseString(fastasrc)
(Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)
Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:
class CloseMatch(Token):
    def __init__(self, seq, maxMismatches=1):
        super(CloseMatch,self).__init__()
        self.name = seq
        self.sequence = seq
        self.maxMismatches = maxMismatches
        self.errmsg = "Expected " + self.sequence
        self.mayIndexError = False
        self.mayReturnEmpty = False

    def parseImpl( self, instring, loc, doActions=True ):
        start = loc
        instrlen = len(instring)
        maxloc = start + len(self.sequence)

        if maxloc <= instrlen:
            seq = self.sequence
            seqloc = 0
            mismatches = []
            throwException = False
            done = False
            while loc < maxloc and not done:
                if instring[loc] != seq[seqloc]:
                    mismatches.append(seqloc)
                    if len(mismatches) > self.maxMismatches:
                        throwException = True
                        done = True
                loc += 1
                seqloc += 1
        else:
            throwException = True

        if throwException:
            exc = self.myException
            exc.loc = loc
            exc.pstr = instring
            raise exc

        return loc, (instring[start:loc],mismatches)
For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)
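Stripped of the pyparsing machinery, the comparison inside `parseImpl` amounts to counting mismatches at one location and giving up as soon as the limit is exceeded. Here is a rough standalone sketch of that logic for Python 3; the function name `close_match_at` is hypothetical and is not part of pyparsing.

```python
def close_match_at(text, loc, seq, max_mismatches=1):
    """Compare seq against text starting at loc.

    Returns (matched_text, mismatch_positions), or None if the window
    runs past the end of text or has too many mismatches.
    """
    end = loc + len(seq)
    if end > len(text):
        return None
    mismatches = []
    for offset, expected in enumerate(seq):
        if text[loc + offset] != expected:
            mismatches.append(offset)
            # Bail out early once the mismatch budget is exceeded.
            if len(mismatches) > max_mismatches:
                return None
    return text[loc:end], mismatches

print(close_match_at("GGATCGT", 2, "ATCAT", max_mismatches=1))
print(close_match_at("GGATCGT", 2, "ATCAT", max_mismatches=0))
```

An exact match returns an empty mismatch list, the same convention as the class above.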
I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):
searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for t,startLoc,endLoc in searchseq.scanString(g.gene):
        matched, mismatches = t[0]
        print "MATCH:", searchseq.sequence
        print "FOUND:", matched
        if mismatches:
            print "      ", ''.join(' ' if i not in mismatches else '*'
                            for i,c in enumerate(searchseq.sequence))
        else:
            print "<exact match>"
        print "at location", startLoc
        print
    print
I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.
This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:
scf_1104442825154 (964)
------------------------
scf_1104442822828 (942)
------------------------
scf_1104442824510 (987)
------------------------
scf_1104442823180 (1065)
------------------------
...
I was getting discouraged, not to see any matches until:
scf_1104442823952 (1188)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAACGGAATCGAATGGAAT
                *      *
at location 33
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *
at location 175
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *
at location 474
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *
at location 617
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATAGAAT
                       *   *
at location 718
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGATTCGAATGGAAT
                    *  *
at location 896
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGTAT
                       *     *
at location 945
And finally my exact match at:
scf_1104442823584 (1448)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGACTCGAATGGAAT
                    *  *
at location 177
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCAAATGGAAT
                       *
at location 203
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *
at location 350
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAA
                       *       *
at location 523
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *
at location 822
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCTAATGGAAT
<exact match>
at location 848
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCGTCGAATGGAGTCTAATGGAAT
          *         *
at location 969
So while this didn't set any speed records, I got the job done, and found some 2-mismatch matches too, in case they might be of interest.
For comparison, here is an RE-based version, that finds 1-mismatch matches only:
import re
seqStr = "ATCATCGAATGGAATCTAATGGAAT"
searchSeqREStr = seqStr + '|' + \
    '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:]
             for i,c in enumerate(seqStr))
searchSeqRE = re.compile(searchSeqREStr)
for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for match in searchSeqRE.finditer(g.gene):
        print "MATCH:", seqStr
        print "FOUND:", match.group(0)
        print "at location", match.start()
        print
    print
(At first, I tried searching the raw FASTA file source itself, but was puzzled why there were so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the FASTA file output is wrapped at n characters.)
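One way to sidestep the line-break problem without pyparsing is to join the wrapped lines of each record before searching. The following is a plain Python 3 sketch under that assumption; `read_fasta` is a hypothetical helper, not the parser used above, and it keeps the whole header line rather than splitting out the id and length.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) with wrapped lines joined."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

sample = [">seq1 | length=10", "ATCAT", "CGAAT", ">seq2", "GG", "", "NN"]
print(read_fasta(sample))
```

In practice the `lines` argument could be an open file object, since iterating over a file yields its lines.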
So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.
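To see what the generated expression does, here is a small self-contained Python 3 sketch of the same construction: the exact sequence, or the sequence with exactly one position replaced by a character class of the other letters. The helper name `one_mismatch_regex` and the short test strings are made up for illustration.

```python
import re

def one_mismatch_regex(seq, alphabet="ACGTN"):
    """Compile a regex matching seq exactly or with one substitution."""
    alternatives = [seq]
    for i, c in enumerate(seq):
        # Any alphabet letter except the expected one at position i.
        others = alphabet.replace(c, "")
        alternatives.append(seq[:i] + "[" + others + "]" + seq[i + 1:])
    return re.compile("|".join(alternatives))

pattern = one_mismatch_regex("ATCAT")
hits = [(m.start(), m.group(0)) for m in pattern.finditer("GGATCATCGTTATGAT")]
print(hits)
```

Note that `finditer` returns non-overlapping matches only: in this example the window `ATCGT` at position 5 differs from the query in one place, but it overlaps the exact match at position 2 and so is not reported.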
The Python regex library supports fuzzy regular expression matching. One advantage over TRE is that it can find all matches of a regular expression in the text (it supports overlapping matches as well).
import regex
m=regex.findall("AA", "CAG")
>>> []
m=regex.findall("(AA){e<=1}", "CAAG") # means allow up to 1 error
m
>>> ['CA', 'AG']