I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 of them and need to look for each sequence in the entire genome (of the Toxoplasma gondii parasite).
>>> import re
>>> seq="AGCCTCCCATGATTGAACAGATCAT"
>>> genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT..."
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> seq_re.findall(genome) # list of matches
[]
>>> seq_re.search(genome) # None if not found, otherwise a match object
This one stops at the first match, so it may be a bit faster when there are multiple matches.
>>> print "found" if any(seq_re.finditer(genome)) else "not found"
not found
>>> print "found" if seq_re.search(genome) else "not found"
not found
>>> seq="CAT"
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> print "found" if seq_re.search(genome) else "not found"
found
For a genome of length 10,000,000, you are looking at about 2.5 days for a single thread to scan all 230,000 sequences, so you may want to split the task across a few cores/CPUs.
You can always start implementing a more efficient algorithm while this one is running :)
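If you do want to spread the work over several cores, a minimal sketch with the standard multiprocessing module might look like this (the names `genome` and `sequences` are placeholders for your genome string and your list of 230,000 25-mers; this relies on the workers inheriting those globals at fork time on Unix):

import re
from multiprocessing import Pool

genome = "CATGGGAGGCTTGCGGAGCCTGAGG..."    # the full genome string goes here
sequences = ["AGCCTCCCATGATTGAACAGATCAT"]   # the 230,000 25-mers go here

def compile_pattern(seq):
    # one alternative per position, each '.' allowing a single mismatch
    return re.compile('|'.join(seq[:i] + '.' + seq[i+1:] for i in range(len(seq))))

def scan(seqs):
    # return the subset of seqs found in the genome with at most 1 mismatch
    return [s for s in seqs if compile_pattern(s).search(genome)]

if __name__ == '__main__':
    chunks = [sequences[i::4] for i in range(4)]    # four roughly equal slices
    pool = Pool(processes=4)
    found = sum(pool.map(scan, chunks), [])         # flatten the per-worker lists
    print len(found), "sequences found"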
If you also wish to search for single dropped or added elements, change the regexp to this:
>>> seq_re=re.compile('|'.join(seq[:i]+'.{0,2}'+seq[i+1:] for i in range(len(seq))))
This hints at the longest common subsequence problem. The problem with string similarity here is that you need to test against a continuous string of 230,000 sequences; so if you compare one of your 25-character sequences to the continuous string, you'll get a very low similarity.
If you compute the longest common subsequence between one of your 25-character sequences and the continuous string, you'll know it is in the string if the LCS length equals the sequence length (25).
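For what it's worth, a plain dynamic-programming sketch of that length-of-LCS check could look like this (assuming `genome` is the genome string; note this tests subsequence containment, i.e. characters in order but not necessarily contiguous, and pure Python will be slow on a 10 Mb string):

def lcs_length(a, b):
    # classic LCS dynamic programme, keeping only two rows of the table
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            curr[j] = prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

seq = "AGCCTCCCATGATTGAACAGATCAT"
print lcs_length(seq, genome) == len(seq)   # True if seq is a subsequence of genome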
You could make a trie out of all the different sequences that you want to match. The tricky part is then writing a depth-first search function down the trie that allows at most one mismatch.
The advantage of this method is that you are searching through all of the sequences at once. This will save you a lot of comparisons. For instance, when you start at the top node and go down the 'A' branch, you have just saved yourself many thousands of comparisons, because you have instantly matched against all sequences that start with 'A'. For a different argument, consider a slice of the genome that matches exactly with a given sequence. If you have a different sequence in your list of sequences that differs only in the last symbol, then using a trie has just saved you 23 comparison operations.
Here is one way of implementing this. Convert 'A', 'C', 'T', 'G' to 0, 1, 2, 3 or a variant of that. Then use tuples of tuples as your structure for your trie. At each node, the first element corresponds to 'A', the second to 'C' and so on. If 'A' is a branch of this node, then there is another tuple of 4 elements as the first item of this node's tuple. If there isn't an 'A' branch, then set the first item to 0. At the bottom of the trie are nodes that have the id of that sequence so that it can be put into the list of matches.
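As a rough sketch of that construction (names are mine, not from the answer; nested lists are used instead of tuples because they are easier to build incrementally, and the ids are stored 1-based so a real id is never confused with the 0 "no branch" marker):

CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def encode(seq):
    # 'A','C','G','T' -> 0,1,2,3, matching the order used in the trie nodes
    return [CODE[c] for c in seq]

def build_trie(sequences):
    # nested 4-element lists; 0 means "no branch"; the entry reached after
    # the 25th character holds the (1-based) id of the sequence
    root = [0, 0, 0, 0]
    for seq_id, seq in enumerate(sequences):
        node = root
        coded = encode(seq)
        for x in coded[:-1]:
            if not node[x]:
                node[x] = [0, 0, 0, 0]
            node = node[x]
        node[coded[-1]] = seq_id + 1
    return root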
Here are recursive search functions allowing one mismatch for this sort of trie:
def searchnomismatch(node, genome, i):
    if i == 25:
        addtomatches(node)
    else:
        for x in range(4):
            if node[x]:
                if x == genome[i]:
                    searchnomismatch(node[x], genome, i + 1)
                else:
                    searchmismatch(node[x], genome, i + 1, i)

def searchmismatch(node, genome, i, where):
    if i == 25:
        addtomatches(node, where)
    else:
        if node[genome[i]]:
            searchmismatch(node[genome[i]], genome, i + 1, where)
Here, I start out the search with something like
searchnomismatch(trie,genome[ind:ind+25],0)
and addtomatches is something similar to
def addtomatches(id, where=-1):
    matches.append((id, where))
where equal to -1 means there wasn't a mismatch. Anyway, I hope that I was clear enough so that you get the picture.
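For completeness, a rough driver over the genome might look like this (my own sketch, assuming the `encode`/`build_trie` helpers sketched above; note the window has to be in the same 0..3 encoding as the trie):

trie = build_trie(sequences)            # sequences: the list of 230,000 25-mers
matches = []                            # filled in by addtomatches
coded_genome = encode(genome)           # same 0..3 encoding used for the trie
for ind in xrange(len(coded_genome) - 24):
    searchnomismatch(trie, coded_genome[ind:ind+25], 0)

In practice you would probably also record ind inside addtomatches so you know where in the genome each hit occurred, and remember that the ids stored by the build sketch above are 1-based.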
You might find the various routines in Python-Levenshtein of some use.
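For example, if I remember the module's interface correctly, Levenshtein.hamming gives the number of mismatching positions between two equal-length strings (the offset 1000 here is arbitrary, just for illustration):

import Levenshtein

seq = "AGCCTCCCATGATTGAACAGATCAT"
window = genome[1000:1025]                  # an arbitrary 25-character window
if Levenshtein.hamming(seq, window) <= 1:   # number of mismatching positions
    print "match at offset 1000"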
Before you read on, have you looked at biopython?
It appears that you want to find approximate matches with at most one substitution error, and zero insertion/deletion errors, i.e. a Hamming distance of at most 1.
If you have a Hamming distance match function (see e.g. the link provided by Ignacio), you could use it like this to do a search for the first match:
any(Hamming_distance(genome[x:x+25], sequence) == 1 for x in xrange(len(genome) - 24))
but this would be rather slow, because (1) the Hamming distance function would keep on grinding after the 2nd substitution error (2) after failure, it advances the cursor by one rather than skipping ahead based on what it saw (like a Boyer-Moore search does).
You can overcome (1) with a function like this:
def Hamming_check_0_or_1(genome, posn, sequence):
    errors = 0
    for i in xrange(25):
        if genome[posn+i] != sequence[i]:
            errors += 1
            if errors >= 2:
                return errors
    return errors
Note: that's intentionally not Pythonic, it's C-ish, because you'd need to use C (perhaps via Cython) to get reasonable speed.
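Used from pure Python (e.g. as a reference to check a faster C/Cython version against), the scan over the genome would look like this (assuming `genome` and `seq` are the genome string and one 25-mer):

seq = "AGCCTCCCATGATTGAACAGATCAT"
hits = [posn for posn in xrange(len(genome) - 24)
        if Hamming_check_0_or_1(genome, posn, seq) <= 1]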
Some work on bit-parallel approximate Levenshtein searches with skipping has been done by Navarro and Raffinot (google "Navarro Raffinot nrgrep") and this could be adapted to Hamming searches. Note that bit-parallel methods have limitations on length of query string and alphabet size but yours are 25 and 4 respectively so no problems there. Update: skipping probably not much help with an alphabet size of 4.
When you google for Hamming distance search, you will notice lots of stuff about implementing it in hardware, and not much in software. This is a big hint that maybe whatever algorithm you come up with ought to be implemented in C or some other compiled language.
Update: Working code for a bit-parallel method
I've also supplied a simplistic method for helping with the correctness checking, and I've packaged a variation of Paul's re code for some comparisons. Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case.
The bit-parallel method has these features: guaranteed linear behaviour O(N), where N is the text length. Note that the naive method is O(NM), as is the regex method (M is the pattern length). A Boyer-Moore-style method would be worst case O(NM) and expected O(N). Also, the bit-parallel method can be used easily when input has to be buffered: it can be fed a byte or a megabyte at a time; no look-ahead, no problems with buffer boundaries. The big advantage: the speed of that simple per-input-byte code when coded in C.
Downsides: the pattern length is effectively limited to the number of bits in a fast register e.g. 32 or 64. In this case the pattern length is 25; no problem. It uses extra memory (S_table) proportional to the number of distinct characters in the pattern. In this case, the "alphabet size" is only 4; no problem.
Details from this technical report. The algorithm there is for approximate search using Levenshtein distance. To convert to using Hamming distance, I simply (!) removed the pieces of statement 2.1 that handle insertion and deletion. You'll notice lots of references to "R" with a "d" superscript. "d" is distance. We need only 0 and 1. These "R"s become the R0 and R1 variables in the code below.
# coding: ascii

from collections import defaultdict
import re

_DEBUG = 0


# "Fast Text Searching with Errors" by Sun Wu and Udi Manber
# TR 91-11, Dept of Computer Science, University of Arizona, June 1991.
# http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854

def WM_approx_Ham1_search(pattern, text):
    """Generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    S_table = defaultdict(int)
    for i, c in enumerate(pattern):
        S_table[c] |= 1 << i
    R0 = 0
    R1 = 0
    mask = 1 << (m - 1)
    for j, c in enumerate(text):
        S = S_table[c]
        shR0 = (R0 << 1) | 1
        R0 = shR0 & S
        R1 = ((R1 << 1) | 1) & S | shR0
        if _DEBUG:
            print "j= %2d msk=%s S=%s R0=%s R1=%s" \
                % tuple([j] + map(bitstr, [mask, S, R0, R1]))
        if R0 & mask:  # exact match
            yield 0, j - m + 1
        elif R1 & mask:  # match with one substitution
            yield 1, j - m + 1

if _DEBUG:

    def bitstr(num, mlen=8):
        wstr = ""
        for i in xrange(mlen):
            if num & 1:
                wstr = "1" + wstr
            else:
                wstr = "0" + wstr
            num >>= 1
        return wstr

def Ham_dist(s1, s2):
    """Calculate Hamming distance between 2 sequences."""
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def long_check(pattern, text):
    """Naively and understandably generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    for i in xrange(len(text) - m + 1):
        d = Ham_dist(pattern, text[i:i+m])
        if d < 2:
            yield d, i

def Paul_McGuire_regex(pattern, text):
    searchSeqREStr = (
        '('
        + pattern
        + ')|('
        + ')|('.join(
            pattern[:i]
            + "[ACTGN]".replace(c, '')
            + pattern[i+1:]
            for i, c in enumerate(pattern)
            )
        + ')'
        )
    searchSeqRE = re.compile(searchSeqREStr)
    for match in searchSeqRE.finditer(text):
        locn = match.start()
        dist = int(bool(match.lastindex - 1))
        yield dist, locn


if __name__ == "__main__":

    genome1 = "TTTACGTAAACTAAACTGTAA"
    #          0123456789012345678901234
    #                    1         2

    tests = [
        (genome1, "ACGT ATGT ACTA ATCG TTTT ATTA TTTA"),
        ("T" * 10, "TTTT"),
        ("ACGTCGTAAAA", "TCGT"),  # partial match can shadow an exact match
        ]

    nfailed = 0
    for genome, patterns in tests:
        print "genome:", genome
        for pattern in patterns.split():
            print pattern
            a1 = list(WM_approx_Ham1_search(pattern, genome))
            a2 = list(long_check(pattern, genome))
            a3 = list(Paul_McGuire_regex(pattern, genome))
            print a1
            print a2
            print a3
            print a1 == a2, a2 == a3
            nfailed += (a1 != a2 or a2 != a3)
    print "***", nfailed
I tried some of the solutions, but as already written they are slow when dealing with a large number of sequences (strings).
I came up with using bowtie and mapping the substring of interest (soi) against a reference file which contains the strings in FASTA format. You can provide the number of allowed mismatches (0..3) and you get back the strings to which the soi mapped, given the allowed mismatches. This works well and is pretty fast.
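For illustration, one possible workflow along those lines, driven from Python (the file names, and treating the genome as the reference and the 25-mers as the "reads", are my assumptions; it simply shells out to bowtie-build and bowtie, which must be installed):

import subprocess

# write the 25-mers as FASTA "reads" (sequences is the list of 230,000 25-mers)
with open("queries.fa", "w") as f:
    for i, s in enumerate(sequences):
        f.write(">seq%d\n%s\n" % (i, s))

# build the index once from the genome FASTA, then map with up to one
# mismatch (-v 1), reporting all alignments (-a); -f marks the reads as FASTA
subprocess.check_call(["bowtie-build", "genome.fa", "genome_index"])
subprocess.check_call(["bowtie", "-f", "-v", "1", "-a",
                       "genome_index", "queries.fa", "hits.txt"])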