Search for string allowing for one mismatch in any location of the string

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi

13 Answers

萌比男神i

2020-11-30 03:08

You can use the TRE regex matching library for approximate matching. It also has bindings for Python, Perl and Haskell.

import tre

pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)
data = """
In addition to fundamental contributions in several branches of
theoretical computer science, Donnald Erwin Kuth is the creator of
the TeX computer typesetting system, the related METAFONT font
definition language and rendering system, and the Computer Modern
family of typefaces.
"""

fz = tre.Fuzzyness(maxerr = 3)
print fz
m = pt.search(data, fz)

if m:
    print m.groups()
    print m[0]

which will output

tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
((95, 113), (99, 108), (102, 108))
Donnald Erwin Kuth

http://en.wikipedia.org/wiki/TRE_%28computing%29

http://laurikari.net/tre/

一生所求

2020-11-30 03:09

This is quite old, but perhaps this simple solution could work. Loop through the sequence taking 25-character slices. Convert each slice to a NumPy array and compare it to the 25-character search string (also as a NumPy array). Sum the result of the comparison; if the sum is 24, exactly one position differs, so print out the position in the loop and the mismatch.

The next few lines show it working:

import numpy as np

a = ['A','B','C']

b = np.array(a)

b

array(['A', 'B', 'C'], dtype='

c = ['A','D','C']

d = np.array(c)

b==d

array([ True, False, True])

sum(b==d)

2
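The loop described above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: it uses plain Python with `zip` instead of NumPy arrays, and the function name `find_near_matches` and its `max_mismatches` parameter are made up for the example. It compares every window in full, so it is only practical for small inputs.

```python
def find_near_matches(genome, query, max_mismatches=1):
    """Return (position, mismatch_indexes) for every window of the genome
    that differs from the query in at most max_mismatches positions."""
    width = len(query)
    hits = []
    for start in range(len(genome) - width + 1):
        window = genome[start:start + width]
        # Indexes (relative to the query) where window and query disagree
        mismatches = [i for i, (a, b) in enumerate(zip(window, query)) if a != b]
        if len(mismatches) <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy data in the spirit of the 'A','B','C' demo above
hits = find_near_matches("GGABCGGADCGG", "ABC")
```

Here `hits` contains one exact match (empty mismatch list) and one single-mismatch match, each with its start position.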

滥情空心

2020-11-30 03:11

I guess this may come a bit late, but there is a tool named PatMaN that does exactly what you want: http://bioinf.eva.mpg.de/patman/

长发绾君心

2020-11-30 03:12

You could use Python's built-in capability to do the search with regular expression matching.

re module in python http://docs.python.org/library/re.html

regular expression primer http://www.regular-expressions.info/
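As a rough sketch of how the standard `re` module could handle a single mismatch: build one alternative per position with that position replaced by `.`, and wrap the whole thing in a lookahead so that overlapping hits are all reported. The helper name `one_mismatch_regex` and the sample strings are assumptions made for illustration only.

```python
import re


def one_mismatch_regex(seq):
    """Compile a pattern matching seq with at most one substituted character."""
    # One alternative per position; "." stands in for the mismatching character
    # (it also matches the original character, so exact matches are covered).
    alternatives = [
        re.escape(seq[:i]) + "." + re.escape(seq[i + 1:])
        for i in range(len(seq))
    ]
    # The lookahead consumes no text, so overlapping matches are not skipped.
    return re.compile("(?=(" + "|".join(alternatives) + "))")


pattern = one_mismatch_regex("GATTACA")
found = [(m.start(), m.group(1)) for m in pattern.finditer("TTGATTACATTGATGACA")]
```

Each entry in `found` is a start position together with the text that matched there. The pattern grows with the length of the search sequence, so this is a convenience for short sequences rather than a fast search method.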

死守一世寂寞

2020-11-30 03:20

I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close, a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158Mb in size. I used the following pyparsing expression to extract the gene sequences; it took just under 2 minutes:

from pyparsing import (Word, nums, printables, Group, SkipTo, LineEnd,
                       Combine, OneOrMore, Token)

fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
fastasrc = open(fname).read()   # yes! just read the whole dang 158Mb!

"""
Sample header:
>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
"""
integer = Word(nums).setParseAction(lambda t:int(t[0]))
genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") + 
                "length=" + integer("genelen") + LineEnd() + 
                Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))

# read gene data from .fasta file - takes just under a couple of minutes
genedata = OneOrMore(genebit).parseString(fastasrc)

(Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)

Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:

class CloseMatch(Token):
    def __init__(self, seq, maxMismatches=1):
        super(CloseMatch,self).__init__()
        self.name = seq
        self.sequence = seq
        self.maxMismatches = maxMismatches
        self.errmsg = "Expected " + self.sequence
        self.mayIndexError = False
        self.mayReturnEmpty = False

    def parseImpl( self, instring, loc, doActions=True ):
        start = loc
        instrlen = len(instring)
        maxloc = start + len(self.sequence)

        if maxloc <= instrlen:
            seq = self.sequence
            seqloc = 0
            mismatches = []
            throwException = False
            done = False
            while loc < maxloc and not done:
                if instring[loc] != seq[seqloc]:
                    mismatches.append(seqloc)
                    if len(mismatches) > self.maxMismatches:
                        throwException = True
                        done = True
                loc += 1
                seqloc += 1
        else:
            throwException = True

        if throwException:
            exc = self.myException
            exc.loc = loc
            exc.pstr = instring
            raise exc

        return loc, (instring[start:loc],mismatches)

For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)

I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):

searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for t,startLoc,endLoc in searchseq.scanString(g.gene):
        matched, mismatches = t[0]
        print "MATCH:", searchseq.sequence
        print "FOUND:", matched
        if mismatches:
            print "      ", ''.join(' ' if i not in mismatches else '*' 
                            for i,c in enumerate(searchseq.sequence))
        else:
            print "<exact match>"
        print "at location", startLoc
        print
    print

I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.

This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:

scf_1104442825154 (964)
------------------------

scf_1104442822828 (942)
------------------------

scf_1104442824510 (987)
------------------------

scf_1104442823180 (1065)
------------------------
...

I was getting discouraged at not seeing any matches, until:

scf_1104442823952 (1188)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAACGGAATCGAATGGAAT
                *      *        
at location 33

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 175

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 474

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 617

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATAGAAT
                       *   *    
at location 718

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGATTCGAATGGAAT
                    *  *        
at location 896

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGTAT
                       *     *  
at location 945

And finally my exact match at:

scf_1104442823584 (1448)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGACTCGAATGGAAT
                    *  *        
at location 177

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCAAATGGAAT
                       *        
at location 203

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 350

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAA
                       *       *
at location 523

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 822

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCTAATGGAAT
<exact match>
at location 848

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCGTCGAATGGAGTCTAATGGAAT
          *         *           
at location 969

So while this didn't set any speed records, I got the job done, and found some 2-matches too, in case they might be of interest.

For comparison, here is an RE-based version, that finds 1-mismatch matches only:

import re
seqStr = "ATCATCGAATGGAATCTAATGGAAT"
searchSeqREStr = seqStr + '|' + \
    '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:] 
             for i,c in enumerate(seqStr))

searchSeqRE = re.compile(searchSeqREStr)

for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for match in searchSeqRE.finditer(g.gene):
        print "MATCH:", seqStr
        print "FOUND:", match.group(0)
        print "at location", match.start()
        print
    print

(At first, I tried searching the raw FASTA file source itself, but was puzzled why there were so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the FASTA file output is wrapped at n characters.)

So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.
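The line-wrapping problem mentioned above can also be avoided without pyparsing by joining each record's lines before searching. The following is a minimal sketch under simple assumptions: header lines start with `>`, everything else is sequence data, and the function name `read_fasta` and the sample records are invented for the example.

```python
def read_fasta(lines):
    """Return {header: sequence} with each record's wrapped lines joined."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Store the final record
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up sample: the first sequence is wrapped across two lines
sample = [">scf_1 | length=8", "ACGT", "TTGA", ">scf_2 | length=4", "NNNN"]
records = read_fasta(sample)
```

With a real file, an open file object can be passed in place of the list, since the function only iterates over lines. A match that straddles a line break in the file then sits inside one contiguous string.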

轻奢々

2020-11-30 03:22
The Python regex library supports fuzzy regular expression matching. One advantage over TRE is that it can find all matches of a regular expression in the text (it supports overlapping matches as well).
```
>>> import regex
>>> regex.findall("AA", "CAG")
[]
>>> regex.findall("(AA){e<=1}", "CAAG")  # {e<=1} means allow up to 1 error
['CA', 'AG']
```