Search for string allowing for one mismatch in any location of the string

闹比i 2020-11-30 02:45

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi

13 answers
  • 2020-11-30 03:08

    You can use the regex matching library TRE for "approximate matching". It has bindings for Python, Perl and Haskell.

    import tre
    
    pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)
    data = """
    In addition to fundamental contributions in several branches of
    theoretical computer science, Donnald Erwin Kuth is the creator of
    the TeX computer typesetting system, the related METAFONT font
    definition language and rendering system, and the Computer Modern
    family of typefaces.
    """
    
    fz = tre.Fuzzyness(maxerr = 3)
    print(fz)
    m = pt.search(data, fz)
    
    if m:
        print(m.groups())
        print(m[0])
    

    which will output

    tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
    ((95, 113), (99, 108), (102, 108))
    Donnald Erwin Kuth
    

    http://en.wikipedia.org/wiki/TRE_%28computing%29

    http://laurikari.net/tre/

  • 2020-11-30 03:09

    This question is quite old, but perhaps this simple solution could work. Loop through the sequence, taking 25-character slices. Convert each slice to a NumPy array and compare it to the 25-character search string (also as a NumPy array). Sum the result of the comparison; if the sum is 24 (exactly one mismatch; 25 would be an exact match), print the position in the loop and the mismatch.

    The next few lines show it working:

    >>> import numpy as np
    >>> a = ['A','B','C']
    >>> b = np.array(a)
    >>> b
    array(['A', 'B', 'C'], dtype='<U1')
    >>> c = ['A','D','C']
    >>> d = np.array(c)
    >>> b==d
    array([ True, False, True])
    >>> sum(b==d)
    2
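
    The slicing loop described above could be sketched as follows. This is only an illustration, not the answerer's code: the function name `find_near_matches`, its parameters and the toy sequences are assumptions, and a plain Python loop like this will be slow on a full genome.

```python
import numpy as np


def find_near_matches(genome, probe, max_mismatches=1):
    """Return (position, mismatch_offsets) for each window of `genome`
    that differs from `probe` in at most `max_mismatches` positions.

    Hypothetical helper for illustration; both arguments are ASCII strings.
    """
    g = np.frombuffer(genome.encode("ascii"), dtype=np.uint8)
    p = np.frombuffer(probe.encode("ascii"), dtype=np.uint8)
    hits = []
    for start in range(len(g) - len(p) + 1):
        # element-wise comparison of the current slice with the probe
        equal = g[start:start + len(p)] == p
        if equal.sum() >= len(p) - max_mismatches:
            hits.append((start, np.flatnonzero(~equal).tolist()))
    return hits


# toy data: an exact copy of the probe at 2 and a one-mismatch copy at 8
hits = find_near_matches("TTACGTTTAGGTTT", "ACGT")
print(hits)
```

    Each hit pairs the start position with the offsets (relative to the probe) where the slice differs, so an exact match has an empty list.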

  • 2020-11-30 03:11

    I guess this may come a bit late, but there is a tool named PatMaN that does exactly what you want: http://bioinf.eva.mpg.de/patman/

  • 2020-11-30 03:12

    You could use Python's built-in regular expression matching to do the search.

    The re module in Python: http://docs.python.org/library/re.html

    A regular expression primer: http://www.regular-expressions.info/
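
    One way to express a single-mismatch search with the standard `re` module is to build an alternation in which each position in turn is replaced by `.`. The sketch below is an illustration under assumptions: the helper name `one_mismatch_pattern` and the toy sequences are hypothetical, the sequence is assumed to contain only letters, and the lookahead wrapper is there only so that overlapping hits are reported too.

```python
import re


def one_mismatch_pattern(seq):
    """Compile a pattern matching `seq` with at most one substituted character.

    Hypothetical helper for illustration; `seq` must contain only letters.
    """
    # one alternative per position, with that position turned into a wildcard
    alternatives = [seq[:i] + "." + seq[i + 1:] for i in range(len(seq))]
    # zero-width lookahead so that overlapping occurrences are all found
    return re.compile("(?=(" + "|".join(alternatives) + "))")


pattern = one_mismatch_pattern("ACGT")
hits = [(m.start(), m.group(1)) for m in pattern.finditer("TTACGTTTAGGTTT")]
print(hits)
```

    Because `.` also matches the original base, exact matches are covered by the same pattern. For a 25-base sequence this gives 25 alternatives, which `re` handles without trouble.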

  • 2020-11-30 03:20

    I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close: a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158 MB in size. I used the following pyparsing expression to extract the gene sequences; it took just under 2 minutes:

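    from pyparsing import (Word, nums, printables, Group, SkipTo, LineEnd,
                           Combine, OneOrMore, Token, ParseException)
    
    fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
    fastasrc = open(fname).read()   # yes! just read the whole dang 158Mb!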
    
    """
    Sample header:
    >gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
    """
    integer = Word(nums).setParseAction(lambda t:int(t[0]))
    genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") + 
                    "length=" + integer("genelen") + LineEnd() + 
                    Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))
    
    # read gene data from .fasta file - takes just under a couple of minutes
    genedata = OneOrMore(genebit).parseString(fastasrc)
    

    (Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)

    Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:

    class CloseMatch(Token):
        def __init__(self, seq, maxMismatches=1):
            super(CloseMatch,self).__init__()
            self.name = seq
            self.sequence = seq
            self.maxMismatches = maxMismatches
            self.errmsg = "Expected " + self.sequence
            self.mayIndexError = False
            self.mayReturnEmpty = False
    
        def parseImpl( self, instring, loc, doActions=True ):
            start = loc
            instrlen = len(instring)
            maxloc = start + len(self.sequence)
    
            if maxloc <= instrlen:
                seq = self.sequence
                seqloc = 0
                mismatches = []
                throwException = False
                done = False
                while loc < maxloc and not done:
                    if instring[loc] != seq[seqloc]:
                        mismatches.append(seqloc)
                        if len(mismatches) > self.maxMismatches:
                            throwException = True
                            done = True
                    loc += 1
                    seqloc += 1
            else:
                throwException = True
    
            if throwException:
                # report the failed match at the current location
                raise ParseException(instring, loc, self.errmsg, self)
    
            return loc, (instring[start:loc],mismatches)
    

    For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)
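
    To make the return value concrete, here is a small pure-Python sketch of the same comparison, independent of pyparsing. It is not the class above: the names `close_match_at` and `marker_line` are hypothetical, and the input is taken from the sample output in this answer with two extra bases added on each side.

```python
def close_match_at(text, loc, seq, max_mismatches=1):
    """Compare `seq` with text[loc:loc + len(seq)].

    Returns (matched_text, mismatch_offsets), or None when the window is
    too short or has more than `max_mismatches` differences.
    """
    window = text[loc:loc + len(seq)]
    if len(window) < len(seq):
        return None
    mismatches = []
    for offset, (found, expected) in enumerate(zip(window, seq)):
        if found != expected:
            mismatches.append(offset)
            if len(mismatches) > max_mismatches:
                return None  # stop at the first mismatch over the limit
    return window, mismatches


def marker_line(length, mismatches):
    """Build a line with '*' under each mismatching offset."""
    return "".join("*" if i in mismatches else " " for i in range(length))


seq = "ATCATCGAATGGAATCTAATGGAAT"
result = close_match_at("GGATCATCGAACGGAATCGAATGGAATCC", 2, seq, 2)
matched, mismatches = result
print("MATCH:", seq)
print("FOUND:", matched)
print("      ", marker_line(len(seq), mismatches))
```

    Here `mismatches` holds offsets into the search sequence, which is why the marker line can be printed directly under the FOUND line.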

    I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):

    searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
    for g in genedata:
        print("%s (%d)" % (g.id, g.genelen))
        print("-"*24)
        for t,startLoc,endLoc in searchseq.scanString(g.gene):
            matched, mismatches = t[0]
            print("MATCH:", searchseq.sequence)
            print("FOUND:", matched)
            if mismatches:
                print("      ", ''.join(' ' if i not in mismatches else '*' 
                                for i,c in enumerate(searchseq.sequence)))
            else:
                print("<exact match>")
            print("at location", startLoc)
            print()
        print()
    

    I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.

    This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:

    scf_1104442825154 (964)
    ------------------------
    
    scf_1104442822828 (942)
    ------------------------
    
    scf_1104442824510 (987)
    ------------------------
    
    scf_1104442823180 (1065)
    ------------------------
    ...
    

    I was getting discouraged at not seeing any matches, until:

    scf_1104442823952 (1188)
    ------------------------
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAACGGAATCGAATGGAAT
                    *      *        
    at location 33
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAT
                           *        
    at location 175
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAT
                           *        
    at location 474
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAT
                           *        
    at location 617
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATAGAAT
                           *   *    
    at location 718
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGATTCGAATGGAAT
                        *  *        
    at location 896
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGTAT
                           *     *  
    at location 945
    

    And finally my exact match at:

    scf_1104442823584 (1448)
    ------------------------
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGACTCGAATGGAAT
                        *  *        
    at location 177
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCAAATGGAAT
                           *        
    at location 203
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCAAATGGAATCGAATGGAAT
                 *         *        
    at location 350
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAA
                           *       *
    at location 523
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCAAATGGAATCGAATGGAAT
                 *         *        
    at location 822
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCTAATGGAAT
    <exact match>
    at location 848
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCGTCGAATGGAGTCTAATGGAAT
              *         *           
    at location 969
    

    So while this didn't set any speed records, I got the job done, and found some 2-matches too, in case they might be of interest.

    For comparison, here is an RE-based version, that finds 1-mismatch matches only:

    import re
    seqStr = "ATCATCGAATGGAATCTAATGGAAT"
    searchSeqREStr = seqStr + '|' + \
        '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:] 
                 for i,c in enumerate(seqStr))
    
    searchSeqRE = re.compile(searchSeqREStr)
    
    for g in genedata:
        print("%s (%d)" % (g.id, g.genelen))
        print("-"*24)
        for match in searchSeqRE.finditer(g.gene):
            print("MATCH:", seqStr)
            print("FOUND:", match.group(0))
            print("at location", match.start())
            print()
        print()
    

    (At first, I tried searching the raw FASTA file source itself, but was puzzled why there were so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the FASTA file output is wrapped at n characters.)

    So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.
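
    Because the sequence lines in a FASTA file are wrapped, each record has to be joined into one string before searching. A minimal sketch of that step using only the standard library, assuming a simple FASTA layout (a `>` header line followed by sequence lines); the function name `read_fasta` and the sample records are made up for illustration:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    `lines` is any iterable of text lines, such as an open file.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# made-up sample: the first record is wrapped across two lines
sample = """>seq1 example
ACGTAC
GTAC
>seq2 example
NNNNAC
"""
records = list(read_fasta(sample.splitlines()))
print(records)
```

    With a real file, passing the open file object (for example `read_fasta(open(fname))`) gives the same pairs one record at a time, so a match that spans a line break in the file is no longer split.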

  • 2020-11-30 03:22

    The third-party Python regex library supports fuzzy regular expression matching. One advantage over TRE is that it can find all matches of a regular expression in the text (overlapping matches are supported as well).

    >>> import regex
    >>> m = regex.findall("AA", "CAG")
    >>> m
    []
    >>> m = regex.findall("(AA){e<=1}", "CAAG")  # allow up to 1 error
    >>> m
    ['CA', 'AG']
    