Matching incorrectly spelt words with correct ones in Python

佛祖请我去吃肉 2021-01-01 05:36

I'm building an app that gets incoming SMSs, then based on a keyword, it looks to see if that keyword is associated with any campaigns that it is running. The way I'm doin

5 Answers
  • 2021-01-01 06:12

    What you're looking for is Levenshtein Distance.

    Assuming your list of campaigns isn't too large, you can calculate the distance between the input word and each campaign keyword, then select the campaign with the shortest distance. To filter out completely wrong words, you might need to set a maximum acceptable distance and discard the input if the shortest distance still exceeds that limit.

    To calculate the distance between two words, you can try one of these modules:

    • levenshtein.py
    • python-Levenshtein
    • py-editdist

    For example, using levenshtein.py:

    from levenshtein import levenshtein
    campaigns = (
        "HAMSTER",
        "TWO WORDED",
        "FRIDAY",
    )
    
    def get_campaign(word):
      return min(campaigns, key=lambda x: levenshtein(word, x))
    

    Usage:

    >>> get_campaign("HAMSTA")
    'HAMSTER'
    >>> get_campaign("HAM WORDED")
    'TWO WORDED'
    >>> get_campaign("FROODY")
    'FRIDAY'
    >>> get_campaign("FRIDAY")
    'FRIDAY'
    

    Note that this is a very simple-minded approach: it will always return something, even if the input is completely different from every campaign.
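
    The distance limit mentioned earlier in this answer could be sketched as follows. This is only an illustration: edit_distance is a hand-written pure-Python stand-in for the modules listed above, and get_campaign_or_none, MAX_DISTANCE, and the upper-casing of the input are assumptions, not part of the original answer.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


CAMPAIGNS = ("HAMSTER", "TWO WORDED", "FRIDAY")
MAX_DISTANCE = 2  # assumed cutoff; tune it for your own keywords


def get_campaign_or_none(word, campaigns=CAMPAIGNS, max_distance=MAX_DISTANCE):
    """Return the closest campaign, or None if even the closest is too far."""
    word = word.upper()  # assumption: campaign keywords are stored upper-case
    distance, best = min((edit_distance(word, c), c) for c in campaigns)
    return best if distance <= max_distance else None
```

    With a cutoff of 2, get_campaign_or_none("HAMSTA") still returns 'HAMSTER', while "FROODY" (three edits away from "FRIDAY") is rejected and returns None.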

  • 2021-01-01 06:13

    I use the Levenshtein distance to solve a similar problem; see http://en.wikipedia.org/wiki/Levenshtein_distance

    def distance(s1, s2):
        """Return the Levenshtein distance between two strings."""
        # Recurse with the arguments swapped so that s1 is never the shorter string.
        if len(s1) < len(s2):
            return distance(s2, s1)
        if not s1:
            return len(s2)  # s1 is the longer string, so both are empty here
    
        # previous_row[j] is the distance between the processed prefix of s1 and s2[:j].
        previous_row = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1):
            current_row = [i + 1]
            for j, c2 in enumerate(s2):
                # j + 1 instead of j: the rows are one element longer than s2.
                insertions = previous_row[j + 1] + 1
                deletions = current_row[j] + 1
                substitutions = previous_row[j] + (c1 != c2)
                current_row.append(min(insertions, deletions, substitutions))
            previous_row = current_row
    
        return previous_row[-1]
    
    
    >>> distance("hamstir", "hamster") < 3
    True
    >>> distance("god", "hamster") < 3
    False
    
  • 2021-01-01 06:19

    The simplest solution is to use the standard-library difflib module, which has a get_close_matches function for approximate string matching:

    import difflib
    difflib.get_close_matches(word, possibilities)
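
    As a rough sketch of how this might apply to the campaign keywords, the example below asks for at most one match and treats an empty result as "no campaign". The match_campaign helper, the CAMPAIGNS list, and the upper-casing of the input are assumptions made for illustration; n and cutoff are real parameters of get_close_matches, and 0.6 is its default cutoff.

```python
import difflib

CAMPAIGNS = ["HAMSTER", "TWO WORDED", "FRIDAY"]  # hypothetical keyword list


def match_campaign(word, campaigns=CAMPAIGNS, cutoff=0.6):
    """Return the best-scoring campaign, or None if nothing reaches the cutoff."""
    # n=1 keeps only the best match; cutoff is a similarity ratio in [0, 1].
    matches = difflib.get_close_matches(word.upper(), campaigns, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

    Unlike the minimum-distance approach, this returns None for input such as "XYZ" that resembles none of the keywords. Raising cutoff makes the matching stricter.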
    
  • 2021-01-01 06:32

    You could use fuzzy matching and a named list with the regex library, e.g., to find any phrase from a list with at most one error (insertion, deletion, or substitution):

    #!/usr/bin/env python3
    import regex as re # pip install regex
    
    words = ["first word", "second word", "third"]
    sms = "junk Furst Word second Third"
    
    # Flags: f = full case-folding, i = ignore case, e = enhance the fuzzy match.
    # \L<words> matches any item of the named list; {e<=1} allows at most one error.
    for m in re.finditer(r"(?fie)\L<words>{e<=1}", sms, words=words):
        print(m[0]) # the match
        print(m.span()) # start and end indexes of the match in the sms
        # to find out which of the words matched:
        print(next(w for w in words
                   if re.match(r"(?fi)(?:%s){e<=1}" % re.escape(w), m[0])))
    

    Output

    Furst Word
    (5, 15)
    first word
    Third
    (23, 28)
    third
    

    Or you could iterate over the words directly:

    for w in words:
        for m in re.finditer(r"(?fie)(?:%s){e<=1}" % re.escape(w), sms):
            print(m[0])
            print(m.span())
            print(w)
    

    It produces the same output as the first example.

  • 2021-01-01 06:34

    It seems to me that you're trying to build a spell checker. You could use minimum edit distance matching. Alternatively, look at Peter Norvig's Python spell checker.
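
    A minimal sketch in the spirit of that spell checker, not Norvig's actual code: generate every string one edit away from the input (deletion, transposition, replacement, insertion) and keep those that are known keywords. The CAMPAIGNS set, the correct helper, the upper-casing, and the alphabetical tie-break are assumptions for illustration.

```python
import string

CAMPAIGNS = {"HAMSTER", "TWO WORDED", "FRIDAY"}  # hypothetical keyword set


def edits1(word, alphabet=string.ascii_uppercase + " "):
    """Return all strings exactly one simple edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, known=CAMPAIGNS):
    """Return word if known, else a known keyword one edit away, else None."""
    word = word.upper()
    if word in known:
        return word
    candidates = sorted(edits1(word) & known)  # alphabetical tie-break
    return candidates[0] if candidates else None
```

    Here correct("FRIDYA") gives 'FRIDAY' through a single transposition, while correct("HAMSTA") gives None because "HAMSTER" is two edits away. Covering such cases would mean applying edits1 a second time, at a much higher cost.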

    Hope that helps
