I\'m building an app that gets incoming SMSs, then based on a keyword, it looks to see if that keyword is associated with any campaigns that it is running. The way I\'m doin
What you're looking for is Levenshtein Distance.
Assuming your list of campaign isn't too large, you can calculate the distance between the input word and that of each campaign then select the one with the shortest. To filter out completely wrong words you might need to set a minimum acceptable distance and discard the input if the shortest is still beyond the limit.
To calculate the distance between two words, you can try one of these modules:
For example, using levenshtein.py
:
from levenshtein import levenshtein
campaigns = (
"HAMSTER",
"TWO WORDED",
"FRIDAY",
)
def get_campaign(word):
return min(campaigns, key=lambda x: levenshtein(word, x))
Usage:
>>> get_campaign("HAMSTA")
'HAMSTER'
>>> get_campaign("HAM WORDED")
'TWO WORDED'
>>> get_campaign("FROODY")
'FRIDAY'
>>> get_campaign("FRIDAY")
'FRIDAY'
Note that is a very simple-minded approach and will always return something even if the input is completely different.
I use levenshtein distance to solve similar problem see http://en.wikipedia.org/wiki/Levenshtein_distance
def distance(u1, u2):
try:
s1 = unicode(u1)
s2 = unicode(u2)
except:
s1 = u1
s2 = u2
if len(s1) < len(s2):
return distance(u2, u1)
if not s1:
return len(s2)
previous_row = xrange(len(s2) + 1)
for i, c1 in enumerate(s1):
current_row = [i + 1]
for j, c2 in enumerate(s2):
insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
deletions = current_row[j] + 1 # than s2
substitutions = previous_row[j] + (c1 != c2)
current_row.append(min(insertions, deletions, substitutions))
previous_row = current_row
return previous_row[-1]
distance("hamstir", "hamster") < 3
True
distance("god", "hamster") < 3
False
The simplest solution is to use the difflib package, which has a get_close_matches
function for approximate string matching:
import difflib
difflib.get_close_matches(word, possibilities)
You could use a fuzzy matching and a named list with regex library e.g., to find any phrase from a list with at most one error (insertion, deletion, substitution):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex as re # pip install regex
words = ["first word", "second word", "third"]
sms = u"junk Furst Word second Third"
for m in re.finditer(ur"(?fie)\L<words>{e<=1}", sms, words=words):
print(m[0]) # the match
print(m.span()) # return indexes where the match found in the sms
# to find out which of the words matched:
print(next(w for w in words
if re.match(ur"(?fi)(?:%s){e<=1}" % re.escape(w), m[0])))
Furst Word
(5, 14)
first word
Third
(22, 27)
third
Or you could iterate over the words directly:
for w in words:
for m in re.finditer(ur"(?fie)(?:%s){e<=1}" % re.escape(w), sms):
print(m[0])
print(m.span())
print(w)
It produces the same output as the first example.
It seems to me that you're trying to build a spell checker. You could use minimum edit distance matching. Alternatively, look at Peter Norvig's python spell checker
Hope that helps