Matching incorrectly spelt words with correct ones in python

后端未结

关注

 5  907

I\'m building an app that gets incoming SMSs, then based on a keyword, it looks to see if that keyword is associated with any campaigns that it is running. The way I\'m doin

相关标签:

5条回答

执笔经年

2021-01-01 06:12
What you're looking for is Levenshtein Distance.

Assuming your list of campaign isn't too large, you can calculate the distance between the input word and that of each campaign then select the one with the shortest. To filter out completely wrong words you might need to set a minimum acceptable distance and discard the input if the shortest is still beyond the limit.

To calculate the distance between two words, you can try one of these modules:
- levenshtein.py
- python-Levenshtein.
- py-editdist
For example, using levenshtein.py:
```
from levenshtein import levenshtein
campaigns = (
    "HAMSTER",
    "TWO WORDED",
    "FRIDAY",
)

def get_campaign(word):
  return min(campaigns, key=lambda x: levenshtein(word, x))
```
Usage:
```
>>> get_campaign("HAMSTA")
'HAMSTER'
>>> get_campaign("HAM WORDED")
'TWO WORDED'
>>> get_campaign("FROODY")
'FRIDAY'
>>> get_campaign("FRIDAY")
'FRIDAY'
```
Note that is a very simple-minded approach and will always return something even if the input is completely different.
0 讨论(0)
发布评论:

提交评论
- 加载中...

自闭症患者

2021-01-01 06:13

I use levenshtein distance to solve similar problem see http://en.wikipedia.org/wiki/Levenshtein_distance

def distance(u1, u2):
    try:
        s1 = unicode(u1)    
        s2 = unicode(u2)
    except:
        s1 = u1
        s2 = u2        
    if len(s1) < len(s2):
        return distance(u2, u1)
    if not s1:
        return len(s2)

    previous_row = xrange(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1       # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]


distance("hamstir", "hamster") < 3
True
distance("god", "hamster") < 3
False

0 讨论(0)

慢半拍i

2021-01-01 06:19
The simplest solution is to use the difflib package, which has a get_close_matches function for approximate string matching:
```
import difflib
difflib.get_close_matches(word, possibilities)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

时光说笑

2021-01-01 06:32

You could use a fuzzy matching and a named list with regex library e.g., to find any phrase from a list with at most one error (insertion, deletion, substitution):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex as re # pip install regex

words = ["first word", "second word", "third"]
sms = u"junk Furﬆ Word second Third"

for m in re.finditer(ur"(?fie)\L<words>{e<=1}", sms, words=words):
    print(m[0]) # the match
    print(m.span()) # return indexes where the match found in the sms
    # to find out which of the words matched:
    print(next(w for w in words
               if re.match(ur"(?fi)(?:%s){e<=1}" % re.escape(w), m[0])))

Output

Furﬆ Word
(5, 14)
first word
Third
(22, 27)
third

Or you could iterate over the words directly:

for w in words:
    for m in re.finditer(ur"(?fie)(?:%s){e<=1}" % re.escape(w), sms):
        print(m[0])
        print(m.span())
        print(w)

It produces the same output as the first example.

0 讨论(0)

梦如初夏

2021-01-01 06:34

It seems to me that you're trying to build a spell checker. You could use minimum edit distance matching. Alternatively, look at Peter Norvig's python spell checker

Hope that helps

0 讨论(0)
发布评论:

提交评论
- 加载中...