Say I have the following two strings in my database:
(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'
My software gets a search string (for example 'Alabama University') and computes the Levenshtein distance between that input and each database entry. The distance to (2) ends up smaller than the distance to (1), even though (1) is the entry I actually want to match.
You could try Word Mover's Distance (https://github.com/mkusner/wmd) instead. One advantage of this algorithm is that it takes the implied meaning of words into account when computing the difference between documents, not just their surface form. The paper can be found here.
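If you want to experiment with it quickly without wiring up the reference implementation linked above, gensim exposes WMD on its word-embedding models; here is a rough sketch (the embedding file name is only an example, and you need a pretrained word2vec-style model plus the POT/pyemd dependency installed):

# Sketch only: Word Mover's Distance via gensim, assuming a pretrained embedding
# file is available locally (the file name below is just an example).
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

query = "alabama university".split()
entries = [
    "levi watkins learning center alabama state university".split(),
    "eth library".split(),
]

# Lower distance means closer in meaning, not merely closer in spelling.
for entry in entries:
    print(model.wmdistance(query, entry), entry)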
First, your distance score needs to be adjusted based on the length of the database entry and/or input. A distance of 5 against an expression of 10 characters is much worse than a distance of 5 against an expression of 100 characters.
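A minimal way to do that adjustment, assuming the python-Levenshtein package (any edit-distance implementation works), is to divide the distance by the length of the longer string:

import Levenshtein  # assumed: pip install python-Levenshtein

def normalized_distance(query, entry):
    # 0.0 means identical, 1.0 means completely different, regardless of entry length
    return Levenshtein.distance(query, entry) / max(len(query), len(entry), 1)

print(normalized_distance("Alabama University", "Levi Watkins Learning Center - Alabama State University"))
print(normalized_distance("Alabama University", "ETH Library"))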
But the main problem with your approach is that plain Levenshtein is not a substring matching algorithm. It compares all of one string with all of another string. Your big distance in case (1) is due to the large number of words in the database expression that are not in the input expression.
To get around that you are better off using an algorithm that can match substrings such as Fuzzy Bitap or Smith–Waterman.
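As an illustration of the substring idea (not Bitap or Smith–Waterman themselves, but the same spirit), the third-party Python regex module can search for an approximate occurrence of the input anywhere inside a longer string:

import regex  # third-party "regex" module with fuzzy matching support

entries = [
    "Levi Watkins Learning Center - Alabama State University",
    "ETH Library",
]

# Look for "alabama university" anywhere in the entry, allowing up to 6 edits in total.
pattern = regex.compile(r"(?:alabama university){e<=6}", regex.IGNORECASE | regex.BESTMATCH)

for entry in entries:
    match = pattern.search(entry)
    print(entry, "->", match.fuzzy_counts if match else "no match")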
If you have to use Levenshtein or something similar, you probably want to use it to compare word against word and then generate a score based on the number of matching words and the quality of the matches.
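A rough sketch of that word-by-word scoring, using only the standard library (difflib's ratio stands in for a per-word Levenshtein similarity):

from difflib import SequenceMatcher

def word_score(query, entry):
    # For every query word take its best similarity against any entry word, then average;
    # 1.0 means each query word has a near-exact counterpart somewhere in the entry.
    query_words = query.lower().split()
    entry_words = entry.lower().split()
    best = [
        max(SequenceMatcher(None, q, e).ratio() for e in entry_words)
        for q in query_words
    ]
    return sum(best) / len(best)

print(word_score("Alabama University", "Levi Watkins Learning Center - Alabama State University"))  # close to 1.0
print(word_score("Alabama University", "ETH Library"))  # much lower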
I guess the answer is no longer needed, but I liked the question and it got me thinking about how to combine the advantages of regular expressions with the Levenshtein string metric while being less dependent on the raw distance.
So far I have come up with a parser that follows these premises and this logic:
- the needle that is searched for is stripped of its punctuation characters
- the haystack is also stripped of its punctuation characters, so N.A.S.A becomes NASA - just like in the needle if it was originally N.A.S.A. - I know this can be problematic for quite some scenarios, but given the premises I couldn't come up with a better solution
- needle words that are not at least 3 characters long are removed (no need for is, on, at, no, etc.)
- the needle is split into word groups containing n items; n is defined within a dict with keys 0 < k <= l, where l is the number of words in the needle and the largest key k <= l decides
- the words of a word group must follow each other, with a maximum distance of n words between them
- words of length n can have a different allowed error threshold: the total errors (e), substitutions (s), inserts (i) and deletions (d) can each be specified, again with a dict whose keys satisfy 0 < k <= n
Online demo here
contextual_fuzzy_matcher.py:
from collections import OrderedDict

import regex


class ContextualFuzzyMatcher(object):
    # maximum number of filler words allowed between two matched words
    maximum_word_distance = 2
    word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)
    punctuation = regex.compile(r"[\u2000-\u206F\u2E00-\u2E7F\\'!\"#$%&\(\)\*\+,\-\.\/:;<=>\?@\[\]\^_`\{\|\}~]")
    # group size depending on needle length (in words); the largest key <= length wins
    groups = OrderedDict((
        (0, lambda l: l),
        (4, lambda l: 3),
        (8, lambda l: 6),
        (10, lambda l: l * 3 // 4),  # roughly three quarters of the needle length
    ))
    # allowed errors depending on word length: e = total errors, s = substitutions,
    # i = inserts, d = deletions; the largest key <= word length wins
    tolerances = OrderedDict((
        (0, {
            'e': lambda l: 0,
            's': lambda l: 0,
            'i': lambda l: 0,
            'd': lambda l: 0,
        }),
        (3, {
            'e': lambda l: 1,
            's': lambda l: 1,
            'i': lambda l: 1,
            'd': lambda l: 1,
        }),
        (6, {
            'e': lambda l: 2,
            's': lambda l: 1,
            'i': lambda l: 1,
            'd': lambda l: 1,
        }),
        (9, {
            'e': lambda l: 3,
            's': lambda l: 2,
            'i': lambda l: 2,
            'd': lambda l: 2,
        }),
        (12, {
            'e': lambda l: l // 4,
            's': lambda l: l // 6,
            'i': lambda l: l // 6,
            'd': lambda l: l // 6,
        }),
    ))

    def __init__(self, needle):
        self.sentence = needle
        self.words = self.sentence_to_words(self.sentence)
        self.words_len = len(self.words)
        self.group_size = self.get_group_size()
        self.word_groups = self.get_word_groups()
        self.regexp = self.get_regexp()

    def sentence_to_words(self, sentence):
        # strip punctuation, collapse whitespace and drop words shorter than 3 characters
        sentence = regex.sub(self.punctuation, "", sentence)
        sentence = regex.sub(" +", " ", sentence)
        return [word for word in sentence.split(' ') if len(word) > 2]

    def get_group_size(self):
        # pick the group-size function for the largest key that is <= the word count
        return list(value for key, value in self.groups.items() if self.words_len >= key)[-1](self.words_len)

    def get_word_groups(self):
        # all consecutive runs of group_size words from the needle
        return [self.words[i:i + self.group_size] for i in range(self.words_len - self.group_size + 1)]

    def get_tolerance(self, word_len):
        # pick the tolerance set for the largest key that is <= the word length
        return list(value for key, value in self.tolerances.items() if word_len >= key)[-1]

    def get_regexp(self):
        # build one fuzzy sub-pattern per word group and join them with "|"
        combinations = []
        for word_group in self.word_groups:
            distants = []
            for word in word_group:
                word_len = len(word)
                tolerance = self.get_tolerance(word_len)
                distants.append(r"({}){{e<={},s<={},i<={},d<={}}}".format(
                    word,
                    tolerance['e'](word_len),
                    tolerance['s'](word_len),
                    tolerance['i'](word_len),
                    tolerance['d'](word_len),
                ))
            combinations.append(
                self.word_distance.join(distants)
            )
        return regex.compile(
            r"|".join(combinations),
            regex.MULTILINE | regex.IGNORECASE
        )

    def findall(self, haystack):
        return self.regexp.findall(haystack)
main.py:
test_sentences = [
'Levi Watkins Learning Center - Alabama State University',
'ETH Library'
]
test_texts = [
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sapien eget mi proin sed libero enim sed. Nec tincidunt praesent semper feugiat nibh sed pulvinar. Habitasse platea dictumst quisque sagittis. Tortor condimentum lacinia quis vel eros donec ac odio. Platea dictumst vestibulum rhoncus est pellentesque elit ullamcorper dignissim. Ultricies tristique nulla aliquet enim tortor at. Mi proin sed libero enim sed faucibus. Fames ac turpis egestas integer eget aliquet nibh. Potenti nullam ac tortor vitae purus faucibus ornare suspendisse. Cras semper auctor neque vitae tempus quam pellentesque nec. Quam lacus suspendisse faucibus interdum posuere. Neque laoreet suspendisse interdum consectetur libero id faucibus nisl tincidunt. Viverra tellus in hac habitasse. Nibh nisl condimentum id venenatis a condimentum vitae. Tincidunt dui ut ornare lectus."
"Mattis aliquam faucibus purus in massa tempor nec feugiat nisl. Amet consectetur adipiscing elit ut aliquam purus. Turpis massa tincidunt dui ut ornare. Suscipit tellus mauris a diam maecenas sed enim ut sem. Id consectetur purus ut faucibus pulvinar elementum. Est velit egestas dui id. Felis imperdiet proin fermentum leo. Faucibus nisl tincidunt eget nullam non nisi est sit. Elit pellentesque habitant morbi tristique. Nisi lacus sed viverra tellus. Morbi tristique senectus et netus et malesuada fames. Id diam vel quam elementum pulvinar. Id nibh tortor id aliquet lectus. Sem integer vitae justo eget magna. Quisque sagittis purus sit amet volutpat consequat. Auctor elit sed vulputate mi sit amet. Venenatis lectus magna fringilla urna porttitor rhoncus dolor purus. Adipiscing diam donec adipiscing tristique risus nec feugiat in fermentum. Bibendum est ultricies integer quis."
"Interdum posuere lorem ipsum dolor sit. Convallis convallis tellus id interdum velit. Sollicitudin aliquam ultrices sagittis orci a scelerisque purus. Vel quam elementum pulvinar etiam. Adipiscing bibendum est ultricies integer quis. Tellus molestie nunc non blandit. Sit amet porttitor eget dolor morbi non arcu. Scelerisque purus semper eget duis at tellus. Diam maecenas sed enim ut sem viverra. Vulputate odio ut enim blandit volutpat maecenas. Faucibus purus in massa tempor nec. Bibendum ut tristique et egestas quis ipsum suspendisse. Ut aliquam purus sit amet luctus venenatis lectus magna. Ac placerat vestibulum lectus mauris ultrices eros in cursus turpis. Feugiat pretium nibh ipsum consequat nisl vel pretium. Elit pellentesque habitant morbi tristique senectus et.",
"Found at ETH's own Library", # ' will be a problem - it adds one extra deletion
"State University of Alabama has a learning center called Levi Watkins",
"The ETH library is not to be confused with Alabama State university's Levi Watkins Learning center",
"ETH Library",
"Alabma State Unversity",
"Levi Wtkins Learning"
]
for test_sentence in test_sentences:
    parser = ContextualFuzzyMatcher(test_sentence)
    for test_text in test_texts:
        for match in parser.findall(test_text):
            print(match)
returns:
('', '', '', '', '', '', '', '', '', '', '', '', ' Alabama', 'State', 'university')
(' Levi', 'Watkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
('', '', '', '', '', '', '', '', '', '', '', '', 'Alabma', 'State', 'Unversity')
('Levi', 'Wtkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
(' ETH', 'library')
('ETH', 'Library')
I am fully aware that this is far from a perfect solution and that my examples were few and not really representative - but maybe, by tweaking the configuration and doing a lot of real-world tests, it can cover quite a lot of cases without generating too many false positives. Also, since it is class-based, it can be inherited and configured differently for different sources - maybe a maximum word distance of 1 is sufficient for scientific texts, while newspaper articles might need 3, and so on.
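For example, a hypothetical subclass for newspaper sources only needs to override the class attributes (note that word_distance is built from maximum_word_distance at class-definition time, so it has to be restated as well):

class NewspaperFuzzyMatcher(ContextualFuzzyMatcher):
    # hypothetical configuration: allow up to three filler words between matched words
    maximum_word_distance = 3
    word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)

parser = NewspaperFuzzyMatcher('Levi Watkins Learning Center - Alabama State University')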
You should change your approach:
Levenshtein distance is good at calculating similarity in units, whether those units are characters or words.
Conceptually, you are treating Alabama and University (2 words) as 2 units, and you want the distance between those words; a word-level Levenshtein distance would then express how many words lie between Alabama and University, which is 1.
But you are applying a Levenshtein implementation that works on characters within a word. Such an implementation can only match single words, not sentences.
It is better to implement your own Levenshtein algorithm (for example backed by a BK-tree) for words at the top level, and within each match compare the individual words again using character-level Levenshtein.
With that algorithm your result for (1) should be a match with distance 1, and no match for (2).
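Here is a minimal sketch of that two-level idea (no BK-tree, just plain dynamic programming; difflib stands in for the character-level Levenshtein, and the zero-filled first row plus the final min() let the query match anywhere inside the entry):

from difflib import SequenceMatcher

def words_match(a, b, threshold=0.8):
    # the character-level comparison decides whether two words count as "the same"
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def word_level_distance(query, entry):
    # Levenshtein distance where the units are whole words instead of characters
    q, e = query.split(), entry.split()
    previous = [0] * (len(e) + 1)  # zero first row: the match may start anywhere
    for i, qw in enumerate(q, 1):
        current = [i]
        for j, ew in enumerate(e, 1):
            cost = 0 if words_match(qw, ew) else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return min(previous)  # the match may also end anywhere

print(word_level_distance("Alabama University", "Levi Watkins Learning Center - Alabama State University"))  # 1
print(word_level_distance("Alabama University", "ETH Library"))  # 2, i.e. nothing matched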
You haven't really defined why you think option one is a "closer" match, at least not in any algorithmic sense. It seems like you're basing your expectations on the notion that option one has more matching keywords than option two, so why not just match based on the number of keywords in each string?
For example, using Ruby 2.0:
string1 = 'Levi Watkins Learning Center - Alabama State University'
string2 = 'ETH Library'
strings = [string1, string2]
keywords = 'Alabama University'.split
keycount = {}
# Count matching keywords in each string.
strings.each do |str|
  keyword_hits = Hash.new(0)
  keywords.each { |word| keyword_hits[word] += str.scan(/#{word}/).count }
  keyword_count = keyword_hits.values.reduce :+
  keycount[str] = keyword_count
end
# Sort by keyword count, and print results.
keycount.sort_by { |_str, count| count }.reverse.map { |e| pp "#{e.last}: #{e.first}" }
This will print:
"2: Levi Watkins Learning Center - Alabama State University"
"0: ETH Library"
which matches your expectations of the corpus. You might want to make additional passes on the results using other algorithms to refine the results or to break ties, but this should at least get you pointed in the right direction.
You can try to use the normalized Levenshtein distance:
Li Yujian, Liu Bo, "A Normalized Levenshtein Distance Metric," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091-1095, June 2007, doi:10.1109/TPAMI.2007.1078 http://www.computer.org/csdl/trans/tp/2007/06/i1091-abs.html
They propose to normalize the Levenshtein distance. With this normalization, a difference of one character between sequences of length two weighs more than the same difference between sequences of length 10.
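If I read the paper correctly, for unit edit costs their metric boils down to 2*d / (len(a) + len(b) + d); a small sketch, assuming the python-Levenshtein package for the raw distance:

import Levenshtein  # assumed: pip install python-Levenshtein

def normalized_levenshtein(a, b):
    # normalization along the lines of the paper: 0.0 = identical, 1.0 = nothing in common
    d = Levenshtein.distance(a, b)
    return 2 * d / (len(a) + len(b) + d) if (a or b) else 0.0

print(normalized_levenshtein("Alabama University", "Levi Watkins Learning Center - Alabama State University"))
print(normalized_levenshtein("Alabama University", "ETH Library"))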