Alternative to Levenshtein and Trigram

春和景丽 2021-02-07 09:48

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My sof

6 Answers
  • 2021-02-07 10:18

    You could try the Word Mover's Distance (https://github.com/mkusner/wmd) instead. One brilliant advantage of this algorithm is that it incorporates the implied meanings of words while computing the differences between documents. The underlying paper is "From Word Embeddings To Document Distances" (Kusner et al., ICML 2015).
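
    For illustration, a minimal sketch of WMD using gensim (my own addition, not part of the original answer: the model file name is a placeholder, the query "Alabama University" is only a guess at the truncated question, and depending on the gensim version wmdistance needs an extra optimal-transport dependency such as pyemd or POT):

    from gensim.models import KeyedVectors

    # load pretrained word embeddings (placeholder path; any word2vec-format model works)
    model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    query = "Alabama University".lower().split()
    doc1 = "Levi Watkins Learning Center - Alabama State University".lower().split()
    doc2 = "ETH Library".lower().split()

    # lower distance = closer in meaning; doc1 should come out closer than doc2
    print(model.wmdistance(query, doc1))
    print(model.wmdistance(query, doc2))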

  • 2021-02-07 10:18

    First, your distance score needs to be adjusted based on the length of the database entry and/or input. A distance of 5 against an expression of 10 characters is much worse than a distance of 5 against an expression of 100 characters.

    But the main problem with your approach is that plain Levenshtein is not a substring matching algorithm. It compares all of one string with all of another string. Your big distance in case (1) is due to the large number of words in the database expression that are not in the input expression.

    To get around that you are better off using an algorithm that can match substrings such as Fuzzy Bitap or Smith–Waterman.

    If you have to use Levenshtein or similar you probably want to use it to compare words to words and then generate some score based on the number of matching words and the quality of the matches.
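
    A rough sketch of that word-by-word idea in Python (my own illustration, not from this answer: difflib's SequenceMatcher from the standard library stands in for Levenshtein as the per-word similarity, the function name word_score is made up, and the query "Alabama University" is only a guess at the truncated question):

    from difflib import SequenceMatcher

    def word_score(query, candidate):
        # for every query word, take the best similarity against any candidate word,
        # then average those best scores (1.0 means every query word is found almost verbatim)
        q_words = query.lower().split()
        c_words = candidate.lower().split()
        if not q_words or not c_words:
            return 0.0
        best = [max(SequenceMatcher(None, q, c).ratio() for c in c_words) for q in q_words]
        return sum(best) / len(best)

    print(word_score("Alabama University",
                     "Levi Watkins Learning Center - Alabama State University"))  # 1.0
    print(word_score("Alabama University", "ETH Library"))                        # much lower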

  • 2021-02-07 10:21

    I guess the answer is not required anymore, but I liked the question and it got me thinking about how to combine the advantages of regular expressions with the Levenshtein string metric while being less dependent on the raw distance.

    So far I have come up with a parser that follows these premises and this logic:

    • It uses Python 3 and the regex module (the OP didn't mention any language/module requirements)
    • Any needle that is searched for is stripped of its punctuation characters
    • Every haystack is also stripped of its punctuation characters - so N.A.S.A becomes NASA, just as in the needle if it was originally N.A.S.A. - I know this can be problematic in quite a few scenarios, but given the premises I couldn't come up with a better solution.
    • Every word within the needle that is not at least 3 characters long is removed (no need for is, on, at, no, etc.)
    • Matching is case-insensitive
    • The needle is split into word groups of n items each; n comes from a dict whose keys k (0 < k <= l, where l is the needle's word count) act as thresholds, and the entry with the largest key not exceeding l is used
    • The words within a word group must follow each other, with at most maximum_word_distance (default 2) other words allowed between them
    • Each word, depending on its length n, can have a different allowed error threshold: the total errors, substitutions, insertions and deletions can each be specified, again via a dict with threshold keys where 0 < k <= n
    • Both of the aforementioned dicts map keys to lambdas, which lets their first/last entries compute their values from the length

    Online demo here

    contextual_fuzzy_matcher.py:

    from collections import OrderedDict
    import regex
    
    
    class ContextualFuzzyMatcher(object):
        # maximum number of other words allowed between two consecutive matched words
        maximum_word_distance = 2
        word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)
        # punctuation characters stripped from both needle and haystack
        punctuation = regex.compile(r"[\u2000-\u206F\u2E00-\u2E7F\\'!\"#$%&\(\)\*\+,\-\.\/:;<=>\?@\[\]\^_`\{\|\}~]")
        # word-group size keyed by thresholds on the needle's word count
        groups = OrderedDict((
            (0, lambda l: l),
            (4, lambda l: 3),
            (8, lambda l: 6),
            (10, lambda l: l * 3 // 4),  # roughly three quarters of the word count
        ))
        # per-word error tolerances keyed by thresholds on the word length:
        # e = total errors, s = substitutions, i = insertions, d = deletions
        tolerances = OrderedDict((
            (0, {
                'e': lambda l: 0,
                's': lambda l: 0,
                'i': lambda l: 0,
                'd': lambda l: 0,
            }),
            (3, {
                'e': lambda l: 1,
                's': lambda l: 1,
                'i': lambda l: 1,
                'd': lambda l: 1,
            }),
            (6, {
                'e': lambda l: 2,
                's': lambda l: 1,
                'i': lambda l: 1,
                'd': lambda l: 1,
            }),
            (9, {
                'e': lambda l: 3,
                's': lambda l: 2,
                'i': lambda l: 2,
                'd': lambda l: 2,
            }),
            (12, {
                'e': lambda l: l // 4,
                's': lambda l: l // 6,
                'i': lambda l: l // 6,
                'd': lambda l: l // 6,
            }),
        ))
    
        def __init__(self, needle):
            self.sentence = needle
            self.words = self.sentence_to_words(self.sentence)
            self.words_len = len(self.words)
            self.group_size = self.get_group_size()
            self.word_groups = self.get_word_groups()
            self.regexp = self.get_regexp()
    
        def sentence_to_words(self, sentence):
            # strip punctuation, collapse whitespace, and drop words shorter than 3 characters
            sentence = regex.sub(self.punctuation, "", sentence)
            sentence = regex.sub(" +", " ", sentence)
            return [word for word in sentence.split(' ') if len(word) > 2]
    
        def get_group_size(self):
            # use the groups entry with the largest key <= the word count, applied to the word count
            return list(value for key, value in self.groups.items() if self.words_len >= key)[-1](self.words_len)
    
        def get_word_groups(self):
            # sliding window of group_size consecutive needle words
            return [self.words[i:i + self.group_size] for i in range(self.words_len - self.group_size + 1)]
    
        def get_tolerance(self, word_len):
            # use the tolerances entry with the largest key <= the word length
            return list(value for key, value in self.tolerances.items() if word_len >= key)[-1]
    
        def get_regexp(self):
            # build one fuzzy alternative per word group; each word gets its own error budget
            combinations = []
            for word_group in self.word_groups:
                distants = []
                for word in word_group:
                    word_len = len(word)
                    tolerance = self.get_tolerance(word_len)
                    distants.append(r"({}){{e<={},s<={},i<={},d<={}}}".format(
                        word,
                        tolerance['e'](word_len),
                        tolerance['s'](word_len),
                        tolerance['i'](word_len),
                        tolerance['d'](word_len),
                    ))
                combinations.append(
                    self.word_distance.join(distants)
                )
            return regex.compile(
                r"|".join(combinations),
                regex.MULTILINE | regex.IGNORECASE
            )
    
        def findall(self, haystack):
            return self.regexp.findall(haystack)
    

    main.py:

    test_sentences = [
        'Levi Watkins Learning Center - Alabama State University',
        'ETH Library'
    ]
    test_texts = [
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sapien eget mi proin sed libero enim sed. Nec tincidunt praesent semper feugiat nibh sed pulvinar. Habitasse platea dictumst quisque sagittis. Tortor condimentum lacinia quis vel eros donec ac odio. Platea dictumst vestibulum rhoncus est pellentesque elit ullamcorper dignissim. Ultricies tristique nulla aliquet enim tortor at. Mi proin sed libero enim sed faucibus. Fames ac turpis egestas integer eget aliquet nibh. Potenti nullam ac tortor vitae purus faucibus ornare suspendisse. Cras semper auctor neque vitae tempus quam pellentesque nec. Quam lacus suspendisse faucibus interdum posuere. Neque laoreet suspendisse interdum consectetur libero id faucibus nisl tincidunt. Viverra tellus in hac habitasse. Nibh nisl condimentum id venenatis a condimentum vitae. Tincidunt dui ut ornare lectus."
        "Mattis aliquam faucibus purus in massa tempor nec feugiat nisl. Amet consectetur adipiscing elit ut aliquam purus. Turpis massa tincidunt dui ut ornare. Suscipit tellus mauris a diam maecenas sed enim ut sem. Id consectetur purus ut faucibus pulvinar elementum. Est velit egestas dui id. Felis imperdiet proin fermentum leo. Faucibus nisl tincidunt eget nullam non nisi est sit. Elit pellentesque habitant morbi tristique. Nisi lacus sed viverra tellus. Morbi tristique senectus et netus et malesuada fames. Id diam vel quam elementum pulvinar. Id nibh tortor id aliquet lectus. Sem integer vitae justo eget magna. Quisque sagittis purus sit amet volutpat consequat. Auctor elit sed vulputate mi sit amet. Venenatis lectus magna fringilla urna porttitor rhoncus dolor purus. Adipiscing diam donec adipiscing tristique risus nec feugiat in fermentum. Bibendum est ultricies integer quis."
        "Interdum posuere lorem ipsum dolor sit. Convallis convallis tellus id interdum velit. Sollicitudin aliquam ultrices sagittis orci a scelerisque purus. Vel quam elementum pulvinar etiam. Adipiscing bibendum est ultricies integer quis. Tellus molestie nunc non blandit. Sit amet porttitor eget dolor morbi non arcu. Scelerisque purus semper eget duis at tellus. Diam maecenas sed enim ut sem viverra. Vulputate odio ut enim blandit volutpat maecenas. Faucibus purus in massa tempor nec. Bibendum ut tristique et egestas quis ipsum suspendisse. Ut aliquam purus sit amet luctus venenatis lectus magna. Ac placerat vestibulum lectus mauris ultrices eros in cursus turpis. Feugiat pretium nibh ipsum consequat nisl vel pretium. Elit pellentesque habitant morbi tristique senectus et.",
        "Found at ETH's own Library", # ' will be a problem - it adds one extra deletion
        "State University of Alabama has a learning center called Levi Watkins",
        "The ETH library is not to be confused with Alabama State university's Levi Watkins Learning center",
        "ETH Library",
        "Alabma State Unversity",
        "Levi Wtkins Learning"
    ]
    
    
    for test_sentence in test_sentences:
        parser = ContextualFuzzyMatcher(test_sentence)
        for test_text in test_texts:
            for match in parser.findall(test_text):
                print(match)
    

    returns:

    ('', '', '', '', '', '', '', '', '', '', '', '', ' Alabama', 'State', 'university')
    (' Levi', 'Watkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
    ('', '', '', '', '', '', '', '', '', '', '', '', 'Alabma', 'State', 'Unversity')
    ('Levi', 'Wtkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
    (' ETH', 'library')
    ('ETH', 'Library')
    

    I am fully aware that this is far from a perfect solution and that my examples were few and not really representative - but maybe by tweaking the configuration and doing a lot of real-world tests, it can cover quite a lot of cases without generating too many false positives. Also, since it is class-based, it can be inherited and configured differently for different sources - maybe in scientific texts a maximum word distance of 1 is sufficient, in newspaper articles 3 might be needed, and so on.

  • 2021-02-07 10:25

    You should change your approach:

    The Levenshtein distance is good at calculating similarities between units, whether those units are characters or words.

    Conceptually, you want to treat Alabama and University (2 words) as 2 units and measure the distance between them at the word level, in which case the Levenshtein distance should tell you how many words lie between Alabama and University - which should be 1.

    But you are applying a Levenshtein implementation that works on the characters within a word. Such an implementation only works for matching single words, not sentences.

    It is better to implement your own Levenshtein algorithm over words at the top level (e.g. using a BK-tree for indexing), and within each match, compare the individual words again using character-level Levenshtein.

    Your result for (1) should then be a match with distance 1, and no match for (2).
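
    A minimal sketch of that two-level idea in Python (my own illustration, not this answer's code: it skips the BK-tree indexing, the function names are made up, and the query "Alabama University" is only a guess at the truncated question, chosen to match this answer's distance-1 example; two words count as equal when their character-level distance stays within a small budget):

    def char_lev(a, b):
        # classic character-level Levenshtein distance with unit costs
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def word_search_distance(needle, haystack, max_word_errors=1):
        # approximate word-level search: how many word edits are needed to find the
        # needle somewhere inside the haystack (haystack prefix/suffix are free)
        n_words = needle.lower().split()
        h_words = haystack.lower().split()
        prev = [0] * (len(h_words) + 1)  # a match may start at any haystack position
        for i, nw in enumerate(n_words, 1):
            cur = [i]
            for j, hw in enumerate(h_words, 1):
                same = char_lev(nw, hw) <= max_word_errors
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (0 if same else 1)))
            prev = cur
        return min(prev)  # best score over all end positions

    print(word_search_distance("Alabama University",
                               "Levi Watkins Learning Center - Alabama State University"))  # 1
    print(word_search_distance("Alabama University", "ETH Library"))                        # 2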

  • 2021-02-07 10:32

    Keyword Counting

    You haven't really defined why you think option one is a "closer" match, at least not in any algorithmic sense. It seems like you're basing your expectations on the notion that option one has more matching keywords than option two, so why not just match based on the number of keywords in each string?

    For example, using Ruby 2.0:

    string1 = 'Levi Watkins Learning Center - Alabama State University'
    string2 = 'ETH Library'
    strings = [string1, string2]
    
    keywords  = 'Alabama University'.split
    keycount  = {}
    
    # Count matching keywords in each string.
    strings.each do |str|
      keyword_hits  = Hash.new(0)
      keywords.each { |word| keyword_hits[word] += str.scan(/#{word}/).count }
      keyword_count = keyword_hits.values.reduce :+
      keycount[str] =  keyword_count
    end
    
    # Sort by keyword count (descending), and print results.
    keycount.sort_by { |_str, count| count }.reverse.each { |str, count| pp "#{count}: #{str}" }
    

    This will print:

    "2: Levi Watkins Learning Center - Alabama State University"
    "0: ETH Library"

    which matches your expectations of the corpus. You might want to make additional passes on the results using other algorithms to refine the results or to break ties, but this should at least get you pointed in the right direction.

  • 2021-02-07 10:37

    You can try to use a normalized Levenshtein distance:

    Li Yujian, Liu Bo, "A Normalized Levenshtein Distance Metric," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091-1095, June 2007, doi:10.1109/TPAMI.2007.1078 http://www.computer.org/csdl/trans/tp/2007/06/i1091-abs.html

    They propose a normalization of the Levenshtein distance. With it, a difference of one character between sequences of length 2 weighs more than the same difference between sequences of length 10.
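
    A small sketch of that effect (my own illustration, not from this answer: it assumes uniform unit edit costs, under which the paper's metric reduces to 2*d/(|a| + |b| + d) for plain Levenshtein distance d, and it uses the third-party python-Levenshtein package for the raw distance):

    import Levenshtein  # third-party package: pip install python-Levenshtein

    def normalized_levenshtein(a, b):
        # normalized distance with unit costs: 2*d / (|a| + |b| + d), result in [0, 1]
        if not a and not b:
            return 0.0
        d = Levenshtein.distance(a, b)
        return 2 * d / (len(a) + len(b) + d)

    # one differing character weighs more on short strings than on long ones
    print(normalized_levenshtein("ab", "ac"))                  # 0.4
    print(normalized_levenshtein("abcdefghij", "abcdefghiX"))  # ~0.095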
