Alternative to Levenshtein and Trigram

春和景丽 2021-02-07 09:48

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My sof

6 Answers
  •  孤独总比滥情好
    2021-02-07 10:21

    I guess the answer is not required anymore, but I liked the question and it got me thinking about how to combine the advantages of regular expressions and the Levenshtein string metric while being less dependent on the distance alone.

    So far I have come up with a parser that follows these premises and this logic:

    • It uses Python 3 and the regex module (the OP didn't mention any language/module requirements)
    • Any needle that is searched for is stripped of its punctuation characters
    • Every haystack is also stripped of its punctuation characters - so N.A.S.A becomes NASA, just like in the needle if it was originally N.A.S.A. - I know this can be problematic in quite a few scenarios, but given the premises I couldn't come up with a better solution.
    • Every word within the needle that is not at least 3 characters long is removed (no need for is, on, at, no, etc.)
    • Matching is case-insensitive
    • The needle is split into word groups of n items each; n is defined in a dict whose keys k satisfy 0 < k <= l, where l is the needle's word count
    • The words within a word group must follow each other, with at most maximum_word_distance other words between them
    • Each word, depending on its length n, can have a different allowed error threshold: the total errors, substitutions, insertions and deletions can be specified, again via a dict with keys k where 0 < k <= n (the fuzzy syntax this relies on is sketched right after this list)
    • Both of these dicts map keys to lambdas, so the applicable entry can compute its value from the actual length
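
    As a quick illustration of the fuzzy syntax the parser relies on - a minimal sketch, not part of the original answer, using the regex module's documented approximate-matching counters:

    import regex

    # One error of any kind is allowed; 'Wtkins' matches despite the missing 'a'.
    print(regex.findall(r"(Watkins){e<=1}", "Levi Wtkins Learning"))  # ['Wtkins']

    # Constraints can be combined - up to two errors overall, but at most one of
    # each type - which mirrors what the parser below generates per word. Note the
    # space before the word may be swallowed as the allowed insertion, as in the
    # outputs further below.
    print(regex.findall(r"(Library){e<=2,s<=1,i<=1,d<=1}", "Found at the ETH Librery"))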

    Online demo here

    contextual_fuzzy_matcher.py:

    from collections import OrderedDict
    import regex
    
    
    class ContextualFuzzyMatcher(object):
        # How many unrelated words may appear between two matched words.
        maximum_word_distance = 2
        # Regex fragment for the gap between two words of a group.
        word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)
        # Punctuation to strip from both needle and haystack.
        punctuation = regex.compile(r"[\u2000-\u206F\u2E00-\u2E7F\\'!\"#$%&\(\)\*\+,\-\.\/:;<=>\?@\[\]\^_`\{\|\}~]")
        # Word-group size by needle length: the last key that is <= the number
        # of words in the needle wins, and its lambda computes the group size.
        groups = OrderedDict((
            (0, lambda l: l),
            (4, lambda l: 3),
            (8, lambda l: 6),
            (10, lambda l: int(l * 0.75)),  # three quarters of the words for long needles
        ))
        # Allowed fuzzy errors by word length: 'e' = total errors,
        # 's' = substitutions, 'i' = insertions, 'd' = deletions.
        tolerances = OrderedDict((
            (0, {
                'e': lambda l: 0,
                's': lambda l: 0,
                'i': lambda l: 0,
                'd': lambda l: 0,
            }),
            (3, {
                'e': lambda l: 1,
                's': lambda l: 1,
                'i': lambda l: 1,
                'd': lambda l: 1,
            }),
            (6, {
                'e': lambda l: 2,
                's': lambda l: 1,
                'i': lambda l: 1,
                'd': lambda l: 1,
            }),
            (9, {
                'e': lambda l: 3,
                's': lambda l: 2,
                'i': lambda l: 2,
                'd': lambda l: 2,
            }),
            (12, {
                'e': lambda l: l // 4,
                's': lambda l: l // 6,
                'i': lambda l: l // 6,
                'd': lambda l: l // 6,
            }),
        ))
    
        def __init__(self, needle):
            self.sentence = needle
            self.words = self.sentence_to_words(self.sentence)
            self.words_len = len(self.words)
            self.group_size = self.get_group_size()
            self.word_groups = self.get_word_groups()
            self.regexp = self.get_regexp()
    
        def sentence_to_words(self, sentence):
            # Strip punctuation, collapse repeated spaces, and drop words
            # shorter than 3 characters.
            sentence = regex.sub(self.punctuation, "", sentence)
            sentence = regex.sub(" +", " ", sentence)
            return [word for word in sentence.split(' ') if len(word) > 2]
    
        def get_group_size(self):
            # Last matching threshold's lambda, applied to the word count.
            return list(value for key, value in self.groups.items() if self.words_len >= key)[-1](self.words_len)
    
        def get_word_groups(self):
            # Sliding window of group_size consecutive words over the needle.
            return [self.words[i:i + self.group_size] for i in range(self.words_len - self.group_size + 1)]
    
        def get_tolerance(self, word_len):
            # Last tolerance entry whose threshold the word length reaches.
            return list(value for key, value in self.tolerances.items() if word_len >= key)[-1]
    
        def get_regexp(self):
            # Build one alternative per word group: each word becomes a fuzzy
            # group with its own error budget, joined by the gap fragment.
            combinations = []
            for word_group in self.word_groups:
                distants = []
                for word in word_group:
                    word_len = len(word)
                    tolerance = self.get_tolerance(word_len)
                    distants.append(r"({}){{e<={},s<={},i<={},d<={}}}".format(
                        word,
                        tolerance['e'](word_len),
                        tolerance['s'](word_len),
                        tolerance['i'](word_len),
                        tolerance['d'](word_len),
                    ))
                combinations.append(
                    self.word_distance.join(distants)
                )
            return regex.compile(
                r"|".join(combinations),
                regex.MULTILINE | regex.IGNORECASE
            )
    
        def findall(self, haystack):
            return self.regexp.findall(haystack)
    

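    Before wiring it into the full test, the building blocks can be inspected - a small sanity check, not part of the original answer, run with the class above in scope:

    matcher = ContextualFuzzyMatcher('ETH Library')
    print(matcher.word_groups)     # [['ETH', 'Library']]
    print(matcher.regexp.pattern)  # one fuzzy group per word, joined by the gap fragment
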
    main.py:

    test_sentences = [
        'Levi Watkins Learning Center - Alabama State University',
        'ETH Library'
    ]
    test_texts = [
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sapien eget mi proin sed libero enim sed. Nec tincidunt praesent semper feugiat nibh sed pulvinar. Habitasse platea dictumst quisque sagittis. Tortor condimentum lacinia quis vel eros donec ac odio. Platea dictumst vestibulum rhoncus est pellentesque elit ullamcorper dignissim. Ultricies tristique nulla aliquet enim tortor at. Mi proin sed libero enim sed faucibus. Fames ac turpis egestas integer eget aliquet nibh. Potenti nullam ac tortor vitae purus faucibus ornare suspendisse. Cras semper auctor neque vitae tempus quam pellentesque nec. Quam lacus suspendisse faucibus interdum posuere. Neque laoreet suspendisse interdum consectetur libero id faucibus nisl tincidunt. Viverra tellus in hac habitasse. Nibh nisl condimentum id venenatis a condimentum vitae. Tincidunt dui ut ornare lectus."
        "Mattis aliquam faucibus purus in massa tempor nec feugiat nisl. Amet consectetur adipiscing elit ut aliquam purus. Turpis massa tincidunt dui ut ornare. Suscipit tellus mauris a diam maecenas sed enim ut sem. Id consectetur purus ut faucibus pulvinar elementum. Est velit egestas dui id. Felis imperdiet proin fermentum leo. Faucibus nisl tincidunt eget nullam non nisi est sit. Elit pellentesque habitant morbi tristique. Nisi lacus sed viverra tellus. Morbi tristique senectus et netus et malesuada fames. Id diam vel quam elementum pulvinar. Id nibh tortor id aliquet lectus. Sem integer vitae justo eget magna. Quisque sagittis purus sit amet volutpat consequat. Auctor elit sed vulputate mi sit amet. Venenatis lectus magna fringilla urna porttitor rhoncus dolor purus. Adipiscing diam donec adipiscing tristique risus nec feugiat in fermentum. Bibendum est ultricies integer quis."
        "Interdum posuere lorem ipsum dolor sit. Convallis convallis tellus id interdum velit. Sollicitudin aliquam ultrices sagittis orci a scelerisque purus. Vel quam elementum pulvinar etiam. Adipiscing bibendum est ultricies integer quis. Tellus molestie nunc non blandit. Sit amet porttitor eget dolor morbi non arcu. Scelerisque purus semper eget duis at tellus. Diam maecenas sed enim ut sem viverra. Vulputate odio ut enim blandit volutpat maecenas. Faucibus purus in massa tempor nec. Bibendum ut tristique et egestas quis ipsum suspendisse. Ut aliquam purus sit amet luctus venenatis lectus magna. Ac placerat vestibulum lectus mauris ultrices eros in cursus turpis. Feugiat pretium nibh ipsum consequat nisl vel pretium. Elit pellentesque habitant morbi tristique senectus et.",
        "Found at ETH's own Library", # ' will be a problem - it adds one extra deletion
        "State University of Alabama has a learning center called Levi Watkins",
        "The ETH library is not to be confused with Alabama State university's Levi Watkins Learning center",
        "ETH Library",
        "Alabma State Unversity",
        "Levi Wtkins Learning"
    ]
    
    
    for test_sentence in test_sentences:
        parser = ContextualFuzzyMatcher(test_sentence)
        for test_text in test_texts:
            for match in parser.findall(test_text):
                print(match)
    

    returns (findall yields one tuple per match, with one slot per capture group across all alternatives; groups belonging to non-matching alternatives stay empty):

    ('', '', '', '', '', '', '', '', '', '', '', '', ' Alabama', 'State', 'university')
    (' Levi', 'Watkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
    ('', '', '', '', '', '', '', '', '', '', '', '', 'Alabma', 'State', 'Unversity')
    ('Levi', 'Wtkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
    (' ETH', 'library')
    ('ETH', 'Library')
    
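    To see how many errors a given match actually needed, the regex module exposes fuzzy_counts on match objects - again a small illustrative addition, not in the original answer:

    import regex

    m = regex.search(r"(Unversity){e<=2}", "Alabama State University")
    print(m.fuzzy_counts)  # (substitutions, insertions, deletions) used by this match
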

    I am fully aware that this is far from a perfect solution and that my examples were few and not really representative - but by tweaking the configuration and doing a lot of real-world tests, it may be able to cover quite a lot of cases without generating too many false positives. And since it is class-based, it can be inherited and configured differently for different sources: maybe in scientific texts a maximum word distance of 1 is sufficient, while newspaper articles might need 3, and so on.
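
    For example, a hypothetical subclass for newspaper sources could look like this (the values are made up for illustration; note that word_distance has to be redefined as well, because it is computed once at class-definition time):

    class NewspaperFuzzyMatcher(ContextualFuzzyMatcher):
        # Newspaper prose tends to put more filler words between the parts
        # of a name, so allow a larger gap (hypothetical value).
        maximum_word_distance = 3
        word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)

    parser = NewspaperFuzzyMatcher('Levi Watkins Learning Center')
    print(parser.findall('Levi Watkins opened a new Learning Center'))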
