Algorithms for “fuzzy matching” strings

Asked by 傲寒 on 2020-12-12 13:23

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all the characters of the search string, in order, but possibly with other characters in between.

6 Answers
  • 2020-12-12 13:36

    I've finally understood what you were looking for. The issue is interesting; however, looking at the two algorithms you found, it seems that people have widely different opinions about the specifications ;)

    I think it would be useful to state the problem and the requirements more clearly.

    Problem:

    We are looking for a way to speed up typing by allowing users to type only a few letters of the keyword they actually intended, and proposing a list from which they can select.

    1. It is expected that all the letters of the input be in the keyword
    2. It is expected that the letters in the input be in the same order in the keyword
    3. The list of keywords returned should be presented in a consistent (reproducible) order
    4. The algorithm should be case insensitive

    Analysis:

    The first two requirements can be summed up as follows: for the input axg we are looking for words matching the regular expression [^a]*a[^x]*x[^g]*g.*

    The third requirement is purposely loose. The order in which the words appear in the list needs to be consistent... however it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremely long, a scoring approach could be better; for a short list, however, it's easier for the eye to look for a particular item down a list sorted in an obvious manner.

    Also, alphabetical order has the advantage of consistency during typing: i.e. adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that no longer match.

    Nothing is specified about handling Unicode characters, for example whether à should be treated as similar to a or as another character altogether. Since I know of no language that currently uses such characters in its keywords, I'll let it slip for now.

    My solution:

    For any input, I would build the regular expression described earlier. This is easy to do in Python because the re module already features case-insensitive matching.

    I would then match it against my (alphabetically sorted) list of keywords and output the filtered result.

    In pseudo-code:

    import re

    WORDS = ['Bar', 'Foo', 'FooBar', 'Other']

    def GetList(input, words=WORDS):
      # Build the pattern '[^a]*a[^x]*x...' from the typed characters,
      # then keep only the words that match it (case-insensitively).
      expr = ''.join('[^' + re.escape(i) + ']*' + re.escape(i) for i in input)
      return [w for w in words if re.match(expr, w, re.IGNORECASE)]


    I could have used a one-liner but thought it would obscure the code ;)
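
    For example, with the sample list above:

    >>> GetList('fb')
    ['FooBar']
    >>> GetList('o')
    ['Foo', 'FooBar', 'Other']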

    This solution works very well in incremental situations (i.e. when you match as the user types and thus keep rebuilding the list), because when the user adds a character you can simply refilter the result you just computed. Thus:

    • Either there are few characters, in which case the matching is quick and the length of the list does not matter much
    • Or there are lots of characters, which means we are filtering an already short list, so it does not matter too much if the matching takes a bit longer per element
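
    To make the incremental case concrete, here is a minimal sketch (the refine helper is just an illustrative name) that refilters the previous result instead of the full word list on each keystroke:

    def refine(previous_matches, typed_so_far):
        # Re-run the same subsequence filter, but only over the words
        # that survived the previous keystroke.
        return GetList(typed_so_far, words=previous_matches)

    matches = WORDS
    for prefix in ('f', 'fo', 'foo'):   # simulate the user typing "foo"
        matches = refine(matches, prefix)
    print(matches)   # ['Foo', 'FooBar']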

    I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.

  • 2020-12-12 13:42

    If your text is predominantly English then you may try your hand at various phonetic (Soundex-style) algorithms:

    1. Classic Soundex
    2. Metaphone

    These algorithms will let you choose words which sound like each other and will be a good way to find misspelled words.
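
    As a rough illustration, here is a minimal classic Soundex sketch in Python (simplified: it skips the special 'H'/'W' separator rule); spellings that sound alike collapse to the same code:

    def soundex(word):
        # Digit groups for consonants; vowels, h, w and y get no digit.
        codes = {c: str(d) for d, letters in
                 enumerate(['bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r'], start=1)
                 for c in letters}
        word = word.lower()
        result = word[0].upper()
        prev = codes.get(word[0])
        for c in word[1:]:
            code = codes.get(c)
            if code and code != prev:
                result += code
            prev = code
        return (result + '000')[:4]

    print(soundex('Robert'), soundex('Rupert'))   # R163 R163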

  • 2020-12-12 13:43

    Levenshtein 'Edit Distance' algorithms will definitely work on what you're trying to do: they will give you a measurement of how closely two words (or addresses, phone numbers, psalms, monologues and scholarly articles) match each other, allowing you to rank the results and choose the best match.

    A more lightweight approach is to count up the common substrings: it's not as good as Levenshtein, but it provides usable results and runs quickly in slow languages which have access to fast 'InString' functions.

    I published an Excel 'Fuzzy Lookup' in Excellerando a few years ago, using a 'FuzzyMatchScore' function that is, as far as I can tell, exactly what you need:

    http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html

    It is, of course, in Visual Basic for Applications. Proceed with caution, crucifixes and garlic:

    Public Function SumOfCommonStrings( _
                                ByVal s1 As String, _
                                ByVal s2 As String, _
                                Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
                                Optional iScore As Integer = 0 _
                                    ) As Integer
    
    Application.Volatile False
    
    ' N.Heffernan 06 June 2006 
    ' THIS CODE IS IN THE PUBLIC DOMAIN
    
    
    ' Function to measure how much of String 1 is made up of substrings found in String 2
    
    ' This function uses a modified Longest Common String algorithm.
    ' Simple LCS algorithms are unduly sensitive to single-letter
    ' deletions/changes near the midpoint of the test words, eg:
    ' Wednesday is obviously closer to WedXesday on an edit-distance
    ' basis than it is to WednesXXX. So it would be better to score
    ' the 'Wed' as well as the 'esday' and add up the total matched
    
    ' Watch out for strings of differing lengths:
    '
    '    SumOfCommonStrings("Wednesday", "WednesXXXday")
    '
    ' This scores the same as:
    '
    '     SumOfCommonStrings("Wednesday", "Wednesday")
    '
    ' So make sure the calling function uses the length of the longest
    ' string when calculating the degree of similarity from this score.
    
    
    ' This is coded for clarity, not for performance.
    
    Dim arr() As Integer    ' Scoring matrix
    Dim n As Integer        ' length of s1
    Dim m As Integer        ' length of s2
    Dim i As Integer        ' start position in s1
    Dim j As Integer        ' start position in s2
    Dim subs1 As String     ' a substring of s1
    Dim len1 As Integer     ' length of subs1
    
    Dim sBefore1            ' documented in the code
    Dim sBefore2
    Dim sAfter1
    Dim sAfter2
    
    Dim s3 As String
    
    
    SumOfCommonStrings = iScore
    
    n = Len(s1)
    m = Len(s2)
    
    If s1 = s2 Then
        ' Preserve any score accumulated by the caller (recursive calls pass one in)
        SumOfCommonStrings = iScore + n
        Exit Function
    End If
    
    If n = 0 Or m = 0 Then
        Exit Function
    End If
    
    's1 should always be the shorter of the two strings:
    If n > m Then
        s3 = s2
        s2 = s1
        s1 = s3
        n = Len(s1)
        m = Len(s2)
    End If
    
    n = Len(s1)
    m = Len(s2)
    
    ' Special case: s1 is an exact substring of s2
    If InStr(1, s2, s1, Compare) Then
        SumOfCommonStrings = iScore + n
        Exit Function
    End If
    
    For len1 = n To 1 Step -1
    
        For i = 1 To n - len1 + 1
    
            subs1 = Mid(s1, i, len1)
            j = 0
            j = InStr(1, s2, subs1, Compare)
    
            If j > 0 Then
    
                ' We've found a matching substring...
                iScore = iScore + len1            
    
              ' Now clip out this substring from s1 and s2...
              ' And search the fragments before and after this excision:
    
    
                If i > 1 And j > 1 Then
                    sBefore1 = left(s1, i - 1)
                    sBefore2 = left(s2, j - 1)
                    iScore = SumOfCommonStrings(sBefore1, _
                                                sBefore2, _
                                                Compare, _
                                                iScore)
                End If
    
    
                If i + len1 < n And j + len1 < m Then
                    sAfter1 = right(s1, n + 1 - i - len1)
                    sAfter2 = right(s2, m + 1 - j - len1)
                    iScore = SumOfCommonStrings(sAfter1, _
                                                sAfter2, _
                                                Compare, _
                                                iScore)
                End If
    
    
                SumOfCommonStrings = iScore
                Exit Function
    
            End If
    
        Next
    
    
    Next
    
    
    End Function
    
    
    Private Function Minimum(ByVal a As Integer, _
                             ByVal b As Integer, _
                             ByVal c As Integer) As Integer
    Dim min As Integer
    
      min = a
    
      If b < min Then
            min = b
      End If
    
      If c < min Then
            min = c
      End If
    
      Minimum = min
    
    End Function
    
    
  • 2020-12-12 13:50

    I recently had to solve the same problem. My solution involves scoring strings with consecutively matched letters highly and excluding strings that don't contain the typed letters in order.

    I've documented the algorithm in detail here: http://blog.kazade.co.uk/2014/10/a-fuzzy-filename-matching-algorithm.html
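
    A simplified sketch of the general idea (the scoring constants and file names below are just illustrative): require the typed letters to appear in order, and award a bonus whenever a match immediately follows the previous one:

    def fuzzy_score(query, candidate):
        # Returns None if the typed letters do not appear in order,
        # otherwise a score in which consecutive matches count extra.
        score, prev, pos = 0, -2, 0
        hay = candidate.lower()
        for ch in query.lower():
            pos = hay.find(ch, pos)
            if pos == -1:
                return None                        # letters not in order: exclude
            score += 3 if pos == prev + 1 else 1   # bonus for consecutive letters
            prev, pos = pos, pos + 1
        return score

    candidates = ["model.py", "mouse_handler.py", "readme.md"]
    scored = [(c, fuzzy_score("mod", c)) for c in candidates]
    ranked = sorted((t for t in scored if t[1] is not None), key=lambda t: -t[1])
    print(ranked)   # [('model.py', 7), ('mouse_handler.py', 5)]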

  • 2020-12-12 13:57

    Two algorithms I've found so far:

    1. LiquidMetal
    2. Better Ido Flex-Matching
  • 2020-12-12 13:58

    I'm actually building something similar to Vim's Command-T and ctrlp plugins for Emacs, just for fun. I've just had a productive discussion with some clever workmates about ways to do this most efficiently. The goal is to reduce the number of operations needed to eliminate files that don't match.

    So we create a nested map, where at the top level each key is a character that appears somewhere in the search set, mapping to the indices of the strings that contain that character. Each of those indices then maps to a list of character offsets at which that character appears in that particular string.

    In pseudo code, for the strings:

    • controller
    • model
    • view

    We'd build a map like this:

    {
      "c" => {
               0 => [0]
             },
      "o" => {
               0 => [1, 5],
               1 => [1]
             },
      "n" => {
               0 => [2]
             },
      "t" => {
               0 => [3]
             },
      "r" => {
               0 => [4, 9]
             },
      "l" => {
               0 => [6, 7],
               1 => [4]
             },
      "e" => {
               0 => [8],
               1 => [3],
               2 => [2]
             },
      "m" => {
               1 => [0]
             },
      "d" => {
               1 => [2]
             },
      "v" => {
               2 => [0]
             },
      "i" => {
               2 => [1]
             },
      "w" => {
               2 => [3]
             }
    }
    

    So now you have a mapping like this:

    {
      character-1 => {
        word-index-1 => [occurrence-1, occurrence-2, occurrence-n, ...],
        word-index-n => [ ... ],
        ...
      },
      character-n => {
        ...
      },
      ...
    }
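
    Building that index is straightforward; here is a small Python sketch (the build_index name and the dict-of-dicts representation are just one way to do it):

    from collections import defaultdict

    def build_index(strings):
        # index[char][word_index] -> offsets at which char occurs in that word
        index = defaultdict(lambda: defaultdict(list))
        for word_index, s in enumerate(strings):
            for offset, ch in enumerate(s.lower()):
                index[ch][word_index].append(offset)
        return index

    index = build_index(["controller", "model", "view"])
    print(dict(index["o"]))   # {0: [1, 5], 1: [1]}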
    

    Now searching for the string "oe":

    1. Initialize a new map where the keys will be the indices of strings that match, and the values the offset read through that string so far.
    2. Consume the first char from the search string "o" and look it up in the lookup table.
    3. Since strings at indices 0 and 1 match the "o", put them into the map {0 => 1, 1 => 1}.
    4. Now consume the next character of the input string, "e", and look it up in the table.
    5. Here 3 strings match, but we know that we only care about strings 0 and 1.
    6. Check whether each of those strings has an offset for "e" greater than its current offset. If not, eliminate the item from our map; otherwise update the offset: {0 => 8, 1 => 3}.

    Now by looking at the keys in our map that we've accumulated, we know which strings matched the fuzzy search.
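
    Steps 1-6 translate fairly directly into code; a rough sketch, continuing from the build_index example above:

    def fuzzy_search(index, query):
        # survivors: word_index -> offset of the most recently matched character
        survivors = None
        for ch in query.lower():
            occurrences = index.get(ch, {})
            candidates = occurrences if survivors is None else survivors
            next_survivors = {}
            for word_index in candidates:
                prev = -1 if survivors is None else survivors[word_index]
                # Keep the word only if this character occurs after the previous match
                later = [o for o in occurrences.get(word_index, []) if o > prev]
                if later:
                    next_survivors[word_index] = later[0]
            survivors = next_survivors
        return survivors

    print(fuzzy_search(index, "oe"))   # {0: 8, 1: 3}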

    Ideally, if the search is being performed as the user types, you'll keep track of the accumulated hash of results and pass it back into your search function. I think this will be a lot faster than iterating all search strings and performing a full wildcard search on each one.

    The interesting thing about this is that you could also efficiently store the Levenshtein distance along with each match, assuming you only care about insertions, not substitutions or deletions. Though perhaps it's not hard to add that logic too.
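
    As a side note on that last point: with insertions only, the distance for any candidate that survives the in-order filter reduces to a length difference, for example:

    # Insertion-only edit distance: the query must be a subsequence of the candidate,
    # and the distance is then just the number of extra characters.
    distance = len("controller") - len("oe")   # 8 insertions turn "oe" into "controller"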
