String matching in VBA using a predefined function

Question

I have the following data that I want to match. After going through several techniques, the most favorable seems to be the Levenshtein distance method. Would you agree with this approach based on the data below, or would you recommend another method that would match the following better at high volumes?

The example of the data can be seen below:

**Column1**                                               **Column2**
Modra Digest (DC)                                     Oldstewart2
South West Local /Sunday Times (new)                  Oldstewart
OldStewart political print                            Saigon Last month Saigon Last month
Oldstewart2 Local print (Former)                      Modra Digest Velehrad Digest (DC) (used via Bembek) 
Saigon Last month                                     South West Local South West Local /Sunday Times

data input

Should I go ahead with the Levenshtein distance method (defined as a VBA function, Levenshtein3 below, whose result is converted into a percentage), I would like to tweak how the function is applied and have it run as a macro. The columns I am matching (A and B) have different numbers of entries that differ in structure (i.e. even when sorted alphabetically, the matched items won’t be next to one another). Would there be any way to do the following?

1. Temporarily remove all brackets and their content from both compared strings, and TRIM the strings to remove leading and trailing spaces from what remains.
2. Remove any duplicated words from each string.
3. Find the Levenshtein match between A and B based on the highest percentage of similarity (according to the defined function) and reorder columns A and B so that the “best matching items” are placed next to one another.
4. Show the original form (i.e. including brackets and word duplicates) in the structure described in point 3, but with no recalculation involved (i.e. the matching percentage should be preserved from the TRIMmed, de-duplicated and bracket-free form described in points 1 and 2).
5. Sort the list of records from most matching to least matching.
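The bracket-removal and de-duplication steps above could be sketched roughly as follows. This is only a minimal sketch: the helper names (`CollapseSpaces`, `StripBrackets`, `RemoveDuplicateWords`, `NormalizeForMatch`) are hypothetical, brackets are assumed to be round and not nested, words are assumed to be separated by spaces, and the late-bound `Scripting.Dictionary` assumes Excel on Windows.

```vba
Option Explicit

' Hypothetical helper: trims the string and collapses runs of spaces to one space.
Function CollapseSpaces(ByVal s As String) As String
    s = Trim$(s)
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop
    CollapseSpaces = s
End Function

' Hypothetical helper: removes every "(...)" group (assumes no nesting).
Function StripBrackets(ByVal s As String) As String
    Dim openPos As Long, closePos As Long
    openPos = InStr(1, s, "(")
    Do While openPos > 0
        closePos = InStr(openPos, s, ")")
        If closePos = 0 Then Exit Do          ' unmatched "(": leave the rest untouched
        s = Left$(s, openPos - 1) & " " & Mid$(s, closePos + 1)
        openPos = InStr(1, s, "(")
    Loop
    StripBrackets = CollapseSpaces(s)
End Function

' Hypothetical helper: keeps the first occurrence of each word, ignoring case.
Function RemoveDuplicateWords(ByVal s As String) As String
    Dim words() As String, seen As Object, result As String
    Dim i As Long, wordKey As String
    Set seen = CreateObject("Scripting.Dictionary")   ' late-bound; assumes Windows
    s = CollapseSpaces(s)
    If Len(s) = 0 Then Exit Function
    words = Split(s, " ")
    For i = LBound(words) To UBound(words)
        wordKey = LCase$(words(i))
        If Not seen.Exists(wordKey) Then
            seen.Add wordKey, True
            result = result & " " & words(i)
        End If
    Next i
    RemoveDuplicateWords = Mid$(result, 2)            ' drop the leading space
End Function

' Bracket-free, trimmed and de-duplicated form used only for scoring.
Function NormalizeForMatch(ByVal s As String) As String
    NormalizeForMatch = RemoveDuplicateWords(StripBrackets(s))
End Function
```

With this, `NormalizeForMatch("Modra Digest Velehrad Digest (DC) (used via Bembek)")` gives `Modra Digest Velehrad`. The similarity could then be computed on the normalized strings while the original cell text is what gets written to the output, so the displayed rows keep their brackets and duplicates without any recalculation.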

Lastly (this is where I find the chosen approach a bit questionable): since the method counts the number of changes needed to make the strings match, what would be the best way (perhaps another layer of checking) to deal with examples like “Oldstewart” vs “Oldstewart2 Local print (Former)”? Matching these requires deleting text, and each deletion counts as a change, so the pair would show a low similarity under the Levenshtein method.
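One possible extra layer for that case is a plain substring containment check, run alongside the Levenshtein score. The sketch below is an assumption about what such a check could look like, not an established method: `IsContained` is a hypothetical name, and it only reports whether the shorter string occurs somewhere inside the longer one, ignoring case.

```vba
Option Explicit

' Hypothetical second-layer check: True when the shorter string occurs
' inside the longer one (case-insensitive).
Function IsContained(ByVal a As String, ByVal b As String) As Boolean
    Dim shorter As String, longer As String
    a = Trim$(a): b = Trim$(b)
    If Len(a) <= Len(b) Then
        shorter = a: longer = b
    Else
        shorter = b: longer = a
    End If
    If Len(shorter) = 0 Then Exit Function      ' an empty string never counts as a match
    IsContained = (InStr(1, longer, shorter, vbTextCompare) > 0)
End Function
```

For example, `IsContained("Oldstewart", "Oldstewart2 Local print")` is True even though the edit distance between the two is large. Such pairs could be flagged or ranked higher, at the cost of false positives for very short strings.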

As for the use of the Levenshtein method, the output would look somewhat like this:

**Column1**                             **Column2**                                             **Match(%)**
Modra Digest (DC)                       Modra Digest Velehrad Digest (DC) (used via Bembek)     63
South West Local /Sunday Times (new)    South West Local South West Local /Sunday Times         64
OldStewart political print              Oldstewart                                              38
Oldstewart2 Local print (Former)        Oldstewart2                                             48
Saigon Last month                       Saigon Last month Saigon Last month                     94

Data output

The function:

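Function Levenshtein3(ByVal string1 As String, ByVal string2 As String) As Long
' Returns the similarity of two strings as a percentage (100 = exact match),
' based on the case-insensitive Levenshtein distance.

Dim i As Long, j As Long, string1_length As Long, string2_length As Long
Dim distance() As Long, smStr1() As Long, smStr2() As Long
Dim min1 As Long, min2 As Long, min3 As Long, minmin As Long, MaxL As Long

string1_length = Len(string1):  string2_length = Len(string2)

' Two empty strings are identical; returning early also avoids dividing by zero below.
If string1_length = 0 And string2_length = 0 Then
    Levenshtein3 = 100
    Exit Function
End If

' Size the arrays to the inputs; fixed bounds raise "Subscript out of range" on longer strings.
ReDim distance(0 To string1_length, 0 To string2_length)
ReDim smStr1(0 To string1_length)
ReDim smStr2(0 To string2_length)

' First row and column hold the distance from an empty prefix; cache lower-case character codes.
distance(0, 0) = 0
For i = 1 To string1_length:    distance(i, 0) = i: smStr1(i) = Asc(LCase(Mid$(string1, i, 1))): Next
For j = 1 To string2_length:    distance(0, j) = j: smStr2(j) = Asc(LCase(Mid$(string2, j, 1))): Next
For i = 1 To string1_length
    For j = 1 To string2_length
        If smStr1(i) = smStr2(j) Then
            ' Characters match: no additional edit needed
            distance(i, j) = distance(i - 1, j - 1)
        Else
            min1 = distance(i - 1, j) + 1       ' deletion
            min2 = distance(i, j - 1) + 1       ' insertion
            min3 = distance(i - 1, j - 1) + 1   ' substitution
            If min2 < min1 Then
                If min2 < min3 Then minmin = min2 Else minmin = min3
            Else
                If min1 < min3 Then minmin = min1 Else minmin = min3
            End If
            distance(i, j) = minmin
        End If
    Next
Next

' Convert the distance to a percent match (100 = exact), relative to the longer string
MaxL = string1_length: If string2_length > MaxL Then MaxL = string2_length
Levenshtein3 = 100 - CLng((distance(string1_length, string2_length) * 100) / MaxL)

End Function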

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other.

Many thanks for your help.

Jay

Source: https://stackoverflow.com/questions/62856241/string-matching-in-vba-using-a-predefined-function

Tags

vba

automation

string-matching