Question
I have the following data that I want to match. After going through several techniques, the Levenshtein distance method seems the most suitable. Would you agree with this approach for the data below, or would you recommend another method that would match it better at high volumes?
An example of the data:
| **Column1** | **Column2** |
| --- | --- |
| Modra Digest (DC) | Oldstewart2 |
| South West Local /Sunday Times (new) | Oldstewart |
| OldStewart political print | Saigon Last month Saigon Last month |
| Oldstewart2 Local print (Former) | Modra Digest Velehrad Digest (DC) (used via Bembek) |
| Saigon Last month | South West Local South West Local /Sunday Times |
data input
Should I go ahead with the Levenshtein distance method (defined as a VBA function called Levenshtein3, with the result converted to a percentage), I would like to tweak how the function is applied and have it run as a macro. The columns I am matching (A and B) have different numbers of entries that differ in structure (i.e. even when sorted alphabetically, the matching items won't be next to one another). Would there be any way to do the following?
1. Temporarily remove all brackets and their contents from both compared strings, and TRIM the strings to remove leading and trailing spaces.
2. Remove any duplicated words from each string.
3. Find the best Levenshtein match between A and B, based on the highest similarity percentage (as computed by the function), and reorder columns A and B so that the best-matching items sit next to one another.
4. Show the original form (i.e. including brackets and duplicated words) in the layout described in point 3, but without recalculating (i.e. the match percentage should be the one computed from the TRIMmed, de-duplicated, bracket-free form from points 1 and 2).
5. Sort the list of records from best match to worst.
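The first two clean-up steps (bracket removal with trimming, then word de-duplication) could be sketched roughly as follows. `CleanForMatch` is a hypothetical name, not something from the question. The sketch assumes Windows Excel, because it late-binds `VBScript.RegExp` and `Scripting.Dictionary`, and it does not handle nested brackets.

```vba
' Hypothetical helper (name is an assumption): returns a copy of s with
' bracketed text removed, whitespace collapsed and repeated words dropped.
Function CleanForMatch(ByVal s As String) As String
    Dim re As Object, seen As Object
    Dim words() As String, w As String, result As String
    Dim i As Long

    ' Remove every "( ... )" group. Late binding avoids a library reference.
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "\([^)]*\)"
    s = re.Replace(s, " ")

    ' Collapse the whitespace left behind, then trim both ends.
    re.Pattern = "\s+"
    s = Trim$(re.Replace(s, " "))
    If Len(s) = 0 Then Exit Function   ' nothing left: returns ""

    ' Keep only the first occurrence of each word, ignoring case.
    Set seen = CreateObject("Scripting.Dictionary")
    seen.CompareMode = vbTextCompare
    words = Split(s, " ")
    For i = LBound(words) To UBound(words)
        w = words(i)
        If Not seen.Exists(w) Then
            seen.Add w, True
            result = result & " " & w
        End If
    Next i

    CleanForMatch = Mid$(result, 2)    ' drop the leading space
End Function
```

With the sample data, `CleanForMatch("Modra Digest Velehrad Digest (DC) (used via Bembek)")` gives `Modra Digest Velehrad`, and `CleanForMatch("Saigon Last month Saigon Last month")` gives `Saigon Last month`. Scoring would then run on these cleaned strings while the original cell text is kept for display.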
Lastly (and this is where I find the chosen approach a bit questionable): since the method counts the number of edits needed to turn one string into the other, what would be the best way (perhaps an additional layer of checking) to deal with cases like “Oldstewart” vs “Oldstewart2 Local print (Former)”? Matching these requires deleting text, and each deletion counts as an edit, so the Levenshtein method reports a low similarity.
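One possible extra layer, offered only as a heuristic sketch, is a word-level score: for each word of the string with fewer words, take its best `Levenshtein3` score against any word of the other string, then average those. Extra words in the longer string then no longer count against the match. `WordCoverage` is a hypothetical name, and the sketch assumes the `Levenshtein3` function from this question is in the same module.

```vba
' Hypothetical second score (0-100): how well the words of the string with
' fewer words are covered by the words of the other string.
' Assumes Levenshtein3 from the question is available in the same module.
Function WordCoverage(ByVal a As String, ByVal b As String) As Long
    Dim shortWords() As String, longWords() As String, tmp() As String
    Dim i As Long, j As Long, best As Long, score As Long, total As Long

    ' Normalise spacing so that Split never yields empty words.
    a = Trim$(a): b = Trim$(b)
    Do While InStr(a, "  ") > 0: a = Replace(a, "  ", " "): Loop
    Do While InStr(b, "  ") > 0: b = Replace(b, "  ", " "): Loop
    If Len(a) = 0 Or Len(b) = 0 Then Exit Function   ' returns 0

    shortWords = Split(a, " ")
    longWords = Split(b, " ")
    If UBound(shortWords) > UBound(longWords) Then
        tmp = shortWords: shortWords = longWords: longWords = tmp
    End If

    ' For each word on the short side, keep its best score on the long side.
    For i = LBound(shortWords) To UBound(shortWords)
        best = 0
        For j = LBound(longWords) To UBound(longWords)
            score = Levenshtein3(shortWords(i), longWords(j))
            If score > best Then best = score
        Next j
        total = total + best
    Next i

    ' Integer average over the words of the short side.
    WordCoverage = total \ (UBound(shortWords) + 1)
End Function
```

For example, `WordCoverage("Oldstewart", "Oldstewart2 Local print")` returns 91, because the single word on the short side is one edit away from `Oldstewart2`. A score like this is lenient by design, so it is probably best used alongside the whole-string percentage (for instance as a tie-breaker or a second column) rather than as a replacement for it.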
Using the Levenshtein method, the output would look something like this:
| **Column1** | **Column2** | **Match(%)** |
| --- | --- | --- |
| Modra Digest (DC) | Modra Digest Velehrad Digest (DC) (used via Bembek) | 63 |
| South West Local /Sunday Times (new) | South West Local South West Local /Sunday Times | 64 |
| OldStewart political print | Oldstewart | 38 |
| Oldstewart2 Local print (Former) | Oldstewart2 | 48 |
| Saigon Last month | Saigon Last month Saigon Last month | 94 |
Data output
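A macro producing a layout like the one above might look like the sketch below. Everything about the sheet layout is an assumption: headers in row 1, data from row 2 in columns A and B of the active sheet, results written to columns D:F. `PairBestMatches` and `CleanText` are hypothetical names, and `CleanText` is only a placeholder for whatever normalisation is applied before scoring. The pairing is a simple greedy pass: it repeatedly takes the highest-scoring pair whose two items are still unused.

```vba
' Sketch only. Assumes Levenshtein3 from the question is in the same module,
' and that no cell in A or B is empty after cleaning.
Sub PairBestMatches()
    Dim ws As Worksheet
    Dim nA As Long, nB As Long, i As Long, j As Long, k As Long
    Dim cleanA() As String, cleanB() As String
    Dim score() As Long, usedA() As Boolean, usedB() As Boolean
    Dim bestI As Long, bestJ As Long, bestScore As Long, pairs As Long

    Set ws = ActiveSheet
    nA = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row - 1
    nB = ws.Cells(ws.Rows.Count, "B").End(xlUp).Row - 1
    If nA < 1 Or nB < 1 Then Exit Sub

    ' Clean each string once; scores are computed on the cleaned text only.
    ReDim cleanA(1 To nA): ReDim cleanB(1 To nB)
    For i = 1 To nA: cleanA(i) = CleanText(CStr(ws.Cells(i + 1, "A").Value)): Next i
    For j = 1 To nB: cleanB(j) = CleanText(CStr(ws.Cells(j + 1, "B").Value)): Next j

    ' Score every A/B combination.
    ReDim score(1 To nA, 1 To nB)
    ReDim usedA(1 To nA): ReDim usedB(1 To nB)
    For i = 1 To nA
        For j = 1 To nB
            score(i, j) = Levenshtein3(cleanA(i), cleanB(j))
        Next j
    Next i

    ' Greedy pairing: picking the best remaining pair each time means the
    ' output is already ordered from best to worst match.
    ws.Range("D1:F1").Value = Array("Column1", "Column2", "Match(%)")
    If nA < nB Then pairs = nA Else pairs = nB
    For k = 1 To pairs
        bestScore = -1
        For i = 1 To nA
            If Not usedA(i) Then
                For j = 1 To nB
                    If Not usedB(j) Then
                        If score(i, j) > bestScore Then
                            bestScore = score(i, j): bestI = i: bestJ = j
                        End If
                    End If
                Next j
            End If
        Next i
        usedA(bestI) = True: usedB(bestJ) = True
        ' Write the original, unmodified strings next to the stored score.
        ws.Cells(k + 1, "D").Value = ws.Cells(bestI + 1, "A").Value
        ws.Cells(k + 1, "E").Value = ws.Cells(bestJ + 1, "B").Value
        ws.Cells(k + 1, "F").Value = bestScore
    Next k
End Sub

' Placeholder normalisation (assumption): only trims. Bracket removal and
' word de-duplication would be swapped in here.
Private Function CleanText(ByVal s As String) As String
    CleanText = Trim$(s)
End Function
```

When the two columns differ in length, the leftover items are simply not written. Scoring every combination also grows with the product of the two column lengths, so for high volumes it would likely help to read the ranges into arrays first and to pre-filter candidate pairs.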
The function:
Function Levenshtein3(ByVal string1 As String, ByVal string2 As String) As Long
    ' Returns the similarity of two strings as a percentage (100 = exact match),
    ' based on the case-insensitive Levenshtein distance.
    Dim i As Long, j As Long, string1_length As Long, string2_length As Long
    Dim distance() As Long, smStr1() As Long, smStr2() As Long
    Dim min1 As Long, min2 As Long, min3 As Long, minmin As Long, MaxL As Long

    string1_length = Len(string1): string2_length = Len(string2)

    ' Two empty strings are identical; this also avoids dividing by zero below.
    MaxL = string1_length: If string2_length > MaxL Then MaxL = string2_length
    If MaxL = 0 Then Levenshtein3 = 100: Exit Function

    ' Size the work arrays to the inputs; fixed bounds fail on longer strings.
    ReDim distance(0 To string1_length, 0 To string2_length)
    ReDim smStr1(0 To string1_length)
    ReDim smStr2(0 To string2_length)

    ' First column and first row: cost of deleting or inserting every character.
    For i = 1 To string1_length: distance(i, 0) = i: smStr1(i) = Asc(LCase(Mid$(string1, i, 1))): Next
    For j = 1 To string2_length: distance(0, j) = j: smStr2(j) = Asc(LCase(Mid$(string2, j, 1))): Next

    For i = 1 To string1_length
        For j = 1 To string2_length
            If smStr1(i) = smStr2(j) Then
                distance(i, j) = distance(i - 1, j - 1)
            Else
                min1 = distance(i - 1, j) + 1      ' deletion
                min2 = distance(i, j - 1) + 1      ' insertion
                min3 = distance(i - 1, j - 1) + 1  ' substitution
                If min2 < min1 Then
                    If min2 < min3 Then minmin = min2 Else minmin = min3
                Else
                    If min1 < min3 Then minmin = min1 Else minmin = min3
                End If
                distance(i, j) = minmin
            End If
        Next
    Next

    ' Convert the edit distance to a percentage of the longer string's length.
    Levenshtein3 = 100 - CLng((distance(string1_length, string2_length) * 100) / MaxL)
End Function
- The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other.
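For reference, the definition above corresponds to the usual recurrence. The notation is introduced here and is not from the question: $a$ and $b$ are the two strings, and $d(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.

$$
d(i,0) = i, \qquad d(0,j) = j, \qquad
d(i,j) = \min
\begin{cases}
d(i-1,\,j) + 1 & \text{(deletion)} \\
d(i,\,j-1) + 1 & \text{(insertion)} \\
d(i-1,\,j-1) + [a_i \neq b_j] & \text{(substitution, free if the characters match)}
\end{cases}
$$

The percentage returned by `Levenshtein3` is then

$$
s = 100 - \operatorname{round}\!\left(\frac{100 \cdot d(|a|,|b|)}{\max(|a|,|b|)}\right).
$$

As a worked example, "Oldstewart" and "Oldstewart2" differ by one insertion, so $d = 1$, $\max(|a|,|b|) = 11$, and $s = 100 - \operatorname{round}(9.09) = 91$.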
Many thanks for your help.
Jay
Source: https://stackoverflow.com/questions/62856241/string-matching-in-vba-using-a-predefined-function