问题
I have a list of items and I want to identify their similarity in relation to the other items in this list.
My desired output would be something along the lines of:
The percentage shown in the similarity column is purely illustrative. I'm thinking that a test for similarity would be something along the lines of:
number of concurrent letters / by the total number of letters in the matched item
But would be keen to get opinions on that one.
Is this something which is reasonably doable on Excel? I'ts a small data set (140kb) containing only alphanumeric values.
Am also open to alternative ways of approaching this, as I haven't tackled anything like this before!
P.s. I've been learning Python for a few months now, so suggestions using Python would also be good!
回答1:
Here is a solution using a VBA UDF:
EDIT: Added a new optional argument named arg_lMinConsecutive
which is used to determine the minimum number of consecutive characters that must match. Note the extra argument 2
in the below formulas which indicates that at least 2 consecutive characters must match.
Public Function FuzzyMatch(ByVal arg_sText As String, _
ByVal arg_vList As Variant, _
ByVal arg_lOutput As Long, _
Optional ByVal arg_lMinConsecutive As Long = 1, _
Optional ByVal arg_bMatchCase As Boolean = True, _
Optional ByVal arg_bExactCount As Boolean = True) _
As Variant
Dim dExactCounts As Object
Dim aResults() As Variant
Dim vList As Variant
Dim vListItem As Variant
Dim sLetter As String
Dim dMaxMatch As Double
Dim lMaxIndex As Long
Dim lResultIndex As Long
Dim lLastMatch As Long
Dim i As Long
Dim bMatch As Boolean
If arg_lMinConsecutive <= 0 Then
FuzzyMatch = CVErr(xlErrNum)
Exit Function
End If
If arg_bExactCount = True Then Set dExactCounts = CreateObject("Scripting.Dictionary")
If TypeName(arg_vList) = "Collection" Or TypeName(arg_vList) = "Range" Then
ReDim aResults(1 To arg_vList.Count, 1 To 3)
Set vList = arg_vList
ElseIf IsArray(arg_vList) Then
ReDim aResults(1 To UBound(arg_vList) - LBound(arg_vList) + 1, 1 To 3)
vList = arg_vList
Else
ReDim vList(1 To 1)
vList(1) = arg_vList
ReDim aResults(1 To 1, 1 To 3)
End If
dMaxMatch = 0#
lMaxIndex = 0
lResultIndex = 0
For Each vListItem In vList
If vListItem <> arg_sText Then
lLastMatch = -arg_lMinConsecutive
lResultIndex = lResultIndex + 1
aResults(lResultIndex, 3) = vListItem
If arg_bExactCount Then dExactCounts.RemoveAll
For i = 1 To Len(arg_sText) - arg_lMinConsecutive + 1
bMatch = False
sLetter = Mid(arg_sText, i, arg_lMinConsecutive)
If Not arg_bMatchCase Then sLetter = LCase(sLetter)
If arg_bExactCount Then dExactCounts(sLetter) = dExactCounts(sLetter) + 1
Select Case Abs(arg_bMatchCase) + Abs(arg_bExactCount) * 2
Case 0
'MatchCase is false and ExactCount is false
If InStr(1, vListItem, sLetter, vbTextCompare) > 0 Then bMatch = True
Case 1
'MatchCase is true and ExactCount is false
If InStr(1, vListItem, sLetter) > 0 Then bMatch = True
Case 2
'MatchCase is false and ExactCount is true
If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString, Compare:=vbTextCompare)) >= dExactCounts(sLetter) Then bMatch = True
Case 3
'MatchCase is true and ExactCount is true
If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString)) >= dExactCounts(sLetter) Then bMatch = True
End Select
If bMatch Then
aResults(lResultIndex, 1) = aResults(lResultIndex, 1) + WorksheetFunction.Min(arg_lMinConsecutive, i - lLastMatch)
lLastMatch = i
End If
Next i
If Len(vListItem) > 0 Then
aResults(lResultIndex, 2) = aResults(lResultIndex, 1) / Len(vListItem)
If aResults(lResultIndex, 2) > dMaxMatch Then
dMaxMatch = aResults(lResultIndex, 2)
lMaxIndex = lResultIndex
End If
Else
aResults(lResultIndex, 2) = 0
End If
End If
Next vListItem
If dMaxMatch = 0# Then
Select Case arg_lOutput
Case 1: FuzzyMatch = 0
Case 2: FuzzyMatch = vbNullString
Case Else: FuzzyMatch = CVErr(xlErrNum)
End Select
Else
Select Case arg_lOutput
Case 1: FuzzyMatch = Application.Min(1, aResults(lMaxIndex, 2))
Case 2: FuzzyMatch = aResults(lMaxIndex, 3)
Case Else: FuzzyMatch = CVErr(xlErrNum)
End Select
End If
End Function
Using only the original data in columns A and B, you can use this UDF to get the desired results in columns C and D:
In cell C2 and copied down is this formula:
=FuzzyMatch($B2,$B$2:$B$6,COLUMN(A2),2)
In cell D2 and copied down is this formula:
=IFERROR(INDEX(A:A,MATCH(FuzzyMatch($B2,$B$2:$B$6,COLUMN(B2),2),B:B,0)),"-")
Note that they both use the FuzzyMatch
UDF.
回答2:
In python you can use Levenshtein distance to get the results. Check out this answer:
Fuzzy string comparison in Python, confused with which library to use
回答3:
I really did not get the whole logic, but if you need the logic for the 100% here is it:
Option Explicit
Sub TestMe()
Dim rngCell As Range
Dim rngCell2 As Range
Dim lngTotal As Long
Dim lngTotal2 As Long
Dim lngCount As Long
For Each rngCell In Sheets(1).Range("A1:A5")
For Each rngCell2 In Sheets(1).Range("A1:A5")
If rngCell.Address <> rngCell2.Address Then
If InStr(1, rngCell, rngCell2) Then
rngCell.Offset(0, 1) = 1
Else
If InStr(1, rngCell2, rngCell) Then
rngCell.Offset(0, 2) = Round(CDbl(Len(rngCell) / Len(rngCell2)), 2)
End If
End If
End If
Next rngCell2
Next rngCell
End Sub
Here you go with the pic:
来源:https://stackoverflow.com/questions/43873223/text-similarity-analysis-excel