vba: get unique values from array

前端 未结 9 2101
北恋
北恋 2020-11-22 15:30

Is there any built-in functionality in vba to get unique values from a one-dimensional array? What about just getting rid of duplicates?

If not, then how would I get

9条回答
  •  遇见更好的自我
    2020-11-22 15:59

    Update (6/15/16)

    I have created much more thorough benchmarks. First of all, as @ChaimG pointed out, early binding makes a big difference (I originally used @eksortso's code above verbatim which uses late binding). Secondly, my original benchmarks only included the time to create the unique object, however, it did not test the efficiency of using the object. My point in doing this is, it doesn't really matter if I can create an object really fast if the object I create is clunky and slows me down moving forward.

    Old Remark: It turns out, that looping over a collection object is highly inefficient

    It turns out that looping over a collection can be quite efficient if you know how to do it (I didn't). As @ChaimG (yet again), pointed out in the comments, using a For Each construct is ridiculously superior to simply using a For loop. To give you an idea, before changing the loop construct, the time for Collection2 for the Test Case Size = 10^6 was over 1400s (i.e. ~23 minutes). It is now a meager 0.195s (over 7000x faster).

    For the Collection method there are two times. The first (my original benchmark Collection1) show the time to create the unique object. The second part (Collection2) shows the time to loop over the object (which is very natural) to create a returnable array as the other functions do.

    In the chart below, a yellow background indicates that it was the fastest for that test case, and red indicates the slowest ("Not Tested" algorithms are excluded). The total time for the Collection method is the sum of Collection1 and Collection2. Turquoise indicates that is was the fastest regardless of original order.

    Below is the original algorithm I created (I have modified it slightly e.g. I no longer instantiate my own data type). It returns the unique values of an array with the original order in a very respectable time and it can be modified to take on any data type. Outside of the IndexMethod, it is the fastest algorithm for very large arrays.

    Here are the main ideas behind this algorithm:

    1. Index the array
    2. Sort by values
    3. Place identical values at the end of the array and subsequently "chop" them off.
    4. Finally, sort by index.

    Below is an example:

    Let myArray = (86, 100, 33, 19, 33, 703, 19, 100, 703, 19)
    
        1.  (86, 100, 33, 19, 33, 703, 19, 100, 703, 19)
            (1 ,   2,  3,  4,  5,   6,  7,   8,   9, 10)   <<-- Indexing
    
        2.  (19, 19, 19, 33, 33, 86, 100, 100, 703, 703)   <<-- sort by values     
            (4,   7, 10,  3,  5,  1,   2,   8,   6,   9)
    
        3.  (19, 33,  86, 100, 703)   <<-- remove duplicates    
            (4,   3,   1,   2,   6)
    
        4.  (86, 100,  33, 19, 703)   
            ( 1,   2,   3,  4,   6)   <<-- sort by index
    

    Here is the code:

    Function SortingUniqueTest(ByRef myArray() As Long, bOrigIndex As Boolean) As Variant
        Dim MyUniqueArr() As Long, i As Long, intInd As Integer
        Dim StrtTime As Double, Endtime As Double, HighB As Long, LowB As Long
    
        LowB = LBound(myArray): HighB = UBound(myArray)
    
        ReDim MyUniqueArr(1 To 2, LowB To HighB)
        intInd = 1 - LowB  'Guarantees the indices span 1 to Lim
    
        For i = LowB To HighB
            MyUniqueArr(1, i) = myArray(i)
            MyUniqueArr(2, i) = i + intInd
        Next i
    
        QSLong2D MyUniqueArr, 1, LBound(MyUniqueArr, 2), UBound(MyUniqueArr, 2), 2
        Call UniqueArray2D(MyUniqueArr)
        If bOrigIndex Then QSLong2D MyUniqueArr, 2, LBound(MyUniqueArr, 2), UBound(MyUniqueArr, 2), 2
    
        SortingUniqueTest = MyUniqueArr()
    End Function
    
    Public Sub UniqueArray2D(ByRef myArray() As Long)
        Dim i As Long, j As Long, Count As Long, Count1 As Long, DuplicateArr() As Long
        Dim lngTemp As Long, HighB As Long, LowB As Long
        LowB = LBound(myArray, 2): Count = LowB: i = LowB: HighB = UBound(myArray, 2)
    
        Do While i < HighB
            j = i + 1
            If myArray(1, i) = myArray(1, j) Then
                Do While myArray(1, i) = myArray(1, j)
                    ReDim Preserve DuplicateArr(1 To Count)
                    DuplicateArr(Count) = j
                    Count = Count + 1
                    j = j + 1
                    If j > HighB Then Exit Do
                Loop
    
                QSLong2D myArray, 2, i, j - 1, 2
            End If
            i = j
        Loop
    
        Count1 = HighB
    
        If Count > 1 Then
            For i = UBound(DuplicateArr) To LBound(DuplicateArr) Step -1
                myArray(1, DuplicateArr(i)) = myArray(1, Count1)
                myArray(2, DuplicateArr(i)) = myArray(2, Count1)
                Count1 = Count1 - 1
                ReDim Preserve myArray(1 To 2, LowB To Count1)
            Next i
        End If
    End Sub
    

    Here is the sorting algorithm I use (more about this algo here).

    Sub QSLong2D(ByRef saArray() As Long, bytDim As Byte, lLow1 As Long, lHigh1 As Long, bytNum As Byte)
        Dim lLow2 As Long, lHigh2 As Long
        Dim sKey As Long, sSwap As Long, i As Byte
    
    On Error GoTo ErrorExit
    
        If IsMissing(lLow1) Then lLow1 = LBound(saArray, bytDim)
        If IsMissing(lHigh1) Then lHigh1 = UBound(saArray, bytDim)
        lLow2 = lLow1
        lHigh2 = lHigh1
    
        sKey = saArray(bytDim, (lLow1 + lHigh1) \ 2)
    
        Do While lLow2 < lHigh2
            Do While saArray(bytDim, lLow2) < sKey And lLow2 < lHigh1: lLow2 = lLow2 + 1: Loop
            Do While saArray(bytDim, lHigh2) > sKey And lHigh2 > lLow1: lHigh2 = lHigh2 - 1: Loop
    
            If lLow2 < lHigh2 Then
                For i = 1 To bytNum
                    sSwap = saArray(i, lLow2)
                    saArray(i, lLow2) = saArray(i, lHigh2)
                    saArray(i, lHigh2) = sSwap
                Next i
            End If
    
            If lLow2 <= lHigh2 Then
                lLow2 = lLow2 + 1
                lHigh2 = lHigh2 - 1
            End If
        Loop
    
        If lHigh2 > lLow1 Then QSLong2D saArray(), bytDim, lLow1, lHigh2, bytNum
        If lLow2 < lHigh1 Then QSLong2D saArray(), bytDim, lLow2, lHigh1, bytNum
    
    ErrorExit:
    
    End Sub
    

    Below is a special algorithm that is blazing fast if your data contains integers. It makes use of indexing and the Boolean data type.

    Function IndexSort(ByRef myArray() As Long, bOrigIndex As Boolean) As Variant
    '' Modified to take both positive and negative integers
        Dim arrVals() As Long, arrSort() As Long, arrBool() As Boolean
        Dim i As Long, HighB As Long, myMax As Long, myMin As Long, OffSet As Long
        Dim LowB As Long, myIndex As Long, count As Long, myRange As Long
    
        HighB = UBound(myArray)
        LowB = LBound(myArray)
    
        For i = LowB To HighB
            If myArray(i) > myMax Then myMax = myArray(i)
            If myArray(i) < myMin Then myMin = myArray(i)
        Next i
    
        OffSet = Abs(myMin)  '' Number that will be added to every element
                             '' to guarantee every index is non-negative
    
        If myMax > 0 Then
            myRange = myMax + OffSet  '' E.g. if myMax = 10 & myMin = -2, then myRange = 12
        Else
            myRange = OffSet
        End If
    
        If bOrigIndex Then
            ReDim arrSort(1 To 2, 1 To HighB)
            ReDim arrVals(1 To 2, 0 To myRange)
            ReDim arrBool(0 To myRange)
    
            For i = LowB To HighB
                myIndex = myArray(i) + OffSet
                arrBool(myIndex) = True
                arrVals(1, myIndex) = myArray(i)
                If arrVals(2, myIndex) = 0 Then arrVals(2, myIndex) = i
            Next i
    
            For i = 0 To myRange
                If arrBool(i) Then
                    count = count + 1
                    arrSort(1, count) = arrVals(1, i)
                    arrSort(2, count) = arrVals(2, i)
                End If
            Next i
    
            QSLong2D arrSort, 2, 1, count, 2
            ReDim Preserve arrSort(1 To 2, 1 To count)
        Else
            ReDim arrSort(1 To HighB)
            ReDim arrVals(0 To myRange)
            ReDim arrBool(0 To myRange)
    
            For i = LowB To HighB
                myIndex = myArray(i) + OffSet
                arrBool(myIndex) = True
                arrVals(myIndex) = myArray(i)
            Next i
    
            For i = 0 To myRange
                If arrBool(i) Then
                    count = count + 1
                    arrSort(count) = arrVals(i)
                End If
            Next i
    
            ReDim Preserve arrSort(1 To count)
        End If
    
        ReDim arrVals(0)
        ReDim arrBool(0)
    
        IndexSort = arrSort
    End Function
    

    Here are the Collection (by @DocBrown) and Dictionary (by @eksortso) Functions.

    Function CollectionTest(ByRef arrIn() As Long, Lim As Long) As Variant
        Dim arr As New Collection, a, i As Long, arrOut() As Variant, aFirstArray As Variant
        Dim StrtTime As Double, EndTime1 As Double, EndTime2 As Double, count As Long
    On Error Resume Next
    
        ReDim arrOut(1 To UBound(arrIn))
        ReDim aFirstArray(1 To UBound(arrIn))
    
        StrtTime = Timer
        For i = 1 To UBound(arrIn): aFirstArray(i) = CStr(arrIn(i)): Next i '' Convert to string
        For Each a In aFirstArray               ''' This part is actually creating the unique set
            arr.Add a, a
        Next
        EndTime1 = Timer - StrtTime
    
        StrtTime = Timer         ''' This part is writing back to an array for return
        For Each a In arr: count = count + 1: arrOut(count) = a: Next a
        EndTime2 = Timer - StrtTime
        CollectionTest = Array(arrOut, EndTime1, EndTime2)
    End Function
    
    Function DictionaryTest(ByRef myArray() As Long, Lim As Long) As Variant
        Dim StrtTime As Double, Endtime As Double
        Dim d As Scripting.Dictionary, i As Long  '' Early Binding
        Set d = New Scripting.Dictionary
        For i = LBound(myArray) To UBound(myArray): d(myArray(i)) = 1: Next i
        DictionaryTest = d.Keys()
    End Function
    

    Here is the Direct approach provided by @IsraelHoletz.

    Function ArrayUnique(ByRef aArrayIn() As Long) As Variant
        Dim aArrayOut() As Variant, bFlag As Boolean, vIn As Variant, vOut As Variant
        Dim i As Long, j As Long, k As Long
        ReDim aArrayOut(LBound(aArrayIn) To UBound(aArrayIn))
        i = LBound(aArrayIn)
        j = i
    
        For Each vIn In aArrayIn
            For k = j To i - 1
                If vIn = aArrayOut(k) Then bFlag = True: Exit For
            Next
            If Not bFlag Then aArrayOut(i) = vIn: i = i + 1
            bFlag = False
        Next
    
        If i <> UBound(aArrayIn) Then ReDim Preserve aArrayOut(LBound(aArrayIn) To i - 1)
        ArrayUnique = aArrayOut
    End Function
    
    Function DirectTest(ByRef aArray() As Long, Lim As Long) As Variant
        Dim aReturn() As Variant
        Dim StrtTime As Long, Endtime As Long, i As Long
        aReturn = ArrayUnique(aArray)
        DirectTest = aReturn
    End Function
    

    Here is the benchmark function that compares all of the functions. You should note that the last two cases are handled a little bit different because of memory issues. Also note, that I didn't test the Collection method for the Test Case Size = 10,000,000. For some reason, it was returning incorrect results and behaving unusual (I'm guessing the collection object has a limit on how many things you can put in it. I searched and I couldn't find any literature on this).

    Function UltimateTest(Lim As Long, bTestDirect As Boolean, bTestDictionary, bytCase As Byte) As Variant
    
        Dim dictionTest, collectTest, sortingTest1, indexTest1, directT '' all variants
        Dim arrTest() As Long, i As Long, bEquality As Boolean, SizeUnique As Long
        Dim myArray() As Long, StrtTime As Double, EndTime1 As Variant
        Dim EndTime2 As Double, EndTime3 As Variant, EndTime4 As Double
        Dim EndTime5 As Double, EndTime6 As Double, sortingTest2, indexTest2
    
        ReDim myArray(1 To Lim): Rnd (-2)   '' If you want to test negative numbers, 
        '' insert this to the left of CLng(Int(Lim... : (-1) ^ (Int(2 * Rnd())) *
        For i = LBound(myArray) To UBound(myArray): myArray(i) = CLng(Int(Lim * Rnd() + 1)): Next i
        arrTest = myArray
    
        If bytCase = 1 Then
            If bTestDictionary Then
                StrtTime = Timer: dictionTest = DictionaryTest(arrTest, Lim): EndTime1 = Timer - StrtTime
            Else
                EndTime1 = "Not Tested"
            End If
    
            arrTest = myArray
            collectTest = CollectionTest(arrTest, Lim)
    
            arrTest = myArray
            StrtTime = Timer: sortingTest1 = SortingUniqueTest(arrTest, True): EndTime2 = Timer - StrtTime
            SizeUnique = UBound(sortingTest1, 2)
    
            If bTestDirect Then
                arrTest = myArray: StrtTime = Timer: directT = DirectTest(arrTest, Lim): EndTime3 = Timer - StrtTime
            Else
                EndTime3 = "Not Tested"
            End If
    
            arrTest = myArray
            StrtTime = Timer: indexTest1 = IndexSort(arrTest, True): EndTime4 = Timer - StrtTime
    
            arrTest = myArray
            StrtTime = Timer: sortingTest2 = SortingUniqueTest(arrTest, False): EndTime5 = Timer - StrtTime
    
            arrTest = myArray
            StrtTime = Timer: indexTest2 = IndexSort(arrTest, False): EndTime6 = Timer - StrtTime
    
            bEquality = True
            For i = LBound(sortingTest1, 2) To UBound(sortingTest1, 2)
                If Not CLng(collectTest(0)(i)) = sortingTest1(1, i) Then
                    bEquality = False
                    Exit For
                End If
            Next i
    
            For i = LBound(dictionTest) To UBound(dictionTest)
                If Not dictionTest(i) = sortingTest1(1, i + 1) Then
                    bEquality = False
                    Exit For
                End If
            Next i
    
            For i = LBound(dictionTest) To UBound(dictionTest)
                If Not dictionTest(i) = indexTest1(1, i + 1) Then
                    bEquality = False
                    Exit For
                End If
            Next i
    
            If bTestDirect Then
                For i = LBound(dictionTest) To UBound(dictionTest)
                    If Not dictionTest(i) = directT(i + 1) Then
                        bEquality = False
                        Exit For
                    End If
                Next i
            End If
    
            UltimateTest = Array(bEquality, EndTime1, EndTime2, EndTime3, EndTime4, _
                            EndTime5, EndTime6, collectTest(1), collectTest(2), SizeUnique)
        ElseIf bytCase = 2 Then
            arrTest = myArray
            collectTest = CollectionTest(arrTest, Lim)
            UltimateTest = Array(collectTest(1), collectTest(2))
        ElseIf bytCase = 3 Then
            arrTest = myArray
            StrtTime = Timer: sortingTest1 = SortingUniqueTest(arrTest, True): EndTime2 = Timer - StrtTime
            SizeUnique = UBound(sortingTest1, 2)
            UltimateTest = Array(EndTime2, SizeUnique)
        ElseIf bytCase = 4 Then
            arrTest = myArray
            StrtTime = Timer: indexTest1 = IndexSort(arrTest, True): EndTime4 = Timer - StrtTime
            UltimateTest = EndTime4
        ElseIf bytCase = 5 Then
            arrTest = myArray
            StrtTime = Timer: sortingTest2 = SortingUniqueTest(arrTest, False): EndTime5 = Timer - StrtTime
            UltimateTest = EndTime5
        ElseIf bytCase = 6 Then
            arrTest = myArray
            StrtTime = Timer: indexTest2 = IndexSort(arrTest, False): EndTime6 = Timer - StrtTime
            UltimateTest = EndTime6
        End If
    
    End Function
    

    And finally, here is the sub that produces the table above.

    Sub GetBenchmarks()
        Dim myVar, i As Long, TestCases As Variant, j As Long, temp
    
        TestCases = Array(1000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000, 2000000, 5000000, 10000000)
    
        For j = 0 To 11
            If j < 6 Then
                myVar = UltimateTest(CLng(TestCases(j)), True, True, 1)
            ElseIf j < 10 Then
                myVar = UltimateTest(CLng(TestCases(j)), False, True, 1)
            ElseIf j < 11 Then
                myVar = Array("Not Tested", "Not Tested", 0.1, "Not Tested", 0.1, 0.1, 0.1, 0, 0, 0)
                temp = UltimateTest(CLng(TestCases(j)), False, False, 2)
                myVar(7) = temp(0): myVar(8) = temp(1)
                temp = UltimateTest(CLng(TestCases(j)), False, False, 3)
                myVar(2) = temp(0): myVar(9) = temp(1)
                myVar(4) = UltimateTest(CLng(TestCases(j)), False, False, 4)
                myVar(5) = UltimateTest(CLng(TestCases(j)), False, False, 5)
                myVar(6) = UltimateTest(CLng(TestCases(j)), False, False, 6)
            Else
                myVar = Array("Not Tested", "Not Tested", 0.1, "Not Tested", 0.1, 0.1, 0.1, "Not Tested", "Not Tested", 0)
                temp = UltimateTest(CLng(TestCases(j)), False, False, 3)
                myVar(2) = temp(0): myVar(9) = temp(1)
                myVar(4) = UltimateTest(CLng(TestCases(j)), False, False, 4)
                myVar(5) = UltimateTest(CLng(TestCases(j)), False, False, 5)
                myVar(6) = UltimateTest(CLng(TestCases(j)), False, False, 6)
            End If
    
            Cells(4 + j, 6) = TestCases(j)
            For i = 1 To 9: Cells(4 + j, 6 + i) = myVar(i - 1): Next i
            Cells(4 + j, 17) = myVar(9)
        Next j
    End Sub
    

    Summary
    From the table of results, we can see that the Dictionary method works really well for cases less than about 500,000, however, after that, the IndexMethod really starts to dominate. You will notice that when order doesn't matter and your data is made up of positive integers, there is no comparison to the IndexMethod algorithm (it returns the unique values from an array containing 10 million elements in less than 1 sec!!! Incredible!). Below I have a breakdown of which algorithm is preferred in various cases.

    Case 1
    Your Data contains integers (i.e. whole numbers, both positive and negative): IndexMethod

    Case 2
    Your Data contains non-integers (i.e. variant, double, string, etc.) with less than 200000 elements: Dictionary Method

    Case 3
    Your Data contains non-integers (i.e. variant, double, string, etc.) with more than 200000 elements: Collection Method

    If you had to choose one algorithm, in my opinion, the Collection method is still the best as it only requires a few lines of code, it's super general, and it's fast enough.

提交回复
热议问题