Unique Count Formula for large dataset

让人想犯罪 __ submitted on 2019-12-12 06:00:40

Question


I am having trouble finding a way to enter a 1 or 0 into an adjacent cell to indicate whether or not a value is unique when working with a large dataset. I have read about several methods for accomplishing this, but none of them seem efficient for my purposes: I am using Excel 2010, so I do not have the Distinct Count feature in PivotTables, and when I try to use PowerPivot it crashes my file due to processing limitations.

The StackOverflow question "Simple Pivot Table to Count Unique Values" suggests using SUMPRODUCT or COUNTIF, but with 50,000+ rows, as I have, this causes terrible performance and a file size of ~35 MB instead of ~3 MB. Is there a better solution for a large, dynamic dataset, whether it is a formula or VBA?

An example of what I would like to accomplish is (with the Unique column being the adjacent cell):

Name   Week   Unique
John   1      1
Sally  1      1
John   1      0
Sally  2      1

I attempted to script the same functionality of COUNTIF but with no success:

For Each Cell In ThisWorkbook.Worksheets("Overtime & Type Data").Range("Z2:Z" & DataLastRow)
    If Worksheets("Overtime & Type Data").Cells(Cell.Row, 26) <> Worksheets("Overtime & Type Data").Cells(Cell.Row - 1, 26) Then
        FirstCell = Cell.Row
    End If
    If (Worksheets("Overtime & Type Data").Range(Cells(FirstCell, 26), Cells(Cell.Row, 26)) = Worksheets("Overtime & Type Data").Range(Cells(Cell.Row, 26))) = True Then
        Cell.Value = 1
    Else
        Cell.Value = 0
    End If
Next Cell

Answer 1:


This code ran on over 130,000 rows successfully in less than 3 seconds. Adjust the column letters to fit your dataset.

Sub tgr()

    Const colName As String = "A"
    Const colWeek As String = "B"
    Const colOutput As String = "C"

    Dim ws As Worksheet
    Dim rngData As Range
    Dim DataCell As Range
    Dim collUniques As Collection
    Dim arrResults() As Long
    Dim ResultIndex As Long
    Dim UnqCount As Long
    Dim sKey As String

    Set ws = ThisWorkbook.Sheets("Overtime & Type Data")
    Set rngData = ws.Range(colName & 2, ws.Cells(ws.Rows.Count, colName).End(xlUp))
    Set collUniques = New Collection
    ReDim arrResults(1 To rngData.Cells.Count, 1 To 1)

    'Adding a key that already exists raises an error, which is ignored here,
    'so the collection only grows when a Name/Week pair is new
    On Error Resume Next
    For Each DataCell In rngData.Cells
        ResultIndex = ResultIndex + 1
        'The delimiter keeps pairs such as "John1"/"2" and "John"/"12" distinct
        sKey = ws.Cells(DataCell.Row, colName).Value & "|" & ws.Cells(DataCell.Row, colWeek).Value
        collUniques.Add sKey, sKey
        If collUniques.Count > UnqCount Then
            UnqCount = collUniques.Count
            arrResults(ResultIndex, 1) = 1
        Else
            arrResults(ResultIndex, 1) = 0
        End If
    Next DataCell
    On Error GoTo 0

    ws.Cells(rngData.Row, colOutput).Resize(rngData.Cells.Count).Value = arrResults

End Sub



Answer 2:


One approach is to sort by Name and Week. Then you can determine Unique for any row by comparing with the previous row.

If you need to preserve the order, you could first write a column of Index numbers (1, 2, 3, ...) to keep track of order. After calculating Unique, sort by Index to restore the original order.

The whole process could be done manually with relatively few steps, or automated with VBA.
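
For illustration, the sort-and-compare steps above could be automated along these lines. This is only a sketch under several assumptions that are not in the original answer: the sheet name "Sheet1" and the Sub name are hypothetical, headers are in row 1, Name is in column A, Week is in column B, column C receives the result, column D is free for the temporary index, and the cells contain no error values. The comparison ignores case so that it agrees with Excel's default case-insensitive sort.

```vba
Sub FlagUniqueBySorting()
    ' Hypothetical layout: A = Name, B = Week, C = Unique (output),
    ' D = temporary index column, headers in row 1.
    Dim ws As Worksheet
    Dim lastRow As Long, n As Long, r As Long
    Dim vData As Variant
    Dim vIdx() As Long, vOut() As Long

    Set ws = ThisWorkbook.Worksheets("Sheet1")   ' hypothetical sheet name
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    If lastRow < 2 Then Exit Sub                 ' no data rows
    n = lastRow - 1

    ' 1. Write index numbers 1, 2, 3, ... so the original order can be restored.
    ReDim vIdx(1 To n, 1 To 1)
    For r = 1 To n
        vIdx(r, 1) = r
    Next r
    ws.Range("D2").Resize(n, 1).Value = vIdx

    ' 2. Sort by Name, then Week, then index (so the earliest occurrence
    '    of each Name/Week pair comes first within its group).
    ws.Range("A1:D" & lastRow).Sort _
        Key1:=ws.Range("A1"), Order1:=xlAscending, _
        Key2:=ws.Range("B1"), Order2:=xlAscending, _
        Key3:=ws.Range("D1"), Order3:=xlAscending, _
        Header:=xlYes

    ' 3. Compare each row with the previous row, in memory.
    vData = ws.Range("A2:B" & lastRow).Value2
    ReDim vOut(1 To n, 1 To 1)
    vOut(1, 1) = 1                               ' first sorted row is always new
    For r = 2 To n
        If StrComp(CStr(vData(r, 1)), CStr(vData(r - 1, 1)), vbTextCompare) = 0 _
           And StrComp(CStr(vData(r, 2)), CStr(vData(r - 1, 2)), vbTextCompare) = 0 Then
            vOut(r, 1) = 0
        Else
            vOut(r, 1) = 1
        End If
    Next r
    ws.Range("C2").Resize(n, 1).Value = vOut

    ' 4. Sort by the index to restore the original order, then drop the index.
    ws.Range("A1:D" & lastRow).Sort _
        Key1:=ws.Range("D1"), Order1:=xlAscending, Header:=xlYes
    ws.Range("D2:D" & lastRow).ClearContents
End Sub
```

For the manual version, with the same assumed layout, a formula such as =IF(AND(A2=A1,B2=B1),0,1) entered in C2 and filled down after sorting would play the role of step 3. It should be converted to values before the data is sorted back by Index.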




Answer 3:


I'm not sure how well this will work with 50000 values, but it goes through ~1500 in about a second.

Sub unique()
    Dim myColl As New Collection
    Dim isDup As Boolean
    Dim myValue As String
    Dim r As Long

    On Error GoTo DuplicateValue
    For r = 1 To Sheet1.UsedRange.Rows.Count
        isDup = False
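        'Combine the values of the 2 cells (with a delimiter, so that
        ' pairs such as "John1"/"2" and "John"/"12" stay distinct)
        ' and use that string as the key when adding to our collection
        'If the key is already in the collection, Add raises an error
        myValue = Sheet1.Cells(r, 1).Value & "|" & Sheet1.Cells(r, 2).Value
        myColl.Add r, myValue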
        If isDup Then
            Sheet1.Cells(r, 3).Value = "0"
        Else
            Sheet1.Cells(r, 3).Value = "1"
        End If
    Next
    On Error GoTo 0
    Exit Sub
DuplicateValue:
    'The value is already in the collection so put a 0
    isDup = True
    Resume Next
End Sub



Answer 4:


Just about any bulk operation will beat a loop involving worksheet cells. You might be able to trim the time down a bit by performing all of the calculations in memory and only returning the values back to the worksheet en masse when it is complete.

Sub is_a_dupe()
    Dim v As Long, vTMP As Variant, vUNQs As Variant, dUNQs As Object

    Debug.Print Timer
    On Error GoTo bm_Uh_Oh
    Set dUNQs = CreateObject("Scripting.Dictionary")

    With Worksheets("Sheet1")

        vTMP = .Range(.Cells(2, 1), .Cells(Rows.Count, 2).End(xlUp)).Value2
        ReDim vUNQs(1 To UBound(vTMP, 1), 1 To 1)

        For v = LBound(vTMP, 1) To UBound(vTMP, 1)
            If dUNQs.Exists(Join(Array(vTMP(v, 1), vTMP(v, 2)))) Then
                vUNQs(v, 1) = 0
            Else
                dUNQs.Add Key:=Join(Array(vTMP(v, 1), vTMP(v, 2))), _
                          Item:=vTMP(v, 2)
                vUNQs(v, 1) = 1
            End If
        Next v

        .Cells(2, 3).Resize(UBound(vUNQs, 1), 1) = vUNQs

    End With

    Debug.Print Timer

bm_Uh_Oh:
    dUNQs.RemoveAll
    Set dUNQs = Nothing
End Sub

Experience tells me that the variety of the data (as well as hardware, etc.) will affect the timing, but with my random sample data I got these elapsed times:

 50K records ..... 0.53 seconds
130K records .... 1.32 seconds
500K records .... 4.92 seconds



Source: https://stackoverflow.com/questions/32125625/unique-count-formula-for-large-dataset
