Removing duplicates, keeping last entry — optimization

前端 未结 2 1311
爱一瞬间的悲伤
爱一瞬间的悲伤 2021-01-25 10:06

I\'m working on a macro that will go through a spreadsheet and remove duplicate entries (rows) based on two criteria that are provided separately in two columns (columns Q and D

相关标签:
2条回答
  • 2021-01-25 10:35

    This procedure deletes all the duplicated rows identified by column D and Q. Among duplicates, it will keep the row the closest to the bottom of the sheet. Basically, an indexed column is created on the right to sort and move all the duplicated rows at the bottom so they can be deleted in a single call. Note that it doesn't alter the cells formula or format if there is any.

    Sub DeleteDuplicatedRows()
      Dim rgTable As Range, rgIndex As Range, dataColD(), dataColQ()
    
      Set rgTable = ActiveSheet.UsedRange
    
      ' load each column representing the identifier in an array
      dataColD = rgTable.Columns("D").value  ' load values from column D
      dataColQ = rgTable.Columns("Q").value  ' load values from column Q
    
      ' get each unique row number with a dictionary
      Dim dict As New VBA.collection, indexes(), r&, rr
      On Error Resume Next
      For r = UBound(dataColD) To 1 Step -1
        dict.Add r, dataColD(r, 1) & vbNullChar & dataColQ(r, 1)
      Next
      On Error GoTo 0
    
      ' index all the unique rows in an array
      ReDim indexes(1 To UBound(dataColD), 1 To 1)
      For Each rr In dict: indexes(rr, 1) = rr: Next
    
      ' insert the indexes in the last column on the right
      Set rgIndex = rgTable.Columns(rgTable.Columns.count + 1)
      rgIndex.value = indexes
    
      ' sort the rows on the indexes, duplicates will move at the end
      Union(rgTable, rgIndex).Sort key1:=rgIndex, Orientation:=xlTopToBottom, Header:=xlYes
    
      ' delete the index column on the right and the empty rows at the bottom
      rgIndex.EntireColumn.Delete
      rgTable.Resize(UBound(dataColD) - dict.count + 1).offset(dict.count).EntireRow.Delete
    
    End Sub
    
    0 讨论(0)
  • 2021-01-25 10:57

    100,00 rows × 87 columns in 40.3 seconds.

    If your data set(s) start at 30K rows and only get bigger you should be looking to in-memory processing whenever possible¹. I've adapted the methods used in this solution to more closely follow your requirements.

    The following bulk loads all values into a variant array and builds a Scripting.Dictionary object from the results. The 'overwrite' method of adding keys to the dictionary is used so that only the last one is kept.

    When the collation has been performed, the values are returned to a re-dimensioned variant array and restored to the worksheet en masse.

    Module1 (Code)

    Option Explicit
    
    Sub removeDupesKeepLast()
        Dim d As Long, dDQs As Object, ky As Variant
        Dim r As Long, c As Long, vVALs As Variant, vTMP As Variant
    
        'appTGGL bTGGL:=False   'uncomment this when you have finished debugging
    
        Set dDQs = CreateObject("Scripting.Dictionary")
        dDQs.comparemode = vbTextCompare
    
        'step 1 - bulk load the values
        With Worksheets("Sheet1")   'you should know what worksheet you are on
            With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
                With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
                    vVALs = .Value  'use .Value2 if you do not have dates in unformatted cells
                End With
            End With
        End With
    
        'step 2 - build the dictionary
        ReDim vTMP(UBound(vVALs, 2) - 1)
        For r = LBound(vVALs, 1) To UBound(vVALs, 1)
            For c = LBound(vVALs, 2) To UBound(vVALs, 2)
                vTMP(c - 1) = vVALs(r, c)
            Next c
            dDQs.Item(vVALs(r, 4) & ChrW(8203) & vVALs(r, 17)) = vTMP
        Next r
    
        'step 3 - put the de-duplicated values back into the array
        r = 0
        ReDim vVALs(1 To dDQs.Count, LBound(vVALs, 2) To UBound(vVALs, 2))
        For Each ky In dDQs
            r = r + 1
            vTMP = dDQs.Item(ky)
            For c = LBound(vTMP) To UBound(vTMP)
                vVALs(r, c + 1) = vTMP(c)
            Next c
        Next ky
    
        'step 4 - clear the destination; put the de-duplicated values back into the worksheet and reset .UsedRange
        With Worksheets("Sheet1")   'you should know what worksheet you are on
            With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
                With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
                    .ClearContents  'retain formatting if it is there
                    .Cells(1, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
                End With
            End With
            .UsedRange   'assert the UsedRange property (refreshes it)
        End With
    
        dDQs.RemoveAll: Set dDQs = Nothing
    
        appTGGL
    End Sub
    
    Public Sub appTGGL(Optional bTGGL As Boolean = True)
        With Application
            .ScreenUpdating = bTGGL
            .EnableEvents = bTGGL
            .DisplayAlerts = bTGGL
            .AutoRecover.Enabled = bTGGL   'no interruptions with an auto-save
            .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
            .CutCopyMode = False
            .StatusBar = vbNullString
        End With
        Debug.Print Timer
    End Sub
    

    My sample workbook took 100K rows × 87 column with ~24% duplicates and processed all duplicates (keeping the last entries) in ~40 seconds. The above writes back to Sheet1; my tests were run writing back to Sheet2 in order to retain the original data. If you choose to write back to a different worksheet, make sure that there are some values starting at A1 in order that the Range.CurrentRegion property can be properly identified. The test machine was an older laptop running 32-bit Excel 2010; your own results will likely vary.


    ¹ See Highlight Duplicates and Filter by color alternative for tip[s on dealing with large data sets in Excel.

    0 讨论(0)
提交回复
热议问题