I'm working on a macro that will go through a spreadsheet and remove duplicate rows based on two criteria that are provided in two separate columns (columns Q and D).
This procedure deletes all the duplicated rows identified by columns D and Q. Among duplicates, it keeps the row closest to the bottom of the sheet. Essentially, an index column is created to the right of the data, the sheet is sorted on it so that all the duplicated rows move to the bottom, and those rows are then deleted in a single call. Note that it doesn't alter cell formulas or formatting, if there are any.
Sub DeleteDuplicatedRows()
    Dim rgTable As Range, rgIndex As Range, dataColD(), dataColQ()
    Set rgTable = ActiveSheet.UsedRange

    ' load each column holding an identifier into an array
    dataColD = rgTable.Columns("D").Value  ' values from column D
    dataColQ = rgTable.Columns("Q").Value  ' values from column Q

    ' collect the row number of each unique key with a Collection;
    ' Add fails on a duplicate key, so iterating bottom-up keeps the
    ' row closest to the bottom
    Dim dict As New VBA.Collection, indexes(), r&, rr
    On Error Resume Next
    For r = UBound(dataColD) To 1 Step -1
        dict.Add r, dataColD(r, 1) & vbNullChar & dataColQ(r, 1)
    Next
    On Error GoTo 0

    ' index all the unique rows in an array
    ReDim indexes(1 To UBound(dataColD), 1 To 1)
    For Each rr In dict: indexes(rr, 1) = rr: Next

    ' insert the indexes in the first column to the right of the data
    Set rgIndex = rgTable.Columns(rgTable.Columns.Count + 1)
    rgIndex.Value = indexes

    ' sort the rows on the indexes; duplicates (no index) move to the end
    Union(rgTable, rgIndex).Sort Key1:=rgIndex, Orientation:=xlTopToBottom, Header:=xlYes

    ' delete the index column on the right and the duplicated rows at the bottom
    rgIndex.EntireColumn.Delete
    rgTable.Resize(UBound(dataColD) - dict.Count + 1).Offset(dict.Count).EntireRow.Delete
End Sub
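For readers outside VBA, the keep-the-bottom-most-duplicate logic above can be sketched in Python. This is an illustration of the idea only, not of the macro's Range mechanics, and the key-column indexes are parameters rather than the hard-coded D and Q:

```python
def dedupe_keep_last(rows, key_cols=(3, 16)):
    """Keep the last row for each key, preserving the original row order.

    rows: list of lists; key_cols: 0-based indexes of the key columns
    (columns D and Q would be 3 and 16 on a sheet starting at column A).
    """
    last_index = {}
    for i, row in enumerate(rows):
        key = tuple(row[c] for c in key_cols)
        last_index[key] = i          # a later row overwrites an earlier one
    # sorting the surviving indexes preserves the sheet's original order,
    # just like sorting on the index column does in the macro
    return [rows[i] for i in sorted(last_index.values())]
```

As in the macro, the kept rows come back in their original relative order; only the bottom-most copy of each duplicate pair survives.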
100,000 rows × 87 columns in 40.3 seconds.
If your data set(s) start at 30K rows and only get bigger, you should be looking at in-memory processing whenever possible¹. I've adapted the methods used in this solution to follow your requirements more closely.
The following bulk-loads all values into a variant array and builds a Scripting.Dictionary object from the results. The 'overwrite' method of adding keys to the dictionary is used so that only the last occurrence of each key is kept.
When the collation has been performed, the values are returned to a re-dimensioned variant array and restored to the worksheet en masse.
Module1 (Code)
Option Explicit

Sub removeDupesKeepLast()
    Dim d As Long, dDQs As Object, ky As Variant
    Dim r As Long, c As Long, vVALs As Variant, vTMP As Variant

    'appTGGL bTGGL:=False  'uncomment this when you have finished debugging
    Set dDQs = CreateObject("Scripting.Dictionary")
    dDQs.CompareMode = vbTextCompare

    'step 1 - bulk load the values
    With Worksheets("Sheet1")               'you should know what worksheet you are on
        With .Cells(1, 1).CurrentRegion     'block of data radiating out from A1
            With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0)  'step off the header row
                vVALs = .Value              'use .Value2 if you do not have dates in unformatted cells
            End With
        End With
    End With

    'step 2 - build the dictionary
    ReDim vTMP(UBound(vVALs, 2) - 1)
    For r = LBound(vVALs, 1) To UBound(vVALs, 1)
        For c = LBound(vVALs, 2) To UBound(vVALs, 2)
            vTMP(c - 1) = vVALs(r, c)
        Next c
        'key on columns D (4) and Q (17); a later row overwrites an earlier one
        dDQs.Item(vVALs(r, 4) & ChrW(8203) & vVALs(r, 17)) = vTMP
    Next r

    'step 3 - put the de-duplicated values back into the array
    r = 0
    ReDim vVALs(1 To dDQs.Count, LBound(vVALs, 2) To UBound(vVALs, 2))
    For Each ky In dDQs
        r = r + 1
        vTMP = dDQs.Item(ky)
        For c = LBound(vTMP) To UBound(vTMP)
            vVALs(r, c + 1) = vTMP(c)
        Next c
    Next ky

    'step 4 - clear the destination; restore the de-duplicated values and reset .UsedRange
    With Worksheets("Sheet1")               'you should know what worksheet you are on
        With .Cells(1, 1).CurrentRegion     'block of data radiating out from A1
            With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0)  'step off the header row
                .ClearContents              'retain formatting if it is there
                .Cells(1, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
            End With
        End With
        .UsedRange                          'assert the UsedRange property (refreshes it)
    End With

    dDQs.RemoveAll: Set dDQs = Nothing
    appTGGL
End Sub

Public Sub appTGGL(Optional bTGGL As Boolean = True)
    With Application
        .ScreenUpdating = bTGGL
        .EnableEvents = bTGGL
        .DisplayAlerts = bTGGL
        .AutoRecover.Enabled = bTGGL        'no interruptions with an auto-save
        .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
        .CutCopyMode = False
        .StatusBar = vbNullString
    End With
    Debug.Print Timer
End Sub
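A side note on the composite key both routines build: the two column values are joined with a separator character (vbNullChar in the first macro, ChrW(8203) here) because a bare concatenation can make distinct (D, Q) pairs collide. A quick Python sketch of the failure mode:

```python
# Without a separator, different (D, Q) pairs can produce the same key:
assert "ab" + "c" == "a" + "bc"   # both collapse to "abc"

# With a separator that cannot appear in the data, the keys stay distinct:
SEP = "\u200b"                    # zero-width space, ChrW(8203) in the VBA above
assert ("ab" + SEP + "c") != ("a" + SEP + "bc")
```

Any character guaranteed absent from the data works; the zero-width space and the null character are just convenient choices.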
My sample workbook held 100K rows × 87 columns with ~24% duplicates and processed all duplicates (keeping the last entries) in ~40 seconds. The above writes back to Sheet1; my tests were run writing back to Sheet2 in order to retain the original data. If you choose to write back to a different worksheet, make sure that there are some values starting at A1 so that the Range.CurrentRegion property can be properly identified. The test machine was an older laptop running 32-bit Excel 2010; your own results will likely vary.
¹ See Highlight Duplicates and Filter by color alternative for tips on dealing with large data sets in Excel.