A query that is used to loop through 17 millions records to remove duplicates has been running now for about 16 hours and I wanted to know if the query is sto
DELETES that have been performed up to this point will not be rolled back.
As the original author of the code in question, and having issued the caveat that performance will be dependant on indexes, I would propose the following items to speed this up.
RecordId better be PRIMARY KEY. I don't mean IDENTITY, I mean PRIMARY KEY. Confirm this using sp_help
Some index should be used in evaluating this query. Figure out which of these four columns has the least repeats and index that...
SELECT *
FROM MyTable
WHERE @long = longitude
AND @lat = latitude
AND @businessname = BusinessName
AND @phoneNumber = Phone
Before and After adding this index, check the query plan to see if index scanning has been added.
If no 'Implicit transactions' has been set, then each iteration in your loop committed the changes.
It is possible for any SQL Server to be set with 'Implicit transactions'. This is a database setting (by default is OFF). You can also have implicit transactions in the properties of a particular query inside of Management Studio (right click in query pane>options), by default settings in the client, or a SET statement.
SET IMPLICIT_TRANSACTIONS ON;
Either way, if this was the case, you would still need to execute an explicit COMMIT/ROLLBACK regardless of interruption of the query execution.
Implicit transactions reference:
http://msdn.microsoft.com/en-us/library/ms188317.aspx
http://msdn.microsoft.com/en-us/library/ms190230.aspx
Also try thinking another method to remove duplicate rows:
delete t1 from table1 as t1 where exists (
select * from table1 as t2 where
t1.column1=t2.column1 and
t1.column2=t2.column2 and
t1.column3=t2.column3 and
--add other colums if any
t1.id>t2.id
)
I suppose that you have an integer id column in your table.
I'm pretty sure that is a negatory. Otherwise what would the point of transactions be?
I think you need to seriously consider your methodolology. You need to start thinking in sets (although for performance you may need batch processing, but not row by row against a 17 million record table.)
First do all of your records have duplicates? I suspect not, so the first thing you wan to do is limit your processing to only those records which have duplicates. Since this is a large table and you may need to do the deletes in batches over time depending on what other processing is going on, you first pull the records you want to deal with into a table of their own that you then index. You can also use a temp table if you are going to be able to do this all at the same time without ever stopping it other wise create a table in your database and drop at the end.
Something like (Note I didn't write the create index statments, I figure you can look that up yourself):
SELECT min(m.RecordID), m.longitude, m.latitude, m.businessname, m.phone
into #RecordsToKeep
FROM MyTable m
join
(select longitude, latitude, businessname, phone
from MyTable
group by longitude, latitude, businessname, phone
having count(*) >1) a
on a.longitude = m.longitude and a.latitude = m.latitude and
a.businessname = b.businessname and a.phone = b.phone
group by m.longitude, m.latitude, m.businessname, m.phone
ORDER BY CASE WHEN m.webAddress is not null THEN 1 ELSE 2 END,
CASE WHEN m.caption1 is not null THEN 1 ELSE 2 END,
CASE WHEN m.caption2 is not null THEN 1 ELSE 2 END
while (select count(*) from #RecordsToKeep) > 0
begin
select top 1000 *
into #Batch
from #RecordsToKeep
Delete m
from mytable m
join #Batch b
on b.longitude = m.longitude and b.latitude = m.latitude and
b.businessname = b.businessname and b.phone = b.phone
where r.recordid <> b.recordID
Delete r
from #RecordsToKeep r
join #Batch b on r.recordid = b.recordid
end
Delete m
from mytable m
join #RecordsToKeep r
on r.longitude = m.longitude and r.latitude = m.latitude and
r.businessname = b.businessname and r.phone = b.phone
where r.recordid <> m.recordID
If your machine doesn't have very advanced hardware then it may take sql server a very long time to complete that command. I don't know for sure how this operation is performed under the hood but based on my experience this could be done more efficiently by bringing the records out of the database and into memory for a program that uses a tree structure with a remove duplicate rule for insertion. Try reading the entirety of the table in chuncks (say 10000 rows at a time) into a C++ program using ODBC. Once in the C++ program use and std::map where key is the unique key and struct is a struct that holds the rest of the data in variables. Loop over all the records and perform insertion into the map. The map insert function will handle removing the duplicates. Since search inside a map is lg(n) time far less time to find duplicates than using your while loop. You can then delete the entire table and add the tuples back into the database from the map by forming insert queries and executing them via odbc or building a text file script and running it in management studio.