I have a table with 10 columns, and that table holds thousands to millions of rows. In some scenarios I want to update more than 10K records at a time. Currently my scenario code ...
In general, a "bulk" update will be faster, regardless of database. Of course, you can test the performance of the two, and report back.
Each call to update requires a bunch of overhead, in terms of processing the query and setting up locks on tables/pages/rows. Doing a single update consolidates this overhead.
The downside to a single update is that, although it might be faster overall, it might lock underlying resources for longer periods of time. For instance, the single updates might take 10 milliseconds each, for an elapsed time of 10 seconds for 1,000 of them, yet no resource is locked for more than 10 milliseconds. The bulk update might take 5 seconds in total, but the resources would be locked for more of that period.
To speed these updates, be sure that id is indexed.
I should note that this is a general principle; I have not specifically tested single versus multiple update performance on DB2.
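To make the trade-off concrete, here is a rough, untested sketch of the two approaches from .NET. The table tab1, its id and status columns, the 'HISTORY'/'ACTIVE' values, conn (an open DB2Connection), and ids (the keys to update) are all hypothetical placeholders, not part of the question:

//Row by row: every iteration pays its own parse/optimize/lock overhead,
//but each row is locked only for the instant its statement runs.
foreach (var id in ids)
{
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "UPDATE tab1 SET status = 'HISTORY' WHERE id = @id";
        cmd.Parameters.Add("@id", id);
        cmd.ExecuteNonQuery();
    }
}

//Single set-based statement: the overhead is paid once, but the affected
//rows stay locked until the statement (and its transaction) completes.
//An index on the column(s) in the WHERE clause (id above, status here)
//is what keeps either variant from scanning the whole table.
using (var cmd = conn.CreateCommand())
{
    cmd.CommandText = "UPDATE tab1 SET status = 'HISTORY' WHERE status = 'ACTIVE'";
    cmd.ExecuteNonQuery();
}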
If you are using .NET (and there's probably a similar option in other languages like Java), there is an option you can use on your DB2Connection class called BeginChain, which will greatly improve performance.
Basically, while the chain option is active, your DB2 client keeps all of the commands in a queue. When you call EndChain, the queue is sent to the server and processed in one go.
The documentation says that this should perform much better than non-chained UPDATE/INSERT/DELETEs (and this is what we've seen in my shop), but there are some differences you might need to be aware of:
ExecuteNonQuery will return -1 when chaining is active.
Additionally, performance can be improved further by using a query with Parameter Markers instead of separate individual queries (assuming status can change as well; otherwise, you might just use a literal):
UPDATE tab1
SET status = @status
WHERE id = @id
Edit for comment: I'm not sure if the confusion is in using Parameter Markers (which are just placeholders for values in a query; see the link for more details) or in the actual usage of chaining. If it is the latter, here is some example code (I didn't verify that it works, so use at your own risk):
//GetConnection() is assumed to return an open DB2Connection
//object. It can vary by shop, so substitute whatever you do.
using (var conn = (DB2Connection) GetConnection())
{
    using (var trans = conn.BeginTransaction())
    {
        var sb = new StringBuilder();
        sb.AppendLine("UPDATE tab1 ");
        sb.AppendLine("   SET status = 'HISTORY' ");
        sb.AppendLine(" WHERE id = @id");

        //Queue every statement on the client until EndChain() is called.
        trans.Connection.BeginChain();
        using (var cmd = trans.Connection.CreateCommand())
        {
            cmd.CommandText = sb.ToString();
            cmd.Transaction = trans;

            foreach (var id in ids)
            {
                cmd.Parameters.Clear();
                cmd.Parameters.Add("@id", id);
                cmd.ExecuteNonQuery(); //returns -1 while chaining is active
            }
        }

        //Send the whole queued chain to the server in one round trip.
        trans.Connection.EndChain();
        trans.Commit();
    }
}
One other aspect I would like to point out is the commit interval. If a single update statement changes a few hundred thousand rows, the transaction log grows accordingly and the statement can become slower. I have seen the total time drop when using ETL tools like Informatica, which fire sets of update statements (one per record), each followed by a commit, compared to a single condition-based update statement that does it all in one go. This was counter-intuitive for me.
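As an illustration (untested, .NET to match the earlier answer; the batch size, table name, and column names are arbitrary placeholders), committing every N rows looks roughly like this:

//conn is an open DB2Connection; ids holds the keys to update.
//Committing every 1,000 rows keeps the transaction log and lock list small.
const int batchSize = 1000;
var count = 0;
var trans = conn.BeginTransaction();
var cmd = conn.CreateCommand();
cmd.Transaction = trans;
cmd.CommandText = "UPDATE tab1 SET status = 'HISTORY' WHERE id = @id";
foreach (var id in ids)
{
    cmd.Parameters.Clear();
    cmd.Parameters.Add("@id", id);
    cmd.ExecuteNonQuery();
    if (++count % batchSize == 0)
    {
        trans.Commit();                  //release log space and locks
        trans = conn.BeginTransaction(); //start the next batch
        cmd.Transaction = trans;
    }
}
trans.Commit();                          //commit whatever is left
//(Disposal of cmd/trans omitted for brevity.)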
I came here with the same question a week back. Then I faced a situation where I had to update a table with around 3,500 rows in a MySQL database through JDBC. I updated the same table twice: once through a for loop, iterating over a collection of objects, and once using a bulk update query. Here are my findings: clearly, the bulk update wins by a huge margin.
To answer this, let's look at how a query actually gets executed in a DBMS.
Unlike procedural languages, you tell the DBMS what to do, not how to do it. The DBMS then parses the query, optimizes it, and executes it.
Now, when you update a table row by row, every query you run goes through parsing, optimization, and execution. If you instead write a loop that builds one long query and execute that once, it is parsed only once. The time you save by using a batch update in place of the iterative approach grows almost linearly with the number of rows you update.
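For example (untested; tab1/id/status are placeholder names and the ids are assumed to be numeric), building one long statement in a loop and executing it once means the server parses and optimizes a single query:

//conn is an open DB2Connection; ids holds the numeric keys to update.
var sb = new StringBuilder();
sb.Append("UPDATE tab1 SET status = 'HISTORY' WHERE id IN (");
sb.Append(string.Join(",", ids));    //one long literal IN list
sb.Append(")");
using (var cmd = conn.CreateCommand())
{
    cmd.CommandText = sb.ToString(); //parsed and optimized exactly once
    cmd.ExecuteNonQuery();
}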
You will definitely see a performance improvement, because you reduce the number of round trips.
However, this approach does not scale very well; thousands of IDs in one statement can get tricky, and there is also a limit on the size of your query (it could be around 64 KB). You could consider 'paging' through your table and updating, say, 100 records per update statement.
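A rough sketch of that paging idea (untested; assumes numeric ids, a page size of 100, and the same placeholder table and columns as above):

//conn is an open DB2Connection; ids holds the numeric keys to update.
const int pageSize = 100;
var idList = ids.ToList();                 //requires System.Linq
for (var offset = 0; offset < idList.Count; offset += pageSize)
{
    var page = idList.Skip(offset).Take(pageSize);
    var inList = string.Join(",", page);   //a short IN list keeps the query well under any size limit
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "UPDATE tab1 SET status = 'HISTORY' WHERE id IN (" + inList + ")";
        cmd.ExecuteNonQuery();
    }
}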