Best practices for inserting/updating a large amount of data in SQL Server 2008

Submitted by 江枫思渺然 on 2019-11-29 01:37:12

Seeing that you're using SQL Server 2008, I would recommend this approach:

  • first bulk-copy your CSV files into a staging table
  • update your target table from that staging table using the MERGE command
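The first step could be sketched like this (the table name, file path, and format options are assumptions — adjust them to match your actual CSV):

```
-- Sketch only: load the raw CSV into a staging table first.
-- dbo.StagingTable and 'C:\data\import.csv' are hypothetical.
BULK INSERT dbo.StagingTable
FROM 'C:\data\import.csv'
WITH (
    FIELDTERMINATOR = ',',   -- column separator in the CSV
    ROWTERMINATOR   = '\n',  -- row separator
    FIRSTROW        = 2,     -- skip the header row, if there is one
    TABLOCK                  -- allows minimally logged, faster loads
);
```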

Check out the MSDN docs and a great blog post on how to use the MERGE command.

Basically, you create a link between your actual data table and the staging table on a common criteria (e.g. a common primary key), and then you can define what to do when

  • the rows match, e.g. the row exists in both the source and the target table --> typically you'd either update some fields, or just ignore it altogether
  • the row from the source doesn't exist in the target --> typically a case for an INSERT

You would have a MERGE statement something like this:

MERGE TargetTable AS t
USING SourceTable AS src
    ON t.PrimaryKey = src.PrimaryKey

WHEN NOT MATCHED THEN
    INSERT (list of columns)
    VALUES (list of values)

WHEN MATCHED THEN
    UPDATE
        SET (list of SET clauses)
;

Of course, the ON clause can be much more involved if needed. And of course, your WHEN statements can also be more complex, e.g.

WHEN MATCHED AND (some other condition) THEN ......

and so forth.
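As an illustration with invented names (Customers, StagingCustomers, CustomerID, Name and Email are all hypothetical — substitute your own tables and columns), a filled-in statement with such a condition might look like:

```
MERGE dbo.Customers AS t
USING dbo.StagingCustomers AS src
    ON t.CustomerID = src.CustomerID

-- only touch rows that actually changed
WHEN MATCHED AND (t.Name <> src.Name OR t.Email <> src.Email) THEN
    UPDATE SET t.Name  = src.Name,
               t.Email = src.Email

WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, Email)
    VALUES (src.CustomerID, src.Name, src.Email);
```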

MERGE is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!

HLGEM

Your way is the worst possible solution. In general, you should not think in terms of looping through records individually. We used to have a company-built import tool that looped through records; it would take 18-20 hours to load a file with over a million records (something that was a rare occurrence when the tool was built but which is a many-times-a-day occurrence now).

I see two options. First, use bulk insert to load into a staging table and do whatever clean-up you need on that table. How are you determining whether the record already exists? You should be able to build a set-based update by joining to the staging table on the fields which determine the match. Often I have added a column to my staging table for the id of the record it matches, populated it through a query, and then done the update. Then you insert the records which don't have a corresponding id. If you have too many records to do all at once, you may want to run in batches (which, yes, is a loop), but make the batches considerably larger than one record at a time (I usually start with 2000 and then, based on how long that takes, decide whether the batch can be larger or smaller).
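A rough sketch of that approach (dbo.Staging, dbo.Target, BusinessKey, SomeColumn and the batch size of 2000 are all made-up names and values for illustration; BusinessKey is assumed unique):

```
-- 1. Add a column to the staging table for the id of the matching record.
ALTER TABLE dbo.Staging ADD TargetID int NULL;

-- 2. Populate it through a query.
UPDATE s
SET    s.TargetID = t.ID
FROM   dbo.Staging s
JOIN   dbo.Target  t ON t.BusinessKey = s.BusinessKey;

-- 3. Set-based update of the rows that matched.
UPDATE t
SET    t.SomeColumn = s.SomeColumn
FROM   dbo.Target  t
JOIN   dbo.Staging s ON s.TargetID = t.ID;

-- 4. Insert the rows with no corresponding id, in batches of 2000.
WHILE EXISTS (SELECT 1 FROM dbo.Staging WHERE TargetID IS NULL)
BEGIN
    INSERT INTO dbo.Target (BusinessKey, SomeColumn)
    SELECT TOP (2000) BusinessKey, SomeColumn
    FROM   dbo.Staging
    WHERE  TargetID IS NULL;

    -- re-link the rows just inserted so they drop out of the next batch
    UPDATE s
    SET    s.TargetID = t.ID
    FROM   dbo.Staging s
    JOIN   dbo.Target  t ON t.BusinessKey = s.BusinessKey
    WHERE  s.TargetID IS NULL;
END
```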

I think 2008 also has a MERGE statement, but I have not yet had a chance to use it. Look it up in Books Online.

The alternative is to use SSIS, which is optimized for speed. SSIS is a complex thing, though, and the learning curve is steep.

One way is to load your CSV into a DataTable (or more likely a DataReader) and then batch-slam in the results using SqlBulkCopy -

http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx

It's pretty efficient, and you can do some column mapping. Tip - when you map columns using SqlBulkCopy, the column names are case sensitive.

Another approach would be to write a .NET stored procedure on the server to operate on the entire file...

Only if you need more control than Kris Krause's solution gives, though - I'm a big fan of keeping it simple (and reusable) where we can...

Do you need to be rolling your own here at all? Would it be possible to provide the data in such a way that the SQL Server can use Bulk Import to load it in and then deal with duplicates in the database once the import is complete?
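If duplicates can be dealt with after the load, one common set-based pattern (dbo.ImportedData, BusinessKey and LoadedAt are hypothetical names) is to delete them with ROW_NUMBER() once the import is complete:

```
-- Keep one row per business key, delete the rest.
WITH d AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY BusinessKey
               ORDER BY LoadedAt DESC) AS rn
    FROM dbo.ImportedData
)
DELETE FROM d
WHERE d.rn > 1;
```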

When it comes to heavy lifting with a lot of data my experience tends to be that working in the database as much as possible is much quicker and less resource intensive.
