Best practices for importing large CSV files

攒了一身酷 2020-12-14 16:39

My company gets a set of CSV files full of bank account info each month that I need to import into a database. Some of these files can be pretty big. For example, one is about 33MB.

10 answers
  • 2020-12-14 17:12

    I had this exact same problem about 2 weeks ago. I wrote some .NET code to do ROW BY ROW inserts and, by my calculations with the amount of data I had, it would take around a week to do it this way.

    So instead I used a string builder to create one HUGE query and sent it to my relational system all at once. It went from taking a week to taking 5 minutes. Now I don't know what relational system you are using, but with enormous queries you'll probably have to tweak your max_allowed_packet param or similar.
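
    Below is a rough sketch of that approach in Python rather than .NET, just to show the shape of it. The file, table, and column names are made up, and the quoting helper is deliberately naive; in real code, let your DB driver do the escaping.

    import csv

    def sql_quote(value):
        # Naive quoting for the sketch only; a real driver should escape values.
        return "'" + value.replace("'", "''") + "'"

    value_rows = []
    with open("transactions.csv", newline="") as f:   # hypothetical file name
        for record in csv.reader(f):
            value_rows.append("(" + ", ".join(sql_quote(v) for v in record) + ")")

    # One statement instead of thousands of single-row INSERTs.
    big_insert = (
        "INSERT INTO transactions (account_no, posted_on, amount, memo) VALUES\n"
        + ",\n".join(value_rows)
    )
    # Send big_insert through your driver; on MySQL you may need to raise
    # max_allowed_packet before a statement this large will be accepted.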

  • 2020-12-14 17:16

    I need to do this too from time to time (importing large, non-standardized CSVs where each row creates a dozen or so related DB objects), so I wrote a python script where I can specify what goes where and how it's all related. The script then simply generates INSERT statements (there's a rough sketch of the idea at the end of this answer).

    Here it is: csv2db

    Disclaimer: I'm basically a noob when it comes to databases, so there might be better ways to accomplish this.
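
    Not the linked csv2db, but here is a rough sketch of that kind of mapping-driven generator; the CSV headers and the table/column names are invented for illustration:

    import csv

    # Hypothetical mapping: CSV column -> (table, column). Not csv2db's real format.
    MAPPING = {
        "AcctNumber": ("accounts", "number"),
        "TxnDate":    ("transactions", "posted_on"),
        "TxnAmount":  ("transactions", "amount"),
    }

    def quote(value):
        # Naive quoting for the sketch; in real code let the DB driver escape values.
        return "'" + value.replace("'", "''") + "'"

    with open("bank_export.csv", newline="") as f:   # hypothetical file name
        for row in csv.DictReader(f):
            per_table = {}
            for csv_col, (table, db_col) in MAPPING.items():
                per_table.setdefault(table, []).append((db_col, row[csv_col]))
            for table, pairs in per_table.items():
                cols = ", ".join(c for c, _ in pairs)
                vals = ", ".join(quote(v) for _, v in pairs)
                print(f"INSERT INTO {table} ({cols}) VALUES ({vals});")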

  • 2020-12-14 17:18

    Forgive me if I'm not exactly understanding your issue correctly, but it seems like you're just trying to get a large amount of CSV data into a SQL database. Is there any reason why you want to use a web app or other code to process the CSV data into INSERT statements? I've had success importing large amounts of CSV data into SQL Server Express (free version) using SQL Server Management Studio and using BULK INSERT statements. A simple bulk insert would look like this:

    BULK INSERT [Company].[Transactions]
        FROM 'C:\Bank Files\TransactionLog.csv'
        WITH
        (
            FIELDTERMINATOR = '|',      -- field delimiter used in the file
            ROWTERMINATOR = '\n',
            MAXERRORS = 0,              -- abort on the first bad row
            DATAFILETYPE = 'widechar',  -- the source file is UTF-16
            KEEPIDENTITY
        )
    GO
    
  • 2020-12-14 17:18

    First: 33MB is not big. MySQL can easily handle data of this size.

    As you noticed, row-by-row insertion is slow. Using an ORM on top of that is even slower: there's overhead for building objects, serialization, and so on. Using an ORM to do this across 35 tables is even slower. Don't do this.

    You can indeed use LOAD DATA INFILE; just write a script that transforms your data into the desired format, separating it into per-table files in the process. You can then LOAD each file into the proper table. This script can be written in any language.
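
    A minimal sketch of that kind of splitter in Python (the source layout, file names, and column names here are invented; deduplication and type conversion are left out):

    import csv

    ACCOUNT_COLS = ["account_no", "holder_name"]      # columns for the accounts table
    TXN_COLS = ["account_no", "posted_on", "amount"]  # columns for the transactions table

    with open("raw_export.csv", newline="") as src, \
         open("accounts.csv", "w", newline="") as acc_out, \
         open("transactions.csv", "w", newline="") as txn_out:
        accounts = csv.writer(acc_out)
        transactions = csv.writer(txn_out)
        for row in csv.DictReader(src):
            accounts.writerow([row[c] for c in ACCOUNT_COLS])
            transactions.writerow([row[c] for c in TXN_COLS])

    # Then load each file into its table, e.g.:
    #   LOAD DATA LOCAL INFILE 'transactions.csv' INTO TABLE transactions
    #   FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    #   (account_no, posted_on, amount);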

    Aside from that, bulk INSERT (column, ...) VALUES ... also works. Don't guess what your row batch size should be; time it empirically, as the optimal batch size will depend on your particular database setup (server configuration, column types, indices, etc.).
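
    For example, a rough timing harness might look like this. It assumes a DB-API driver with %s placeholders (MySQLdb/PyMySQL style) and a made-up transactions table; run it against a scratch copy of the table.

    import time

    SQL = "INSERT INTO transactions (account_no, posted_on, amount) VALUES (%s, %s, %s)"

    def insert_in_batches(connection, rows, batch_size):
        # Issue one multi-row INSERT per batch of rows.
        cursor = connection.cursor()
        for i in range(0, len(rows), batch_size):
            cursor.executemany(SQL, rows[i:i + batch_size])
        connection.commit()

    def time_batch_sizes(connection, rows):
        # Try a few candidate sizes and keep whichever is fastest on your setup.
        for batch_size in (100, 500, 1000, 5000, 10000):
            start = time.perf_counter()
            insert_in_batches(connection, rows, batch_size)
            print(batch_size, "rows/batch:", round(time.perf_counter() - start, 2), "s")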

    Bulk INSERT is not going to be as fast as LOAD DATA INFILE, and you'll still have to write a script to transform raw data into usable INSERT queries. For this reason, I'd probably do LOAD DATA INFILE if at all possible.
