Improve INSERT-per-second performance of SQLite

忘掉有多难 2020-11-21 04:41

Optimizing SQLite is tricky. Bulk-insert performance of a C application can vary from 85 inserts per second to over 96,000 inserts per second!

Background:

10 Answers
  • 2020-11-21 05:41

    On bulk inserts

    Inspired by this post and by the Stack Overflow question that led me here -- Is it possible to insert multiple rows at a time in an SQLite database? -- I've posted my first Git repository:

    https://github.com/rdpoor/CreateOrUpdate

    which bulk loads an array of ActiveRecords into MySQL, SQLite or PostgreSQL databases. It includes an option to ignore existing records, overwrite them or raise an error. My rudimentary benchmarks show a 10x speed improvement compared to sequential writes -- YMMV.

    I'm using it in production code where I frequently need to import large datasets, and I'm pretty happy with it.

  • 2020-11-21 05:44

    After reading this tutorial, I tried to apply it to my program.

    I have 4-5 files that contain addresses. Each file has approximately 30 million records. I am using the same configuration that you suggest, but my number of INSERTs per second is far lower (~10,000 records per second).

    Here is where your suggestion falls short. You use a single transaction for all the records, with a single INSERT per record and no errors or failures. Suppose instead that each record is split into multiple inserts on different tables. What happens if a record is broken?

    The ON CONFLICT clause does not help here, because if a record has 10 elements and each element is inserted into a different table, then when element 5 hits a CONSTRAINT error, the previous 4 inserts have to be undone too.

    This is where the rollback comes in. The only issue with a rollback is that you lose all your inserts and have to start from the top. How can you solve this?

    My solution was to use multiple transactions. I begin and end a transaction every 10,000 records (don't ask why that number; it was the fastest one I tested). I created an array of size 10,000 and store the successfully inserted records there. When an error occurs, I do a rollback, begin a transaction, insert the records from my array, commit, and then begin a new transaction after the broken record.

    This solution helped me get around the issues I had with files containing bad or duplicate records (almost 4% of my records were bad).

    The algorithm I created cut my process by 2 hours. The final loading time for a file is 1 hr 30 min, which is still slow but much better than the 4 hrs it initially took. I managed to speed up the inserts from 10,000/s to ~14,000/s.

    If anyone has any other ideas on how to speed it up, I am open to suggestions.

    UPDATE:

    In addition to my answer above, keep in mind that inserts per second also depend on the hard drive you are using. I tested on 3 different PCs with different hard drives and got massive differences in times: PC1 (1 hr 30 min), PC2 (6 hrs), PC3 (14 hrs). So I started wondering why that would be.

    After two weeks of research and checking multiple resources (hard drive, RAM, cache), I found out that some settings on your hard drive can affect the I/O rate. If you open Properties on your output drive, you can see two options in the General tab. Opt1: Compress this drive. Opt2: Allow files on this drive to have contents indexed.

    With these two options disabled, all 3 PCs now take approximately the same time to finish (between 1 hr 20 min and 1 hr 40 min). If you encounter slow inserts, check whether your hard drive is configured with these options. It will save you a lot of time and headaches trying to find the solution.

  • 2020-11-21 05:46

    If you only care about reading, a somewhat faster version (though it might read stale data) is to read over multiple connections from multiple threads (one connection per thread).

    First, count the rows in the table:

    SELECT COUNT(*) FROM table
    

    then read in pages (LIMIT/OFFSET):

    SELECT * FROM table ORDER BY _ROWID_ LIMIT <limit> OFFSET <offset>
    

    where <limit> and <offset> are calculated per thread, like this:

    int limit = (count + n_threads - 1)/n_threads;
    

    for each thread:

    int offset = thread_index * limit;
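
Put together, the per-thread page calculation might look like the sketch below. The `page_t` type and both function names are made up for illustration, and the table name is a placeholder passed in by the caller; only the two formulas come from the answer itself.

```c
#include <assert.h>
#include <stdio.h>

/* One page of rows assigned to a reader thread. */
typedef struct {
    long limit;   /* value for LIMIT  */
    long offset;  /* value for OFFSET */
} page_t;

/* Split `count` rows across `n_threads` readers using ceiling division,
 * so the last thread simply gets a shorter (possibly empty) page. */
page_t page_for_thread(long count, int n_threads, int thread_index)
{
    assert(count >= 0 && n_threads > 0);
    assert(thread_index >= 0 && thread_index < n_threads);

    page_t p;
    p.limit = (count + n_threads - 1) / n_threads;
    p.offset = (long)thread_index * p.limit;
    return p;
}

/* Format the paged query for one thread into `buf`.
 * Returns what snprintf returns; `table_name` is a hypothetical placeholder
 * and must come from trusted code, since it is pasted into the SQL text. */
int build_page_query(char *buf, size_t size, const char *table_name, page_t p)
{
    return snprintf(buf, size,
                    "SELECT * FROM %s ORDER BY _ROWID_ LIMIT %ld OFFSET %ld",
                    table_name, p.limit, p.offset);
}
```

Each thread would then run its own query on its own connection. With 10 rows and 4 threads, for example, every page has a limit of 3 and the offsets are 0, 3, 6 and 9, so the last thread reads a single row.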
    

    For our small (200 MB) database this gave a 50-75% speed-up (SQLite 3.8.0.2, 64-bit, on Windows 7). Our tables are heavily denormalized (1000-1500 columns, roughly 100,000 or more rows).

    Too many or too few threads won't help; you need to benchmark and profile it yourself.

    Also, for us SHAREDCACHE made performance worse, so I manually set PRIVATECACHE (because shared cache was enabled globally for us).

  • 2020-11-21 05:47

    Try using SQLITE_STATIC instead of SQLITE_TRANSIENT for those inserts.

    SQLITE_TRANSIENT will cause SQLite to copy the string data before returning.

    SQLITE_STATIC tells it that the memory address you gave it will be valid until the query has been performed (which in this loop is always the case). This saves several allocate, copy, and deallocate operations per loop iteration, which may be a large improvement.
