SQLite insert speed slows as number of records increases due to an index

無奈伤痛 2020-11-29 19:59

Original question

Background

It is well known that SQLite needs to be fine-tuned to achieve insert speeds on the order of 50k inserts/s. There are many questions covering this topic.
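
The usual tuning for those speeds is batching inserts into large transactions and relaxing a few PRAGMAs; a minimal sketch of that baseline (illustrative settings and schema, not the code behind this question):

    import sqlite3

    conn = sqlite3.connect("bulk.db")
    conn.execute("PRAGMA journal_mode = WAL")   # cheaper journaling than the default
    conn.execute("PRAGMA synchronous = OFF")    # trade durability for insert speed
    conn.execute('CREATE TABLE IF NOT EXISTS "t" ("i" integer NOT NULL, "md5" text(32) NOT NULL)')

    rows = ((i, format(i, "032x")) for i in range(50_000))   # dummy payload
    with conn:                                               # one transaction per 50k-row batch
        conn.executemany('INSERT INTO "t" VALUES (?, ?)', rows)
    conn.close()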

5 Answers
  • 2020-11-29 20:16

    If your requirement is to find a particular z_id and the x_ids and y_ids linked to it (as distinct from quickly selecting a range of z_ids), you could look into a non-indexed, hash-keyed, nested-relational database. It would let you seek directly to a particular z_id to get its y_ids and x_ids, without the indexing overhead and the concomitant degradation of insert performance as the index grows. To avoid clumping (a.k.a. bucket collisions), choose a key-hashing algorithm that puts the greatest weight on the digits of z_id with the greatest variation (right-weighted).

    P.S. A database that uses a b-tree may at first appear faster than one that uses linear hashing, say, but insert performance with the linear hash stays level while b-tree insert performance begins to degrade.

    P.P.S. To answer @kawing-chiu's question: the core feature relevant here is that such a database relies on so-called "sparse" tables, in which the physical location of a record is determined by a hashing algorithm that takes the record key as input. This permits a seek directly to the record's location in the table without the intermediary of an index. Because there is no need to traverse or re-balance an index, insert times remain constant as the table becomes more densely populated. With a b-tree, by contrast, insert times degrade as the index tree grows.

    OLTP applications with large numbers of concurrent inserts can benefit from such a sparse-table approach, in which records are scattered throughout the table. The downside of records being scattered across the "tundra" of the sparse table is that gathering large sets of records sharing a common value, such as a postal code, can be slower. The hashed sparse-table approach is optimized for inserting and retrieving individual records, and for retrieving networks of related records, not large sets of records that have some field value in common.

    A nested relational database is one that permits tuples within a column of a row.
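
    To make the idea concrete, here is a toy sketch of key-hashed record placement (a minimal Python illustration, not any particular product's algorithm; the slot count and digit weighting are made up):

    TABLE_SLOTS = 1_000_003  # number of physical record slots; a prime spreads keys better

    def slot_for(z_id: int) -> int:
        """Right-weighted toy hash: the low-order digits of z_id, which vary the
        fastest, receive the largest multipliers, which reduces bucket collisions."""
        h = 0
        for i, d in enumerate(str(z_id)):
            # i grows left to right, so digits further to the right weigh more
            h += int(d) * (31 ** (i + 1))
        return h % TABLE_SLOTS

    # The key alone determines where the record lives, so the insert and the later
    # lookup are each a single direct seek -- no index to traverse or re-balance.
    slot = slot_for(123456789)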

  • 2020-11-29 20:20

    Great question and very interesting follow-up!

    I would just like to make a quick remark: you mentioned that breaking the table into smaller subtables/files and attaching them later is not an option because you would quickly reach the hard limit of 62 attached databases. While this is completely true, I don't think you have considered a mid-way option: sharding the data into several tables while keeping a single database (file).


    I did a very crude benchmark just to make sure my suggestion really has an impact on performance.

    Schema:

    CREATE TABLE IF NOT EXISTS "test_$i"
    (
        "i" integer NOT NULL,
        "md5" text(32) NOT NULL
    );
    

    Data - 2 Million Rows:

    • i = 1..2,000,000
    • md5 = md5 hex digest of i

    Each transaction = 50,000 INSERTs.


    Databases: 1; Tables: 1; Indexes: 0

    0..50000 records inserted in 1.87 seconds
    50000..100000 records inserted in 1.92 seconds
    100000..150000 records inserted in 1.97 seconds
    150000..200000 records inserted in 1.99 seconds
    200000..250000 records inserted in 2.19 seconds
    250000..300000 records inserted in 1.94 seconds
    300000..350000 records inserted in 1.94 seconds
    350000..400000 records inserted in 1.94 seconds
    400000..450000 records inserted in 1.94 seconds
    450000..500000 records inserted in 2.50 seconds
    500000..550000 records inserted in 1.94 seconds
    550000..600000 records inserted in 1.94 seconds
    600000..650000 records inserted in 1.93 seconds
    650000..700000 records inserted in 1.94 seconds
    700000..750000 records inserted in 1.94 seconds
    750000..800000 records inserted in 1.94 seconds
    800000..850000 records inserted in 1.93 seconds
    850000..900000 records inserted in 1.95 seconds
    900000..950000 records inserted in 1.94 seconds
    950000..1000000 records inserted in 1.94 seconds
    1000000..1050000 records inserted in 1.95 seconds
    1050000..1100000 records inserted in 1.95 seconds
    1100000..1150000 records inserted in 1.95 seconds
    1150000..1200000 records inserted in 1.95 seconds
    1200000..1250000 records inserted in 1.96 seconds
    1250000..1300000 records inserted in 1.98 seconds
    1300000..1350000 records inserted in 1.95 seconds
    1350000..1400000 records inserted in 1.95 seconds
    1400000..1450000 records inserted in 1.95 seconds
    1450000..1500000 records inserted in 1.95 seconds
    1500000..1550000 records inserted in 1.95 seconds
    1550000..1600000 records inserted in 1.95 seconds
    1600000..1650000 records inserted in 1.95 seconds
    1650000..1700000 records inserted in 1.96 seconds
    1700000..1750000 records inserted in 1.95 seconds
    1750000..1800000 records inserted in 1.95 seconds
    1800000..1850000 records inserted in 1.94 seconds
    1850000..1900000 records inserted in 1.95 seconds
    1900000..1950000 records inserted in 1.95 seconds
    1950000..2000000 records inserted in 1.95 seconds
    

    Database file size: 89.2 MiB.


    Databases: 1; Tables: 1; Indexes: 1 (md5)

    0..50000 records inserted in 2.90 seconds
    50000..100000 records inserted in 11.64 seconds
    100000..150000 records inserted in 10.85 seconds
    150000..200000 records inserted in 10.62 seconds
    200000..250000 records inserted in 11.28 seconds
    250000..300000 records inserted in 12.09 seconds
    300000..350000 records inserted in 10.60 seconds
    350000..400000 records inserted in 12.25 seconds
    400000..450000 records inserted in 13.83 seconds
    450000..500000 records inserted in 14.48 seconds
    500000..550000 records inserted in 11.08 seconds
    550000..600000 records inserted in 10.72 seconds
    600000..650000 records inserted in 14.99 seconds
    650000..700000 records inserted in 10.85 seconds
    700000..750000 records inserted in 11.25 seconds
    750000..800000 records inserted in 17.68 seconds
    800000..850000 records inserted in 14.44 seconds
    850000..900000 records inserted in 19.46 seconds
    900000..950000 records inserted in 16.41 seconds
    950000..1000000 records inserted in 22.41 seconds
    1000000..1050000 records inserted in 24.68 seconds
    1050000..1100000 records inserted in 28.12 seconds
    1100000..1150000 records inserted in 26.85 seconds
    1150000..1200000 records inserted in 28.57 seconds
    1200000..1250000 records inserted in 29.17 seconds
    1250000..1300000 records inserted in 36.99 seconds
    1300000..1350000 records inserted in 30.66 seconds
    1350000..1400000 records inserted in 32.06 seconds
    1400000..1450000 records inserted in 33.14 seconds
    1450000..1500000 records inserted in 47.74 seconds
    1500000..1550000 records inserted in 34.51 seconds
    1550000..1600000 records inserted in 39.16 seconds
    1600000..1650000 records inserted in 37.69 seconds
    1650000..1700000 records inserted in 37.82 seconds
    1700000..1750000 records inserted in 41.43 seconds
    1750000..1800000 records inserted in 49.58 seconds
    1800000..1850000 records inserted in 44.08 seconds
    1850000..1900000 records inserted in 57.17 seconds
    1900000..1950000 records inserted in 50.04 seconds
    1950000..2000000 records inserted in 42.15 seconds
    

    Database file size: 181.1 MiB.


    Databases: 1; Tables: 20 (one per 100,000 records); Indexes: 1 (md5)

    0..50000 records inserted in 2.91 seconds
    50000..100000 records inserted in 10.30 seconds
    100000..150000 records inserted in 10.85 seconds
    150000..200000 records inserted in 10.45 seconds
    200000..250000 records inserted in 10.11 seconds
    250000..300000 records inserted in 11.04 seconds
    300000..350000 records inserted in 10.25 seconds
    350000..400000 records inserted in 10.36 seconds
    400000..450000 records inserted in 11.48 seconds
    450000..500000 records inserted in 10.97 seconds
    500000..550000 records inserted in 10.86 seconds
    550000..600000 records inserted in 10.35 seconds
    600000..650000 records inserted in 10.77 seconds
    650000..700000 records inserted in 10.62 seconds
    700000..750000 records inserted in 10.57 seconds
    750000..800000 records inserted in 11.13 seconds
    800000..850000 records inserted in 10.44 seconds
    850000..900000 records inserted in 10.40 seconds
    900000..950000 records inserted in 10.70 seconds
    950000..1000000 records inserted in 10.53 seconds
    1000000..1050000 records inserted in 10.98 seconds
    1050000..1100000 records inserted in 11.56 seconds
    1100000..1150000 records inserted in 10.66 seconds
    1150000..1200000 records inserted in 10.38 seconds
    1200000..1250000 records inserted in 10.24 seconds
    1250000..1300000 records inserted in 10.80 seconds
    1300000..1350000 records inserted in 10.85 seconds
    1350000..1400000 records inserted in 10.46 seconds
    1400000..1450000 records inserted in 10.25 seconds
    1450000..1500000 records inserted in 10.98 seconds
    1500000..1550000 records inserted in 10.15 seconds
    1550000..1600000 records inserted in 11.81 seconds
    1600000..1650000 records inserted in 10.80 seconds
    1650000..1700000 records inserted in 11.06 seconds
    1700000..1750000 records inserted in 10.24 seconds
    1750000..1800000 records inserted in 10.57 seconds
    1800000..1850000 records inserted in 11.54 seconds
    1850000..1900000 records inserted in 10.80 seconds
    1900000..1950000 records inserted in 11.07 seconds
    1950000..2000000 records inserted in 13.27 seconds
    

    Database file size: 180.1 MiB.


    As you can see, the insert speed remains pretty much constant if you shard the data into several tables.
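
    For reference, here is a minimal sketch of how such sharded inserts can be driven from Python's sqlite3 module (the shard size, file name, and per-table md5 index mirror the last run above, but this is an illustration, not the exact benchmark script):

    import hashlib
    import sqlite3

    SHARD_SIZE = 100_000   # rows per table: 20 tables for 2 million rows
    BATCH_SIZE = 50_000    # rows per transaction, as in the benchmark
    TOTAL_ROWS = 2_000_000

    conn = sqlite3.connect("sharded.db")   # one single database file

    # Create every shard table (and its md5 index) up front.
    for n in range(TOTAL_ROWS // SHARD_SIZE):
        conn.execute(f'CREATE TABLE IF NOT EXISTS "test_{n}" '
                     '("i" integer NOT NULL, "md5" text(32) NOT NULL)')
        conn.execute(f'CREATE INDEX IF NOT EXISTS "md5_{n}" ON "test_{n}" ("md5")')

    for start in range(0, TOTAL_ROWS, BATCH_SIZE):
        table = f"test_{start // SHARD_SIZE}"   # each 50k batch falls inside one shard
        with conn:                              # commit the whole batch as one transaction
            conn.executemany(
                f'INSERT INTO "{table}" ("i", "md5") VALUES (?, ?)',
                ((i, hashlib.md5(str(i).encode()).hexdigest())
                 for i in range(start, start + BATCH_SIZE)),
            )
    conn.close()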

  • 2020-11-29 20:24

    I suspect that hash-value collisions in the index are what slow the inserts down.

    When a table has very many rows, collisions on the indexed column's hash value happen more frequently, which means the SQLite engine needs to compute the hash value two, three, or maybe even four times to obtain a distinct value.

    So my guess is that this is the root cause of the slow SQLite inserts when the table has many rows.

    This would also explain why sharding avoids the problem. Could a real expert in the SQLite domain confirm or refute this point?

  • 2020-11-29 20:29

    In my project I couldn't shard the database, as it's indexed on different columns. To speed up the inserts, I put the database on /dev/shm (a Linux ramdisk) while it is being created and then copy it over to local disk. That obviously only works well for a write-once, read-many database.
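
    A minimal sketch of that pattern (paths and schema here are hypothetical):

    import shutil
    import sqlite3

    BUILD_PATH = "/dev/shm/build.db"   # build on the ramdisk
    FINAL_PATH = "/data/final.db"      # permanent location on local disk

    conn = sqlite3.connect(BUILD_PATH)
    conn.execute('CREATE TABLE IF NOT EXISTS "t" '
                 '("i" integer NOT NULL, "md5" text(32) NOT NULL)')
    with conn:   # run the (index-heavy) inserts against RAM-backed storage
        conn.executemany('INSERT INTO "t" VALUES (?, ?)',
                         ((i, format(i, "032x")) for i in range(100_000)))
    conn.close()

    # Once the database is fully built, copy the finished file to local disk.
    shutil.copy(BUILD_PATH, FINAL_PATH)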

  • 2020-11-29 20:30

    Unfortunately, I'd say this is a limitation of large tables in SQLite. It's not designed to operate on large-scale or high-volume datasets. While I understand it may drastically increase project complexity, you're probably better off researching a more sophisticated database solution that fits your needs.

    From everything you linked, it looks like there is a direct trade-off between table size and access speed. You can't have both.
