Cassandra Bulk-Write performance with Java Driver is atrocious compared to MongoDB

Asked by 灰色年华 on 2021-01-07 02:50

I have built an importer for MongoDB and Cassandra. Basically all operations of the importer are the same, except for the last part, where the data gets shaped to match the needed Cassandra table and MongoDB document structure.

2 Answers
    Answered by 野趣味 on 2021-01-07 03:17

    When you run a batch in Cassandra, it chooses a single node to act as the coordinator. That node then becomes responsible for making sure the batched writes reach their appropriate nodes. So (for example) by batching 10000 writes together, you have now tasked one node with coordinating 10000 writes, most of which are destined for different nodes. It's very easy to tip over a node, or kill latency for an entire cluster, by doing this. Hence the limit on batch sizes.
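    As an illustration of that fan-out problem, here is roughly what the anti-pattern looks like with the DataStax Java driver 4.x; the driver version, keyspace, table, and column names are all assumptions for the sketch, not taken from the question:

        import com.datastax.oss.driver.api.core.CqlSession;
        import com.datastax.oss.driver.api.core.cql.BatchStatement;
        import com.datastax.oss.driver.api.core.cql.BatchStatementBuilder;
        import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
        import com.datastax.oss.driver.api.core.cql.PreparedStatement;

        import java.util.Map;

        public class MegaBatchAntiPattern {

            // Don't do this: thousands of rows with different partition keys in
            // one batch means a single coordinator node must route every write.
            static void importAllAtOnce(CqlSession session, Map<String, String> rowsById) {
                PreparedStatement insert = session.prepare(
                        "INSERT INTO my_keyspace.my_table (id, payload) VALUES (?, ?)");
                BatchStatementBuilder batch = BatchStatement.builder(DefaultBatchType.LOGGED);
                rowsById.forEach((id, payload) -> batch.addStatement(insert.bind(id, payload)));
                session.execute(batch.build()); // likely trips batch-size warn/fail thresholds
            }
        }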

    The problem is that Cassandra CQL BATCH is a misnomer, and it doesn't do what you or anyone else thinks that it does. It is not to be used for performance gains. Parallel, asynchronous writes will always be faster than running the same number of statements BATCHed together.
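    To make the recommended approach concrete, here is a minimal sketch of parallel, asynchronous inserts, again assuming Java driver 4.x and a hypothetical table; the in-flight cap of 512 is an arbitrary starting point to tune against your cluster:

        import com.datastax.oss.driver.api.core.CqlSession;
        import com.datastax.oss.driver.api.core.cql.PreparedStatement;

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.CompletableFuture;
        import java.util.concurrent.Semaphore;

        public class AsyncImporter {

            // Hypothetical row type standing in for the importer's data.
            record Row(String id, String payload) {}

            static void importRows(CqlSession session, List<Row> rows) throws InterruptedException {
                // Prepare once, bind per row.
                PreparedStatement insert = session.prepare(
                        "INSERT INTO my_keyspace.my_table (id, payload) VALUES (?, ?)");

                // Cap in-flight requests so the driver's request queue and the
                // cluster aren't overrun.
                Semaphore inFlight = new Semaphore(512);
                List<CompletableFuture<?>> pending = new ArrayList<>();

                for (Row row : rows) {
                    inFlight.acquire();
                    CompletableFuture<?> f = session
                            .executeAsync(insert.bind(row.id(), row.payload()))
                            .whenComplete((rs, err) -> inFlight.release())
                            .toCompletableFuture();
                    pending.add(f);
                }

                // Wait for every write to land before declaring the import done.
                CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
            }
        }

    Each write goes out as its own request, so the driver's token-aware routing can send it straight to a replica instead of funneling every row through one coordinator; error handling and retries are left out of the sketch.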

    I know that I could easily batch 10,000 rows together because they will go to the same partition. ... Would you still use single row inserts (async) rather than batches?

    That depends on whether or not write performance is your true goal. If so, then I'd still stick with parallel, async writes.
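    For completeness, the single-partition batch from the quoted question would look something like the sketch below (same driver and schema assumptions as above). Because every row shares one partition key, the coordinator fan-out problem doesn't apply, and UNLOGGED is safe here since writes to a single partition are applied atomically anyway; it just usually won't beat parallel async writes on throughput:

        import com.datastax.oss.driver.api.core.CqlSession;
        import com.datastax.oss.driver.api.core.cql.BatchStatement;
        import com.datastax.oss.driver.api.core.cql.BatchStatementBuilder;
        import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
        import com.datastax.oss.driver.api.core.cql.PreparedStatement;

        import java.util.List;

        public class SinglePartitionBatch {

            static void writePartition(CqlSession session, String partitionKey, List<String> values) {
                // Assumed schema: id is the partition key, seq a clustering column.
                PreparedStatement insert = session.prepare(
                        "INSERT INTO my_keyspace.my_table (id, seq, payload) VALUES (?, ?, ?)");

                // UNLOGGED skips the batch log; acceptable because all rows
                // target the same partition.
                BatchStatementBuilder batch = BatchStatement.builder(DefaultBatchType.UNLOGGED);
                for (int i = 0; i < values.size(); i++) {
                    batch.addStatement(insert.bind(partitionKey, i, values.get(i)));
                }
                session.execute(batch.build());
            }
        }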

    For some more good info on this, check out these two blog posts by DataStax's Ryan Svihla:

    Cassandra: Batch loading without the Batch keyword

    Cassandra: Batch Loading Without the Batch — The Nuanced Edition
