How to efficiently use Batch writes to cassandra using datastax java driver?

Submitted by 亡梦爱人 on 2019-12-05 10:46:10

First a bit of a rant:

The batch keyword in Cassandra is not a performance optimization for batching together large buckets of data for bulk loads.

Batches are used to group together atomic operations, that is, actions that you expect to occur together. Batches guarantee that if any single part of your batch succeeds, the entire batch will succeed.

Using batches will probably not make your mass ingestion run faster.

Now for your questions:

what is the purpose of unloggedBatch here?

Cassandra uses a mechanism called batch logging to ensure a batch's atomicity. By specifying an unlogged batch, you turn this functionality off, so the batch is no longer atomic and may fail with partial completion. Naturally, there is a performance penalty for logging your batches and ensuring their atomicity; using unlogged batches removes this penalty.
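To make the difference concrete, here is a minimal sketch that only builds the CQL text for a logged and an unlogged batch. The class name BatchCql and the table ks.events with its columns are hypothetical, made up for illustration; the BEGIN BATCH / BEGIN UNLOGGED BATCH / APPLY BATCH keywords are standard CQL.

```java
import java.util.List;

// Hypothetical helper, for illustration only: it assembles CQL text and
// does not talk to a cluster.
final class BatchCql {

    // Wraps the given statements in a logged or an unlogged CQL batch.
    static String batch(boolean logged, List<String> statements) {
        StringBuilder cql = new StringBuilder(logged ? "BEGIN BATCH\n" : "BEGIN UNLOGGED BATCH\n");
        for (String statement : statements) {
            cql.append("  ").append(statement).append(";\n");
        }
        return cql.append("APPLY BATCH;").toString();
    }

    public static void main(String[] args) {
        // ks.events and its columns are assumed names, not from the question.
        List<String> inserts = List.of(
                "INSERT INTO ks.events (id, seq, body) VALUES (1, 1, 'a')",
                "INSERT INTO ks.events (id, seq, body) VALUES (1, 2, 'b')");
        System.out.println(batch(true, inserts));   // atomic, pays the batch log cost
        System.out.println(batch(false, inserts));  // no batch log, may complete partially
    }
}
```

The only textual difference is the UNLOGGED keyword; the behavioural difference is whether the coordinator writes the batch log first.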

There are some cases in which you may want to use unlogged batches to ensure that requests (inserts) that belong to the same partition are sent together. If you batch operations together and they need to be performed on different partitions / nodes, you are essentially creating more work for your coordinator. See specific examples of this in Ryan's blog:

Read this post
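As a rough sketch of the single-partition idea above, the following groups rows by partition key before batching, so that each resulting group could be sent as one unlogged batch (for example a BatchStatement of type UNLOGGED in the 3.x driver). The names PartitionBatcher, Row and maxBatchSize are hypothetical; the driver call itself is left out so the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits rows into groups that each touch exactly one
// partition, capped at maxBatchSize rows per group.
final class PartitionBatcher {

    // Hypothetical row type; partitionKey stands in for the table's partition key.
    static final class Row {
        final String partitionKey;
        final String payload;

        Row(String partitionKey, String payload) {
            this.partitionKey = partitionKey;
            this.payload = payload;
        }
    }

    static List<List<Row>> groupByPartition(List<Row> rows, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        // LinkedHashMap keeps partitions in first-seen order.
        Map<String, List<Row>> byKey = new LinkedHashMap<>();
        for (Row row : rows) {
            byKey.computeIfAbsent(row.partitionKey, k -> new ArrayList<>()).add(row);
        }
        List<List<Row>> batches = new ArrayList<>();
        for (List<Row> group : byKey.values()) {
            // Chunk large partitions so no single batch grows without bound.
            for (int i = 0; i < group.size(); i += maxBatchSize) {
                int end = Math.min(i + maxBatchSize, group.size());
                batches.add(new ArrayList<>(group.subList(i, end)));
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("user-1", "a"), new Row("user-2", "b"), new Row("user-1", "c"));
        for (List<Row> batch : groupByPartition(rows, 100)) {
            // Each batch here would become one unlogged batch sent via the driver.
            System.out.println(batch.get(0).partitionKey + ": " + batch.size() + " row(s)");
        }
    }
}
```

Because every group targets one partition, the coordinator can hand the whole batch to the same set of replicas instead of fanning it out across the cluster.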

Now my question is - Is the way I am using Batch to insert into Cassandra with the Datastax Java Driver correct?

I don't see anything wrong with your code here, just depends on what you're trying to achieve. Dig into that blog post I shared for more insight.

And what about retry policies, meaning if batch statement execution failed, then what will happen, will it retry again?

A batch will not retry on its own if it fails. The driver does have retry policies, but you have to apply those separately.

The default policy in the java driver only retries in these scenarios:

  • On a read timeout, if enough replicas replied but data was not retrieved.
  • On a write timeout, if we time out while writing the distributed log used by batch statements.

Read more about the default policy and consider less conservative policies based on your use case.
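The two conditions listed above can be modelled as plain predicates. This is only a sketch of the decision logic, not the driver's code: the class DefaultRetryModel, its method names and the local WriteType enum are hypothetical stand-ins, and the "retry at most once" check reflects my understanding of the 3.x default policy rather than something stated above.

```java
// Hypothetical, driver-free model of the default retry decisions.
final class DefaultRetryModel {

    // Local stand-in for the kinds of write the server reports on a timeout.
    enum WriteType { SIMPLE, BATCH, UNLOGGED_BATCH, COUNTER, BATCH_LOG }

    // Read timeout: retry only if enough replicas answered but the data
    // itself was not retrieved, and only if we have not retried yet.
    static boolean retryOnReadTimeout(int requiredResponses, int receivedResponses,
                                      boolean dataRetrieved, int retryCount) {
        return retryCount == 0 && receivedResponses >= requiredResponses && !dataRetrieved;
    }

    // Write timeout: retry only if the timeout happened while writing the
    // batch log, and only if we have not retried yet.
    static boolean retryOnWriteTimeout(WriteType writeType, int retryCount) {
        return retryCount == 0 && writeType == WriteType.BATCH_LOG;
    }

    public static void main(String[] args) {
        System.out.println(retryOnWriteTimeout(WriteType.BATCH_LOG, 0));
        System.out.println(retryOnWriteTimeout(WriteType.UNLOGGED_BATCH, 0));
    }
}
```

Note what this implies for the question: under these rules a write timeout on an unlogged batch is not retried, so if you need that behaviour you have to configure a different policy or handle the failure in your own code.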

We debated for a while between using async and batches. We tried out both to compare. We got better throughput using "unlogged batches" compared to individual "async" requests. We don't know why, but based on Ryan's blog, I am guessing it has to do with the write size. We were probably doing too many small writes, so batching them probably gave us better performance because it reduces network traffic.

I have to mention that we are not even doing "unlogged batches" in the recommended way. The recommended way is to do a batch with a single partition key. Basically, batch all the records which belong to the same partition key. But we were just batching some records which probably belong to different partitions.

Someone did some benchmarking to compare async and "unlogged batches" and we found that quite useful. Here is the link.
