Cassandra Bulk-Write performance with Java Driver is atrocious compared to MongoDB

2021-01-07 02:50

I have built an importer for MongoDB and Cassandra. Basically all operations of the importer are the same, except for the last part, where data gets formed to match the needed format.

2 Answers
  • 2021-01-07 03:17

    When you run a batch in Cassandra, it chooses a single node to act as the coordinator. This node then becomes responsible for seeing to it that the batched writes find their appropriate nodes. So (for example) by batching 10000 writes together, you have now tasked one node with the job of coordinating 10000 writes, most of which will be for different nodes. It's very easy to tip over a node, or kill latency for an entire cluster, by doing this. Hence the limit on batch sizes.

    The problem is that Cassandra CQL BATCH is a misnomer, and it doesn't do what you or anyone else thinks that it does. It is not to be used for performance gains. Parallel, asynchronous writes will always be faster than running the same number of statements BATCHed together.
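
    For illustration, here is a minimal sketch of the anti-pattern described above, assuming the 3.x Java driver with a connected session; insertStmt and records are hypothetical names:

    // Anti-pattern: every statement in this batch is funneled through one coordinator,
    // even though the rows belong to many different partitions.
    BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
    for (MyRecord r : records) {                      // hypothetical domain objects
        batch.add(insertStmt.bind(r.key(), r.value()));
    }
    session.execute(batch);                           // one node coordinates all of these writes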

    I know that I could easily batch 10.000 rows together because they will go to the same partition. ... Would you still use single row inserts (async) rather than batches?

    That depends on whether or not write performance is your true goal. If so, then I'd still stick with parallel, async writes.
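
    By contrast, here is a minimal sketch of parallel, async writes with the 3.x Java driver; ks.tbl, records, and the bind values are assumptions for the example:

    import com.datastax.driver.core.*;
    import com.google.common.util.concurrent.Futures;
    import java.util.*;

    // Each statement is routed independently by the driver,
    // so no single node coordinates the whole load.
    PreparedStatement ps = session.prepare("INSERT INTO ks.tbl (k, v) VALUES (?, ?)");
    List<ResultSetFuture> futures = new ArrayList<>();
    for (MyRecord r : records) {                      // hypothetical domain objects
        futures.add(session.executeAsync(ps.bind(r.key(), r.value())));
    }
    Futures.allAsList(futures).get();                 // wait for all writes to complete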

    For some more good info on this, check out these two blog posts by DataStax's Ryan Svihla:

    Cassandra: Batch loading without the Batch keyword

    Cassandra: Batch Loading Without the Batch — The Nuanced Edition

  • 2021-01-07 03:33

    After using C* for a bit, I'm convinced you should really use batches only for keeping multiple tables in sync. If you don't need that feature, then don't use batches at all, because you will incur performance penalties.

    The correct way to load data into C* is with async writes, with optional backpressure if your cluster can't keep up with the ingestion rate. You should replace your "custom" batching method with something that:

    • performs async writes
    • keeps the number of in-flight writes under control
    • retries when a write times out

    To perform async writes, use the .executeAsync method, which returns a ResultSetFuture object.

    To keep the number of in-flight queries under control, collect the ResultSetFuture objects returned by .executeAsync in a list, and once the list reaches (ballpark value here) say 1k elements, wait for all of them to finish before issuing more writes. Alternatively, wait for the first one to finish before issuing the next write, just to keep the list full. A Semaphore-based alternative is sketched below.
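
    A java.util.concurrent.Semaphore is another common way to implement that cap. A minimal sketch, assuming the 3.x driver; the 1000 permits are a ballpark value and writeAsync is a hypothetical helper:

    import com.datastax.driver.core.*;
    import com.google.common.util.concurrent.*;
    import java.util.concurrent.Semaphore;

    // Acquire a permit before issuing a write, release it in the completion callback.
    final Semaphore inFlight = new Semaphore(1000);

    void writeAsync(Session session, Statement stmt) throws InterruptedException {
        inFlight.acquire();                           // blocks once 1000 writes are in flight
        ResultSetFuture future = session.executeAsync(stmt);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override public void onSuccess(ResultSet rs) { inFlight.release(); }
            @Override public void onFailure(Throwable t) {
                inFlight.release();                   // release first, then handle the failure
                // a real loader would record stmt for retry here
            }
        }, MoreExecutors.directExecutor());
    }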

    And finally, you can check for write failures when you're waiting on an operation to complete. In that case, you could:

    1. write again with the same timeout value
    2. write again with an increased timeout value
    3. wait some amount of time, and then write again with the same timeout value
    4. wait some amount of time, and then write again with an increased timeout value

    From 1 to 4, backpressure strength increases. Pick the one that best fits your case; a sketch of option 4 follows.
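
    As an illustration of option 4, here is a minimal sketch of a retry loop with an increasing wait and an increasing per-statement timeout, assuming the 3.x driver; the retry cap, initial wait, and initial timeout are arbitrary values:

    import com.datastax.driver.core.*;
    import com.datastax.driver.core.exceptions.*;

    // Option 4: wait some time, then retry with a longer timeout each round.
    void writeWithBackoff(Session session, Statement stmt) throws InterruptedException {
        long waitMs = 100;                            // assumption: initial pause
        int timeoutMs = 2000;                         // assumption: initial read timeout
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                stmt.setReadTimeoutMillis(timeoutMs); // per-statement timeout (driver 3.x)
                session.execute(stmt);                // synchronous here for simplicity
                return;
            } catch (WriteTimeoutException | OverloadedException e) {
                Thread.sleep(waitMs);                 // back off before retrying
                waitMs *= 2;                          // stronger backpressure each round
                timeoutMs *= 2;                       // and a more generous timeout
            }
        }
        throw new RuntimeException("write failed after retries");
    }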


    EDIT after question update

    Your insert logic seems a bit broken to me:

    1. I don't see any retry logic
    2. You don't remove the item from the list if it fails
    3. Your while (concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) is wrong, because you will sleep only when the number of issued queries is > concurrentInsertLimit, and because of point 2 your thread will just park there.
    4. You never set concurrentInsertErrorOccured back to false

    I usually keep a list of failed queries for the purpose of retrying them at a later time. That gives me powerful control over the queries, and when the failed queries start to accumulate I sleep for a few moments and then keep on retrying them (up to X times, then hard fail...).

    This list should be very dynamic: you add items when queries fail, and remove items when you perform a retry. That way you can understand the limits of your cluster and tune your concurrentInsertLimit based on, e.g., the average number of failed queries in the last second, or stick with the simpler approach "pause if we have an item in the retry list", etc.
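
    A minimal sketch of that dynamic retry list, assuming the 3.x driver; failedQueries and drainFailedQueries are illustrative names, and a real version would also cap the number of retries per statement:

    import com.datastax.driver.core.*;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Failed writes are added here from an onFailure callback and removed on retry.
    final ConcurrentLinkedQueue<Statement> failedQueries = new ConcurrentLinkedQueue<>();

    void drainFailedQueries(Session session) throws InterruptedException {
        Statement stmt;
        while ((stmt = failedQueries.poll()) != null) {   // remove the item when retrying
            Thread.sleep(100);                            // assumption: brief pause between retries
            session.executeAsync(stmt);                   // assumes the write path re-adds it on failure
        }
    }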


    EDIT 2 after comments

    Since you don't want any retry logic, I would change your code this way:

    private List<ResultSetFuture> runningInsertList;
    private volatile boolean concurrentInsertErrorOccured = false;
    private static int concurrentInsertLimit = 1000;
    private static int concurrentInsertSleepTime = 500; // milliseconds
    ...

    @Override
    public void executeBatch(Statement statement) throws InterruptedException {
        if (this.runningInsertList == null) {
            // Callbacks remove entries from driver threads, so the list must be thread-safe
            this.runningInsertList = Collections.synchronizedList(new ArrayList<>());
        }

        ResultSetFuture future = this.executeAsync(statement);
        this.runningInsertList.add(future);

        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet result) {
                runningInsertList.remove(future);
            }

            @Override
            public void onFailure(Throwable t) {
                runningInsertList.remove(future);
                concurrentInsertErrorOccured = true;
            }
        }, MoreExecutors.directExecutor()); // sameThreadExecutor() is deprecated

        // Sleep while the number of in-flight inserts is too high
        while (runningInsertList.size() >= concurrentInsertLimit) {
            Thread.sleep(concurrentInsertSleepTime);
        }

        if (!concurrentInsertErrorOccured) {
            // Increase the ingestion rate if no query has failed so far
            concurrentInsertLimit += 10;
        } else {
            // Decrease the ingestion rate because at least one query failed
            concurrentInsertErrorOccured = false;
            concurrentInsertLimit = Math.max(1, concurrentInsertLimit - 50);
            while (runningInsertList.size() >= concurrentInsertLimit) {
                Thread.sleep(concurrentInsertSleepTime);
            }
        }
    }
    

    You could also optimize the procedure a bit by replacing your List<ResultSetFuture> with a counter.
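
    A minimal sketch of that counter variant, reusing the fields above; executeCounted is a hypothetical name:

    import java.util.concurrent.atomic.AtomicInteger;

    // Track only the number of in-flight writes instead of holding the futures.
    private final AtomicInteger inFlightWrites = new AtomicInteger(0);

    public void executeCounted(Statement statement) throws InterruptedException {
        inFlightWrites.incrementAndGet();
        ResultSetFuture future = this.executeAsync(statement);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override public void onSuccess(ResultSet result) { inFlightWrites.decrementAndGet(); }
            @Override public void onFailure(Throwable t)      { inFlightWrites.decrementAndGet(); }
        }, MoreExecutors.directExecutor());

        // Same throttling as before, but size() on a list becomes a plain int read.
        while (inFlightWrites.get() >= concurrentInsertLimit) {
            Thread.sleep(concurrentInsertSleepTime);
        }
    }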

    Hope that helps.
