A few possibilities:
Your async example is issuing 10 writes at a time from each of 9 threads, so 90 concurrent writes, while your sync example is only doing 45 writes at a time. I would try cutting the async version down to the same rate so it's an apples-to-apples comparison.
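If you want to pin the async path to a fixed number of in-flight writes, one common way (just a sketch; the contact point, keyspace "test", table "t", and counts are placeholders, not taken from your code) is to gate executeAsync() behind a semaphore that each completed future releases:

```java
import com.datastax.driver.core.*;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import java.util.concurrent.Semaphore;

public class ThrottledAsyncWriter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("test");                     // placeholder keyspace
        PreparedStatement insert =
                session.prepare("INSERT INTO t (k, v) VALUES (?, ?)"); // placeholder table

        // Cap in-flight async writes at 45 so the async run issues the
        // same number of concurrent requests as the sync run.
        final Semaphore inFlight = new Semaphore(45);
        for (long k = 0; k < 100000; k++) {
            inFlight.acquireUninterruptibly();
            ResultSetFuture f = session.executeAsync(insert.bind(k, "v" + k));
            Futures.addCallback(f, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet rs) { inFlight.release(); }
                public void onFailure(Throwable t) {
                    inFlight.release();
                    System.err.println("Write failed: " + t);
                }
            });
        }
        // Drain: once all 45 permits are back, every write has completed.
        inFlight.acquireUninterruptibly(45);
        cluster.close();
    }
}
```

That keeps the threading model intact while making the two tests directly comparable.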
You don't say how you're checking for exceptions with the async approach. I see you are using future.get(), but it is recommended to use getUninterruptibly(), as noted in the documentation:
Waits for the query to return and return its result. This method is usually more convenient than Future.get() because it:
- Waits for the result uninterruptibly, and so doesn't throw InterruptedException.
- Returns meaningful exceptions, instead of having to deal with ExecutionException.
As such, it is the preferred way to get the future result.
So perhaps you're not seeing write exceptions that are occurring with your async example.
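For illustration, here is roughly what checking the async results with getUninterruptibly() looks like (a sketch; the contact point, keyspace "test", and table "t" are placeholders):

```java
import com.datastax.driver.core.*;
import com.datastax.driver.core.exceptions.DriverException;
import java.util.ArrayList;
import java.util.List;

public class AsyncWriteCheck {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("test");                     // placeholder keyspace
        PreparedStatement insert =
                session.prepare("INSERT INTO t (k, v) VALUES (?, ?)"); // placeholder table

        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (long k = 0; k < 1000; k++) {
            futures.add(session.executeAsync(insert.bind(k, "v" + k)));
        }
        int failed = 0;
        for (ResultSetFuture future : futures) {
            try {
                // Unlike Future.get(), getUninterruptibly() never throws
                // InterruptedException and rethrows the driver's own
                // exceptions (e.g. WriteTimeoutException) directly rather
                // than wrapping them in ExecutionException.
                future.getUninterruptibly();
            } catch (DriverException e) {
                failed++;
                System.err.println("Write failed: " + e);
            }
        }
        System.out.println(failed + " failed writes out of " + futures.size());
        cluster.close();
    }
}
```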
Another, though unlikely, possibility is that your keySource is for some reason returning duplicate partition keys, so some of the writes end up overwriting a previously inserted row and don't increase the row count. But that should impact the sync version too, which is why I say it's unlikely.
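If you want to rule that out, a quick standalone check of the keys would do it (here "keys" stands in for whatever your keySource produces, collected into a list):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeyDupCheck {
    // keys: the values your keySource produces, collected into a list
    static long countDuplicates(List<Long> keys) {
        Set<Long> distinct = new HashSet<Long>(keys);
        return keys.size() - distinct.size(); // 0 means every key is unique
    }
}
```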
I would try writing smaller sets than 9 million rows, and at a slower rate, and see if the problem only starts to happen past a certain number of inserts or a certain rate of inserts. If the number of inserts has an impact, then I'd suspect something is wrong with the row keys in the data. If the rate of inserts has an impact, then I'd suspect hot spots causing write timeout errors.
One other thing to check would be the Cassandra log file, to see if there are any exceptions being reported there.
Addendum: 12/30/14
I tried to reproduce the symptom using your sample code with Cassandra 2.1.2 and driver 2.1.3. I used a single table with a key of an incrementing number so that I could see gaps in the data. I did a lot of async inserts (30 at a time per thread in 10 threads, all using one global session).

Then I did a "select count(*)" on the table, and indeed it reported fewer rows than expected. Then I did a "select *", dumped the rows to a file, and checked for missing keys. They seemed to be randomly distributed, but when I queried for those missing rows individually, it turned out they were actually present in the table. Then I noticed that every time I did a "select count(*)", it came back with a different number, so it seems to give an approximation of the number of rows in the table rather than the actual number.
So I revised the test program to do a read-back phase after all the writes, since I know all the key values. When I did that, all the async writes were present in the table.
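The read-back phase amounted to querying each key individually rather than trusting an aggregate. A sketch of that idea (the contact point, keyspace "test", table "t", and row count are placeholders, not my exact test code):

```java
import com.datastax.driver.core.*;

public class ReadBackCheck {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("test");              // placeholder keyspace
        PreparedStatement select =
                session.prepare("SELECT k FROM t WHERE k = ?"); // placeholder table
        long n = 1000000; // how many sequential keys were written
        long missing = 0;
        for (long k = 0; k < n; k++) {
            // Query each key individually; this is exact, unlike the
            // varying numbers "select count(*)" returned in my test.
            if (session.execute(select.bind(k)).one() == null) {
                missing++;
            }
        }
        System.out.println(missing + " of " + n + " keys missing");
        cluster.close();
    }
}
```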
So my question is, how are you checking the number of rows in your table after you finish writing? Are you querying for each individual key value, or using some kind of operation like "select *"? If the latter, that seems to give most of the rows but not all of them, so perhaps your data is actually all present. Since no exceptions are being thrown, it seems to suggest that the writes are all successful. The other question would be: are you sure your key values are unique across all 9 million rows?