I am evaluating cassandra. I am using the datastax driver and CQL.
I would like to store some data with the following internal structure, where the names are different
You have a mistake in your code that I think explains a lot of the performance problems you're seeing: for each batch you prepare the statement again. Preparing a statement isn't super expensive, but doing it as you do adds a lot of latency. The time you spend waiting for that statement to be prepared is time you don't build the batch, and time Cassandra doesn't spend processing that batch. A prepared statement only needs to be prepared once and should be re-used.
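As a minimal sketch (assuming the 2.x DataStax Java driver; the keyspace, table and column names here are made up), the prepare call moves outside the loop and only bind/execute happens per row:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class PrepareOnce {
        public static void main(String[] args) {
            // Assumed schema: CREATE TABLE events (id text, ts timestamp, value text, PRIMARY KEY (id, ts))
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("test_ks");

            // Prepared exactly once, outside of any loop, then re-used for every row.
            PreparedStatement insert = session.prepare(
                "INSERT INTO events (id, ts, value) VALUES (?, ?, ?)");

            for (int i = 0; i < 100000; i++) {
                // Only bind + execute per row; no session.prepare() in here.
                session.execute(insert.bind("row-" + (i / 1000), new java.util.Date(), "v" + i));
            }

            cluster.close();
        }
    }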
I think much of the bad performance can be explained by latency problems. The bottleneck is most likely your application code, not Cassandra. Even if you only prepare that statement once, you still spend most of the time either being CPU bound in the application (building a big batch) or not doing anything (waiting for the network and Cassandra).
There are two things you can do: first of all use the async API of the CQL driver and build the next batch while the network and Cassandra are busy with the one you just completed; and secondly try running multiple threads doing the same thing. The exact number of threads you'll have to experiment with and will depend on the number of cores you have and if you're running one or three nodes on the same machine.
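A rough sketch of the async side of that, again with the Java driver: cap the number of in-flight requests with a semaphore so this thread keeps building the next statements while Cassandra and the network work on the ones already sent. The statements list and the limit of 128 are placeholders to experiment with, and you can run the same method from several threads:

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.MoreExecutors;
    import java.util.List;
    import java.util.concurrent.Semaphore;

    public class AsyncWriter {
        // Keep a bounded number of requests in flight: while Cassandra and the
        // network are busy with those, this thread is free to build the next ones.
        static void writeAsync(Session session, List<BoundStatement> statements)
                throws InterruptedException {
            final Semaphore inFlight = new Semaphore(128); // tune: 32, 128, 512, ...
            for (BoundStatement stmt : statements) {
                inFlight.acquire(); // only blocks once 128 requests are pending
                ResultSetFuture future = session.executeAsync(stmt);
                future.addListener(new Runnable() {
                    public void run() {
                        inFlight.release(); // free a slot when the response arrives
                    }
                }, MoreExecutors.sameThreadExecutor());
            }
            inFlight.acquire(128); // wait for the remaining responses before returning
        }
    }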
Running a three node cluster on the same machine makes the cluster slower than running a single node, while running on different machines makes it faster. Also running the application on the same machine doesn't exactly help. If you want to test performance, either run only one node or run a real cluster on separate machines.
Batches can give you extra performance, but not always. They can lead to the kind of problem you're seeing in your test code: buffer bloat. Once batches get too big your application spends too much time building them, then too much time pushing them out on the network, and too much time waiting for Cassandra to process them. You need to experiment with batch sizes and see what works best (but do that with a real cluster, otherwise you won't see the effects of the network, which will be a big factor when your batches get bigger).
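Something like this is enough for the batch size experiment. It uses BatchStatement from the 2.x driver (native protocol v2); on older driver/server versions you would concatenate a BEGIN UNLOGGED BATCH ... APPLY BATCH string instead. The statements list is again a placeholder:

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Session;
    import java.util.List;

    public class BatchSizeExperiment {
        // Try e.g. 10, 50, 100, 500 rows per batch and measure throughput for each,
        // instead of stuffing everything into one huge batch.
        static void writeInBatches(Session session, List<BoundStatement> statements, int batchSize) {
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (BoundStatement stmt : statements) {
                batch.add(stmt);
                if (batch.size() >= batchSize) {
                    session.execute(batch); // or executeAsync, as in the sketch above
                    batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
                }
            }
            if (batch.size() > 0) {
                session.execute(batch); // flush the last partial batch
            }
        }
    }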
And if you use batches, use compression. Compression makes no difference in most request loads (responses are another matter), but when you send huge batches it can make a big difference.
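Compression is a one-liner when building the Cluster (Snappy shown here; the matching library, snappy-java, has to be on the application's classpath):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ProtocolOptions;

    public class CompressedCluster {
        // Transport compression is configured once on the Cluster builder and then
        // applies to every request/response on that connection.
        static Cluster build(String contactPoint) {
            return Cluster.builder()
                .addContactPoint(contactPoint)
                .withCompression(ProtocolOptions.Compression.SNAPPY)
                .build();
        }
    }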
There's nothing special about wide row writes in Cassandra. With some exceptions the schema doesn't change the time it takes to process a write. I run applications that do tens of thousands of non-batched mixed wide-row and non-wide-row writes per second. The clusters aren't big, just three or four m1.xlarge EC2 nodes each. The trick is never to wait for a request to return before sending the next (that doesn't mean fire and forget, just handle the responses in the same asynchronous manner). Latency is a performance killer.
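To be concrete about "handle the responses in the same asynchronous manner": ResultSetFuture is a Guava ListenableFuture, so you can attach a callback instead of blocking on get(). The session and statement here are the same placeholders as in the earlier sketches:

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    public class CallbackWrite {
        // Not fire-and-forget: register a callback on the future instead of
        // blocking on get(), and deal with failures when they arrive.
        static void write(Session session, BoundStatement stmt) {
            ResultSetFuture future = session.executeAsync(stmt);
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet rs) {
                    // e.g. update a success counter or release a permit
                }
                public void onFailure(Throwable t) {
                    // log and retry/requeue the write rather than silently dropping it
                }
            });
        }
    }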
You are not the only person to experience this. I wrote a blog post a while ago focused more on conversion between CQL and Thrift, but there are links to mailing list issues of folks seeing the same thing (the performance issue of wide-row inserts was my initial motivation for investigating): http://thelastpickle.com/blog/2013/09/13/CQL3-to-Astyanax-Compatibility.html
In sum - CQL is great for removing the burdens of dealing with typing and understanding the data model for folks new to Cassandra. The DataStax driver is well written and contains lots of useful features.
However, the Thrift API is more than slightly faster for wide row inserts. The Netflix blog does not go into this specific use case so much. Further, the Thrift API is not legacy so long as people are using it (many folks are). It's an ASF project and as such is not run by any single vendor.
In general, with any Cassandra-based application, if you find a way of doing something that meets (or often exceeds) the performance requirements of your workload, stick with it.
Some things you can try... In your cassandra.yaml (this is Cassandra 1.2.x, maybe the params are called somewhat differently in 2.x):

- row_cache_size_in_mb: 0
- min_memory_compaction_limit_in_mb (only do this if you see some log output that says that spilling does happen)
- make sure the num_tokens / initial_token values are configured properly so rows get distributed across your nodes

Other things you can try:

Things to clarify:

- Have you checked with nodetool that the 3 nodes have found each other?
- What does nodetool say about the load distribution of your 3 nodes?
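For the nodetool checks, I mean something along these lines (nodetool status needs 1.2 or later, nodetool ring works on older versions too):

    nodetool status   # are all 3 nodes Up/Normal, and how is the load spread?
    nodetool ring     # token ownership, i.e. how rows are distributed across the nodes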