How do I load a lot of data at once in a Cassandra “cluster” of one node?

Submitted by 北城以北 on 2020-01-03 03:11:10

Question


I am working on a multi website system which uses Cassandra to handle all of its data needs.

When I first install a website, it adds 3918 pages (and growing) with many fields, attachments such as JS files, links between pages, etc.

At some point, my test "cluster" (one node) decides that the data is coming in too fast and it times out or, worse, Cassandra crashes with an out-of-memory (OOM) error. From what I can see, the 2 GB of RAM allocated to Cassandra fills up and then, more often than not, Cassandra fails to stay within its available memory and dies with an OOM. When I don't get the OOM, I get timeouts.

Is there a call in the C/C++ driver to know whether the "cluster" is slow so I can wait for a while instead of pushing more data like crazy?

At this point, the only option I can see is to do a write (INSERT INTO ...) and wait for a timeout error, more precisely CASS_ERROR_SERVER_WRITE_TIMEOUT. I find it rather ugly to wait until I get such an error before I start pacing my INSERTs to manage the load. Is that the only way?!
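For what it's worth, this is roughly what that error-driven pacing looks like with the DataStax C/C++ driver. This is only a minimal sketch: the session is assumed to be already connected, and the `site.pages` table, the retry count, and the back-off delays are made-up placeholders, not part of my real schema.

```cpp
#include <cassandra.h>
#include <unistd.h>   /* usleep */

/* Execute one INSERT and back off when the server reports a write timeout.
 * "session" is an already-connected CassSession; the query and its single
 * bound value are placeholders for illustration. */
static CassError insert_with_backoff(CassSession* session, const char* page_id) {
    CassError rc = CASS_OK;
    unsigned backoff_us = 50 * 1000;              /* start at 50 ms */

    for (int attempt = 0; attempt < 5; ++attempt) {
        CassStatement* stmt =
            cass_statement_new("INSERT INTO site.pages (id) VALUES (?)", 1);
        cass_statement_bind_string(stmt, 0, page_id);

        CassFuture* future = cass_session_execute(session, stmt);
        cass_future_wait(future);
        rc = cass_future_error_code(future);

        cass_future_free(future);
        cass_statement_free(stmt);

        if (rc != CASS_ERROR_SERVER_WRITE_TIMEOUT)
            break;                                 /* success or a non-retryable error */

        usleep(backoff_us);                        /* node is struggling: slow down */
        backoff_us *= 2;                           /* exponential backoff */
    }
    return rc;
}
```

It works, but it only reacts after the node has already fallen over the timeout threshold, which is exactly what I would like to avoid.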


Update: I was able to avoid the OOM, but only by reducing the number of plugins installed on first website creation (I do not need all the plugins installed at once). This is not a good solution, if you ask me, because a Cassandra node should NOT just crash like that. This could happen in production (and probably does, for many people), and it is intolerable to think it could happen any time the load goes a tad too high for a minute...


Answer 1:


What I personally do to load lots of data is use asynchronous queries (that's in Python, but I think you can do the same thing in C++). I insert my data asynchronously and put the response futures into a list.

When I reach a certain number (1000 in my case), I go through the list and call the result of each response, blocking synchronously until all of my queries have completed.

This way, I can send lots of queries without overloading my cluster. I don't know if it's the most efficient way, but it works well.
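Since the question is about the C/C++ driver, the same windowed approach might look like the sketch below. It is only an illustration under assumptions: the session is already connected, the `site.pages` INSERT is a placeholder, and the window of 1000 simply matches the number above.

```cpp
#include <cassandra.h>
#include <string>
#include <vector>

/* Windowed asynchronous inserts: fire queries without waiting, but block and
 * drain the window every WINDOW_SIZE statements so a single node is never
 * flooded. "session" is an already-connected CassSession; the query and the
 * page_ids input are placeholders for illustration. */
void insert_in_windows(CassSession* session, const std::vector<std::string>& page_ids) {
    const size_t WINDOW_SIZE = 1000;               /* same threshold as above */
    std::vector<CassFuture*> in_flight;
    in_flight.reserve(WINDOW_SIZE);

    for (const std::string& id : page_ids) {
        CassStatement* stmt =
            cass_statement_new("INSERT INTO site.pages (id) VALUES (?)", 1);
        cass_statement_bind_string(stmt, 0, id.c_str());

        in_flight.push_back(cass_session_execute(session, stmt));
        cass_statement_free(stmt);                 /* the driver copies the statement */

        if (in_flight.size() >= WINDOW_SIZE) {
            for (CassFuture* f : in_flight) {      /* block until the whole window lands */
                cass_future_wait(f);
                /* check cass_future_error_code(f) here and retry/log as needed */
                cass_future_free(f);
            }
            in_flight.clear();
        }
    }
    for (CassFuture* f : in_flight) {              /* drain the final partial window */
        cass_future_wait(f);
        cass_future_free(f);
    }
}
```

The point of the window is simply to cap the number of in-flight requests, so the client's ingest rate can never run far ahead of what the node acknowledges.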




Answer 2:


Single node clusters are atypical (they're not antipatterns, but they're not the primary use case). You'll have to work around some traditional behaviors.

1) Use synchronous queries instead of asynchronous ones.

2) Make sure you use a real consistency level ( QUORUM ) even on a single node, as using ANY will let you be overwhelmed.

3) Measure your own query latency. If latencies increase past a certain point (short of a full timeout), back off your insertion rate (artificially sleep); see the sketch after this list.

4) Tune the Cassandra side of the connection. 2 GB is pretty small; to run effectively on that you'll need to do some tuning. You'll probably want to tune your memtable flush thresholds to encourage more frequent flushing, and maybe explicitly configure memtable sizes based on the size of your initial document set.
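As a rough illustration of points 2 and 3 with the C/C++ driver, the sketch below writes at QUORUM and throttles itself when a write gets slow, instead of waiting for a hard timeout. The query, the `site.pages` table, the 200 ms threshold, and the "sleep for as long as the write took" policy are all placeholder assumptions, not prescribed values.

```cpp
#include <cassandra.h>
#include <chrono>
#include <thread>

/* Write at QUORUM (point 2) and back off when the node's latency climbs
 * (point 3). "session" is an already-connected CassSession. */
void paced_insert(CassSession* session, const char* page_id) {
    CassStatement* stmt =
        cass_statement_new("INSERT INTO site.pages (id) VALUES (?)", 1);
    cass_statement_bind_string(stmt, 0, page_id);
    cass_statement_set_consistency(stmt, CASS_CONSISTENCY_QUORUM);  /* point 2 */

    auto start = std::chrono::steady_clock::now();
    CassFuture* future = cass_session_execute(session, stmt);
    cass_future_wait(future);
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);

    /* point 3: the write succeeded but was slow, so artificially sleep
     * before the caller issues the next one */
    if (cass_future_error_code(future) == CASS_OK && elapsed.count() > 200)
        std::this_thread::sleep_for(elapsed);      /* back off roughly in proportion */

    cass_future_free(future);
    cass_statement_free(stmt);
}
```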




Answer 3:


See Cassandra Loader for ingesting massive amounts of data into Cassandra.



Source: https://stackoverflow.com/questions/36689227/how-do-i-load-a-lot-of-data-at-once-in-a-cassandra-cluster-of-one-node
