Cassandra Reading Benchmark with Spark


I ran a similar test with a Spark worker running on each Cassandra node.

Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a Spark job that created an RDD from the table with each row as a string, then printed a count of the rows.
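For reference, here is a minimal sketch of that kind of job using the DataStax Spark-Cassandra connector; the contact point, keyspace, and table names are placeholders, not the ones from my test:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object ReadBenchmark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-read-benchmark")
      // Placeholder contact point; point this at one of your C* nodes.
      .set("spark.cassandra.connection.host", "10.0.0.1")
    val sc = new SparkContext(conf)

    // Build an RDD over the whole table, render each row as a string,
    // and count the rows. The connector splits the scan by token range,
    // so each Spark partition covers a slice of the Cassandra ring.
    val count = sc.cassandraTable("my_keyspace", "my_table")
      .map(_.toString)
      .count()

    println(s"Row count: $count")
    sc.stop()
  }
}
```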

Here are the times I got:

1 C* node, 1 Spark worker - 1 min 42 sec
2 C* nodes, 2 Spark workers - 55 sec
4 C* nodes, 4 Spark workers - 35 sec

So throughput scales close to linearly with the number of nodes when the Spark workers are co-located with the C* nodes.

By not co-locating your workers with Cassandra, you force all of the table data to travel across the network. That will be slow, and in your environment it may be the bottleneck. If you co-locate them, you benefit from data locality: Spark builds each RDD partition from the token ranges that are local to that machine.
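You can see this locality by asking the RDD for its preferred locations, which the connector derives from token-range ownership. A short sketch, reusing the SparkContext `sc` and connector import from the example above (again with placeholder names):

```scala
// Each Cassandra-backed partition advertises the replica hosts that own
// its token range; Spark's scheduler tries to run the task on one of them.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
rdd.partitions.take(5).foreach { p =>
  println(s"partition ${p.index} prefers: ${rdd.preferredLocations(p).mkString(", ")}")
}
```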

You may also have some other bottleneck. I'm not familiar with EC2 and what it offers, but hopefully it provides local disk storage rather than network storage, since C* doesn't perform well on network storage.
