I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16 GB RAM) for testing purposes and edited spark-defaults.conf as follows:
I see that this is a very old question, but maybe someone needs it now. When running Spark on a local machine, it is very important to set the SparkConf master to "local[*]", which, according to the documentation, runs Spark with as many worker threads as there are logical cores on your machine.
It doubled the performance of the count() operation on my local machine compared to master "local".
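A minimal sketch of the setup described above (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// "local[*]" starts one worker thread per logical core;
// plain "local" runs everything on a single thread.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("count-test")  // hypothetical name
  .getOrCreate()
```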
Is that the expected performance? If not, what am I missing?
It looks slowish, but it is not exactly unexpected. In general, count
is expressed as
SELECT 1 FROM table
followed by Spark-side summation. So while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source just to sum them locally.
As explained by the docs, Cassandra-backed RDDs (not Datasets)
provide an optimized cassandraCount
method which performs server-side counting.
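The difference can be sketched as follows, assuming a hypothetical keyspace "test" and table "words" and an existing SparkContext sc:

```scala
import com.datastax.spark.connector._

// Plain count: fetches one value per row from Cassandra,
// then sums them on the Spark side.
val slowCount = sc.cassandraTable("test", "words").count()

// cassandraCount: pushes the counting down to the Cassandra
// nodes, so only per-token-range counts cross the wire.
val fastCount = sc.cassandraTable("test", "words").cassandraCount()
```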
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I am setting
spark.sql.shuffle.partitions
to (...), why is it creating (...) tasks?
Because spark.sql.shuffle.partitions
is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset
creation or global aggregations like count(*)
(which always use 1 partition for the final aggregation).
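To illustrate where the setting does and does not apply, here is a sketch (keyspace and table names are hypothetical):

```scala
// Only affects the post-shuffle stage, not the initial scan.
spark.conf.set("spark.sql.shuffle.partitions", "200")

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "words"))
  .load()

// Partition count of the scan is driven by the data source,
// not by spark.sql.shuffle.partitions.
df.rdd.getNumPartitions

// After a key-based aggregation, the shuffle produces
// spark.sql.shuffle.partitions partitions (200 here).
df.groupBy("word").count().rdd.getNumPartitions
```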
If you are interested in controlling the number of initial partitions, you should take a look at spark.cassandra.input.split.size_in_mb, which is defined as:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism,
but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
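Since spark-defaults.conf is already being edited here, the split size could be set there as well; a sketch, with an illustrative value rather than a recommendation:

```
spark.cassandra.input.split.size_in_mb  64
```

A smaller value yields more (and smaller) input partitions for the Cassandra scan, subject to the 1 + 2 * SparkContext.defaultParallelism minimum quoted above.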