Spark: PySpark + Cassandra query performance

你的背包 2021-01-18 19:31

I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16 GB RAM) for testing purposes and edited spark-defaults.conf as follows:

         


        
2 Answers
  • 2021-01-18 19:58

    I see that this is a very old question, but maybe someone still needs it. When running Spark on a local machine, it is important to set the master in SparkConf to "local[*]", which according to the documentation runs Spark with as many worker threads as there are logical cores on your machine.

    In my case this improved the performance of a count() operation on a local machine by 100% compared to master "local".
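
    A minimal PySpark sketch of that setting (the app name is just a placeholder):

        from pyspark.sql import SparkSession

        # "local[*]" uses as many worker threads as there are logical cores;
        # plain "local" runs everything in a single thread.
        spark = (SparkSession.builder
                 .master("local[*]")
                 .appName("local-count-test")
                 .getOrCreate())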

  • 2021-01-18 19:59

    Is that the expected performance? If not, what am I missing?

    It looks slowish, but it is not exactly unexpected. In general count is expressed as

    SELECT 1 FROM table
    

    followed by Spark-side summation. So while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source just to sum them locally.

    As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
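
    For reference, a minimal sketch of the DataFrame path described above, assuming the table is read through the Spark Cassandra Connector's DataFrame source and that a SparkSession named spark already exists; the keyspace and table names are placeholders. The cassandraCount method lives on the connector's RDD API (sc.cassandraTable(...).cassandraCount() in Scala), not on DataFrames.

        # Assumes the connector package is on the classpath, e.g. started with
        # --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0
        df = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="test_ks", table="test_table")  # placeholder names
              .load())

        # count() fetches a value per row from Cassandra and sums it on the Spark side.
        print(df.count())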

    Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I am setting spark.sql.shuffle.partitions to (...), why is it creating (...) tasks?

    Because spark.sql.shuffle.partitions is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or for global aggregations like count(*) (which always use 1 partition for the final aggregation).
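
    A quick way to observe this, continuing with the spark session and df from the sketch above (the grouping column is a placeholder):

        spark.conf.set("spark.sql.shuffle.partitions", "8")

        # The partitioning of the freshly loaded Dataset is not governed by
        # spark.sql.shuffle.partitions ...
        print(df.rdd.getNumPartitions())

        # ... but the output of a keyed aggregation (a shuffle) is.
        grouped = df.groupBy("some_key").count()   # "some_key" is a placeholder column
        print(grouped.rdd.getNumPartitions())      # reflects spark.sql.shuffle.partitions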

    If you are interested in controlling the number of initial partitions, you should take a look at spark.cassandra.input.split.size_in_mb, which the connector docs describe as:

    Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism

    As you can see, another factor here is spark.default.parallelism, but it is not exactly a fine-grained setting, so depending on it is in general not an optimal choice.
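
    A hedged sketch of where these two knobs are set (the values, keyspace, and table names are placeholders, not recommendations):

        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .master("local[*]")
                 .config("spark.default.parallelism", "8")                # placeholder value
                 .config("spark.cassandra.input.split.size_in_mb", "64")  # placeholder value
                 .getOrCreate())

        df = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="test_ks", table="test_table")           # placeholder names
              .load())

        # Lower bound quoted above: 1 + 2 * SparkContext.defaultParallelism
        print(df.rdd.getNumPartitions())
        print(1 + 2 * spark.sparkContext.defaultParallelism)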
