Question
I am using spark-sql-2.4.1 and spark-cassandra-connector_2.11-2.4.1 with Java 8 and Apache Cassandra 3.0.
My spark-submit / Spark cluster environment is set up as below to load 2 billion records:
--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g
Using the following configuration:
cassandra.concurrent.writes=1500
cassandra.output.batch.size.rows=10
cassandra.output.batch.size.bytes=2048
cassandra.output.batch.grouping.key=partition
cassandra.output.consistency.level=LOCAL_QUORUM
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
The job is taking around 2 hours, which is a really long time.
When I check the logs I see: WARN com.datastax.spark.connector.writer.QueryExecutor - BusyPoolException
How do I fix this?
Answer 1:
You have an incorrect value for cassandra.concurrent.writes - this means that you are sending 1500 concurrent batches at the same time, but by default the Java driver allows only 1024 simultaneous requests. Setting this parameter too high usually overloads the nodes and, as a result, causes task retries.
Also, other settings are incorrect - if you specify cassandra.output.batch.size.rows, then its value overrides the value of cassandra.output.batch.size.bytes. See the corresponding section of the Spark Cassandra Connector reference for more details.
One aspect of performance tuning is having the correct number of Spark partitions so that you reach good parallelism, but this really depends on your code, how many nodes are in the Cassandra cluster, etc.; see the sketch below.
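For illustration, here is a minimal Java sketch of repartitioning just before the write so that write parallelism matches the cluster; the input path, partition count, keyspace and table names are placeholders (not taken from the question), and spark is assumed to be an existing SparkSession.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Placeholder input: any Dataset<Row> that should be loaded into Cassandra.
Dataset<Row> records = spark.read().parquet("/path/to/input");

// Repartition before writing so the write parallelism matches the cluster;
// the right count depends on executors, cores and the Cassandra cluster size.
records.repartition(90)
       .write()
       .format("org.apache.spark.sql.cassandra")
       .option("keyspace", "my_keyspace")   // placeholder keyspace
       .option("table", "my_table")         // placeholder table
       .mode(SaveMode.Append)
       .save();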
P.S. Also, please note that configuration parameter names should start with spark.cassandra., not with plain cassandra. - if you specified them in the latter form, these parameters are ignored and the defaults are used.
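For example, here is a minimal sketch of building the session with properly prefixed spark.cassandra.* parameters; the application name, host and the moderated values below are illustrative assumptions, not a tuned recommendation.

import org.apache.spark.sql.SparkSession;

// Sketch only: correctly prefixed spark.cassandra.* parameters.
// The values are illustrative and still need tuning for your cluster.
SparkSession spark = SparkSession.builder()
        .appName("cassandra-bulk-load")
        .config("spark.cassandra.connection.host", "cassandra-host")        // placeholder host
        .config("spark.cassandra.output.concurrent.writes", "5")            // connector default
        .config("spark.cassandra.output.batch.size.bytes", "2048")
        .config("spark.cassandra.output.batch.grouping.key", "partition")
        .config("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
        .getOrCreate();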
Source: https://stackoverflow.com/questions/57865726/getting-busypoolexception-com-datastax-spark-connector-writer-queryexecutor-wh