There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:
Yes, a Spark application has one and only one Driver.
What is the relationship between `numWorkerNodes` and `numExecutors`?
A worker can host multiple executors. Think of the worker as a machine/node of your cluster, and of an executor as a process (using one or more cores) that runs on that worker.
So `numWorkerNodes <= numExecutors`.
Is there any ratio for them?
Personally, I have worked both in a fake cluster, where my laptop was the Driver and a virtual machine on the very same laptop was the worker, and in an industrial cluster of >10k nodes. In neither case did I need to care about that, since it seems that Spark takes care of it.
I just use:
--num-executors 64
when I launch/submit my script, and Spark knows, I guess, how many workers it needs to summon (of course, taking into account other parameters as well, and the nature of the machines).
Thus, personally, I don't know any such ratio.
Is there a known/generally-accepted/optimal ratio of `numDFRows` to `numPartitions`?
I am not aware of one, but as a rule of thumb you could take the number of executors times the number of cores per executor (`spark.executor.instances` * `spark.executor.cores`), and then multiply that by 3 or 4. Of course, this is a heuristic. In PySpark it would look like this:
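from pyspark import SparkContext
sc = SparkContext(appName="smeeb-App")
conf = sc.getConf()  # public accessor instead of the private sc._conf
total_cores = int(conf.get('spark.executor.instances')) * int(conf.get('spark.executor.cores'))
dataset = sc.textFile(input_path, total_cores * 3)  # input_path: location of your input data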
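To make the arithmetic of that heuristic explicit without needing a running cluster, here is a minimal sketch in plain Python. The function name `suggested_partitions` and the example cluster (64 executors with 4 cores each) are assumptions made up for illustration; only the "executors times cores times 3 or 4" rule comes from the answer above.

```python
def suggested_partitions(num_executors: int, cores_per_executor: int, factor: int = 3) -> int:
    """Heuristic only: total executor cores multiplied by a small factor (3 or 4)."""
    if num_executors < 1 or cores_per_executor < 1 or factor < 1:
        raise ValueError("all arguments must be positive integers")
    total_cores = num_executors * cores_per_executor
    return total_cores * factor


# Hypothetical cluster: 64 executors with 4 cores each.
print(suggested_partitions(64, 4))            # 64 * 4 * 3 = 768
print(suggested_partitions(64, 4, factor=4))  # 64 * 4 * 4 = 1024
```

The result is only a starting point for tuning, not a guaranteed optimum.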
How does one calculate the 'optimal' number of partitions based on the size of the `DataFrame`?
That's a great question. Of course, it's hard to answer, and it depends on your data, cluster, etc., but as discussed here with myself:
Too few partitions and you will have enormous chunks of data, especially when you are dealing with big data, thus putting your application under memory stress.
Too many partitions and you will put HDFS under a lot of pressure, since all the metadata that HDFS has to generate increases significantly as the number of partitions increases (since it maintains temp files, etc.). *
So what you want is to find a sweet spot for the number of partitions, which is one of the parts of fine-tuning your application. :)
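To get a feel for the two extremes described above, the sketch below simply divides a dataset size by a partition count. Everything in it is assumed for illustration: the helper name `avg_partition_size_mb`, the 100 GiB dataset, and the three partition counts. It assumes evenly sized partitions, which real data rarely gives you.

```python
def avg_partition_size_mb(dataset_size_mb: float, num_partitions: int) -> float:
    """Average partition size, assuming the data is spread evenly."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return dataset_size_mb / num_partitions


dataset_size_mb = 100 * 1024  # hypothetical 100 GiB dataset, expressed in MiB

# Very few partitions give huge chunks; very many give tiny chunks and many more tasks/files.
for n in (10, 1_000, 100_000):
    size = avg_partition_size_mb(dataset_size_mb, n)
    print(f"{n:>7} partitions -> about {size:,.1f} MiB per partition")
```

Neither end of that range is comfortable, which is why the partition count ends up being something you tune per application.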
The 'rule of thumb' is: `numPartitions = numWorkerNodes * numCpuCoresPerWorker`, is it true?
Ah, I was writing the heuristic above before seeing this. So this is already answered, but take into account the difference between a worker and an executor.
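A small sketch of why that difference matters: the worker-based formula counts every core on every machine, while the executor-based one counts only the cores your application was actually given. The cluster numbers below are hypothetical, and both function names are made up for this example.

```python
def partitions_from_workers(num_worker_nodes: int, cpu_cores_per_worker: int) -> int:
    """The rule of thumb from the question: one partition per core in the cluster."""
    return num_worker_nodes * cpu_cores_per_worker


def partitions_from_executors(num_executors: int, cores_per_executor: int) -> int:
    """The same idea, counting only the cores assigned to this application's executors."""
    return num_executors * cores_per_executor


# Hypothetical setup: 4 workers with 16 cores each, but the application
# asked for only 8 executors with 4 cores each.
print(partitions_from_workers(4, 16))   # 4 * 16 = 64
print(partitions_from_executors(8, 4))  # 8 * 4 = 32
```

When an application does not occupy the whole cluster, the two numbers diverge, and the executor-based count is the one that reflects how many tasks can really run at once.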
* I just got bitten by this today in "Prepare my bigdata with Spark via Python", where using too many partitions caused "Active tasks is a negative number in Spark UI".