Question
I have a process that reads a Hive (Parquet/Snappy) table and builds a dataset of ~2 GB. The process is iterative (~7K iterations), and the dataset is the same for every iteration, so I decided to cache it.
Somehow the caching task runs on only one executor, and the cache appears to live on that single executor alone, which leads to delays, OOM errors, etc.
Is this because of Parquet? How can I make sure the cache is distributed across multiple executors?
Here is the Spark config:
- Executors: 3
- Cores: 4
- Memory: 4 GB
- Partitions: 200
I tried repartition and adjusting the config, but no luck.
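A simplified sketch of the flow (the table name, app name, and storage level are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cached-dataset-job")                 // placeholder app name
  .config("spark.executor.instances", "3")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "4g")
  .config("spark.sql.shuffle.partitions", "200")
  .enableHiveSupport()
  .getOrCreate()

// Read the Hive (Parquet/Snappy) table, repartition so the cached blocks
// are spread across executors, then persist before the iterative loop.
val ds = spark.table("my_db.my_table")           // placeholder table name
  .repartition(200)
  .persist(StorageLevel.MEMORY_AND_DISK)

ds.count() // materialize the cache once, up front
```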
Answer 1:
For anyone who comes across this thread in the future, I have a similar experience to share. I was building an ML model with 400K rows and 20K features, stored in one 25 MB Parquet file. All the optimisations I tried with respect to partitions or executors failed to work: all the .fit calls were using only one executor. After struggling for a week, I broke the data into multiple file chunks of 500 rows each, and suddenly all the optimisations kicked in; I was able to train within a few minutes instead of the hours it took earlier.
Maybe a Spark expert can explain why this is the case, but if you are struggling with optimisations that seem to have no effect, this may work for you.
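One way to produce such chunks, sketched with placeholder paths (the maxRecordsPerFile write option, available since Spark 2.2, caps the rows per output file):

```scala
// Split one Parquet file into ~500-row chunks; paths are placeholders.
val df = spark.read.parquet("/data/features.parquet")

df.repartition(800)                  // ~400K rows / 500 rows per file
  .write
  .option("maxRecordsPerFile", 500)  // cap each output file at 500 rows
  .mode("overwrite")
  .parquet("/data/features_chunked")
```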
Answer 2:
I am answering my own question, but it is an interesting finding and worth sharing, as @thebluephantom suggested.
The situation was this: in my Spark code I was reading data from 3 Hive Parquet tables and building the dataset. In my case I was reading almost all columns from each table (approx. 502 columns), and Parquet is not ideal for that access pattern. The interesting thing was that Spark was not creating blocks (partitions) for my data and was caching the entire dataset (~2 GB) on just one executor.
Moreover, during my iterations, only one executor was doing all of the tasks.
Also, spark.default.parallelism and spark.sql.shuffle.partitions were not under my control; changing them had no effect. After converting the data to Avro format I could actually tune the partitions, shuffles, per-executor tasks, etc. as needed.
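A rough sketch of such a conversion (paths and table names are placeholders; format("avro") requires the spark-avro module on the classpath, shipped as an external module since Spark 2.4):

```scala
// Convert the Parquet-backed table to Avro; paths/table names are placeholders.
spark.table("my_db.my_table")
  .write
  .format("avro")
  .mode("overwrite")
  .save("/data/my_table_avro")

// Read the Avro copy back; repartitioning and shuffle settings now take effect.
val avroDs = spark.read.format("avro")
  .load("/data/my_table_avro")
  .repartition(200)
  .cache()
```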
Hope this helps! Thank you.
Source: https://stackoverflow.com/questions/53035778/spark-dataset-cache-is-using-only-one-executor