Question
I have a process that reads a Hive (Parquet/Snappy) table and builds a dataset of ~2 GB. The process is iterative (~7K iterations), and the dataset is the same for every iteration, so I decided to cache it.
Somehow the caching task runs on only one executor, and the cache appears to live on that single executor alone, which leads to delays, OOM errors, etc.
Is this because of Parquet? How can I make sure the cache is distributed across multiple executors?
Here is the Spark config:
- Executors: 3
- Cores: 4
- Memory: 4 GB
- Partitions: 200
I tried repartition and adjusting the config, but no luck.
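A simplified sketch of the flow (the table name, app name, and storage level are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cached-dataset-job")                 // placeholder app name
  .config("spark.executor.instances", "3")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "4g")
  .config("spark.sql.shuffle.partitions", "200")
  .enableHiveSupport()
  .getOrCreate()

// Read the Hive (Parquet/Snappy) table, repartition so the cached blocks
// are spread across executors, then persist before the iterative loop.
val ds = spark.table("my_db.my_table")           // placeholder table name
  .repartition(200)
  .persist(StorageLevel.MEMORY_AND_DISK)

ds.count() // materialize the cache once, up front
```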
Answer 1:
For anyone who comes across this thread in the future, I have a similar experience to share. I was building an ML model with 400K rows and 20K features, stored in one 25 MB Parquet file. All the optimisations I tried with respect to partitions or executors failed to work: all the .fit calls were using only one executor. After struggling for a week, I broke the data into multiple file chunks of 500 rows each, and suddenly all the optimisations kicked in; I was able to train within a few minutes instead of the hours it took earlier.
Maybe a Spark expert can explain why this is the case, but if you are struggling with optimisations that seem to have no effect, this may work for you.
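One way to produce such chunks, sketched with placeholder paths (the maxRecordsPerFile write option, available since Spark 2.2, caps the rows per output file):

```scala
// Split one Parquet file into ~500-row chunks; paths are placeholders.
val df = spark.read.parquet("/data/features.parquet")

df.repartition(800)                  // ~400K rows / 500 rows per file
  .write
  .option("maxRecordsPerFile", 500)  // cap each output file at 500 rows
  .mode("overwrite")
  .parquet("/data/features_chunked")
```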
Answer 2:
I am answering my own question, but it is an interesting finding and worth sharing, as @thebluephantom suggested.
The situation was this: in my Spark code I was reading data from 3 Hive Parquet tables and building the dataset. In my case I was reading almost all columns from each table (approx. 502 columns), and Parquet is not ideal for that access pattern. The interesting thing was that Spark was not creating blocks (partitions) for my data and was caching the entire dataset (~2 GB) on just one executor.
Moreover, during my iterations, only one executor was doing all of the tasks.
Also, spark.default.parallelism and spark.sql.shuffle.partitions were not under my control; changing them had no effect. After converting the data to Avro format I could actually tune the partitions, shuffles, per-executor tasks, etc. as needed.
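A rough sketch of such a conversion (paths and table names are placeholders; format("avro") requires the spark-avro module on the classpath, shipped as an external module since Spark 2.4):

```scala
// Convert the Parquet-backed table to Avro; paths/table names are placeholders.
spark.table("my_db.my_table")
  .write
  .format("avro")
  .mode("overwrite")
  .save("/data/my_table_avro")

// Read the Avro copy back; repartitioning and shuffle settings now take effect.
val avroDs = spark.read.format("avro")
  .load("/data/my_table_avro")
  .repartition(200)
  .cache()
```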
Hope this helps! Thank you.
Source: https://stackoverflow.com/questions/53035778/spark-dataset-cache-is-using-only-one-executor