Question
I'm running an application that loads data (.csv) from S3 into DataFrames and then registers those DataFrames as temp tables. After that, I use SparkSQL to join those tables and finally write the result into a DB. The issue that is currently the bottleneck for me is that the tasks are not evenly split, so I get no benefit from parallelization across the multiple nodes in the cluster. More precisely, this is the distribution of task durations in the problematic stage: [screenshot: task duration distribution]. Is there a way for me to enforce a more balanced distribution? Maybe by manually writing map/reduce functions? Unfortunately, this stage has 6 more tasks that are still running (1.7 hours at the moment), which will show even greater deviation.
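For reference, the pipeline looks roughly like the sketch below (the bucket paths, view names, join columns, and JDBC URL are placeholders, not the actual ones from my job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-join-example").getOrCreate()

    # Load the CSVs from S3 and register them as temp views for SparkSQL.
    orders = spark.read.csv("s3a://my-bucket/orders/*.csv",
                            header=True, inferSchema=True)
    customers = spark.read.csv("s3a://my-bucket/customers/*.csv",
                               header=True, inferSchema=True)
    orders.createOrReplaceTempView("orders")
    customers.createOrReplaceTempView("customers")

    # Join the temp views with SparkSQL.
    joined = spark.sql("""
        SELECT o.*, c.name
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
    """)

    # Write the result to the database over JDBC.
    joined.write.jdbc(
        url="jdbc:postgresql://db-host:5432/mydb",
        table="joined_result",
        mode="append",
        properties={"user": "user", "password": "secret"},
    )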
Answer 1:
There are two likely possibilities: one is under your control and, unfortunately, one likely is not.
- Skewed data. Check that the partitions are of relatively similar size - say within a factor of three or four (see the sketch after this list for one way to check).
- Inherent variability of Spark task runtimes. I have seen large delays from stragglers on Spark Standalone, YARN, and Mesos without an apparent reason. The symptoms are:
- extended periods (minutes) where little or no CPU or disk activity occurs on the nodes hosting the straggler tasks
- no apparent correlation of data size to the stragglers
- different nodes/workers may experience the delays on subsequent runs of the same job
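One quick way to check for skew, continuing the hypothetical sketch from the question (the DataFrame name "joined", the join column "customer_id", and the partition count of 200 are placeholders, not values from your job):

    from pyspark.sql.functions import spark_partition_id

    # Count rows per partition of the DataFrame feeding the slow stage.
    sizes = (joined
             .withColumn("pid", spark_partition_id())
             .groupBy("pid")
             .count()
             .orderBy("count", ascending=False))
    sizes.show(20)

    # If the largest partitions are several times bigger than the rest,
    # repartition on the join key (or to a higher partition count)
    # before the join/write so tasks get more even amounts of work.
    balanced = joined.repartition(200, "customer_id")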
One thing to check: run hdfs dfsadmin -report and hdfs fsck to see whether HDFS is healthy.
Source: https://stackoverflow.com/questions/37899448/spark-task-duration-difference