Spark task duration difference

Submitted by 强颜欢笑 on 2019-12-24 01:33:30

Question


I'm running an application that loads data (.csv) from S3 into DataFrames and then registers those DataFrames as temp tables. After that, I use Spark SQL to join those tables and finally write the result to a database. The current bottleneck is that the tasks are not evenly split, so I get no benefit from parallelization across the multiple nodes in the cluster. More precisely, this is the distribution of task durations in the problematic stage: [task duration distribution screenshot]. Is there a way for me to enforce a more balanced distribution? Maybe by manually writing map/reduce functions? Unfortunately, this stage has 6 more tasks that are still running (1.7 hours at the moment), which will make the deviation even greater.
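For reference, a minimal sketch of the pipeline described above (the S3 paths, column names, and JDBC connection details are placeholders, not taken from the original post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-join-example").getOrCreate()

# Load CSVs from S3 into DataFrames (placeholder bucket/paths)
orders = spark.read.csv("s3a://my-bucket/orders/*.csv", header=True, inferSchema=True)
customers = spark.read.csv("s3a://my-bucket/customers/*.csv", header=True, inferSchema=True)

# Register them as temp views so they can be joined with Spark SQL
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

# Join with Spark SQL, then write the result to a database over JDBC
result = spark.sql("""
    SELECT c.customer_id, c.name, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
""")

result.write.jdbc(
    url="jdbc:postgresql://db-host:5432/reports",  # placeholder connection details
    table="customer_totals",
    mode="overwrite",
    properties={"user": "spark", "password": "..."},
)
```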


Answer 1:


There are two likely possibilities: one is under your control and, unfortunately, the other likely is not.

  • Skewed data. Check that the partitions are of relatively similar size - say within a factor of three or four (see the sketch after this list for one way to check).
  • Inherent variability of Spark task runtimes. I have seen large delays in straggler tasks on Spark Standalone, YARN, and Mesos without any apparent reason. The symptoms are:
    • extended periods (minutes) where little or no CPU or disk activity occurs on the nodes hosting the straggler tasks
    • no apparent correlation between data size and the stragglers
    • different nodes/workers may experience the delays on subsequent runs of the same job
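A quick way to gauge partition skew is to count the rows in each partition; this is only a rough check, and the DataFrame name and join column below are placeholders:

```python
# Count rows per partition to spot skew; `df` stands for whichever
# DataFrame feeds the slow stage (placeholder name).
sizes = df.rdd.glom().map(len).collect()
print("partitions:", len(sizes), "min:", min(sizes), "max:", max(sizes))

# If the largest partition is many times the smallest, repartitioning on
# the join key (a hypothetical "customer_id" column here) may rebalance work.
balanced = df.repartition(200, "customer_id")
```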

One thing to check: run hdfs dfsadmin -report and hdfs fsck to see whether HDFS is healthy.



Source: https://stackoverflow.com/questions/37899448/spark-task-duration-difference
