Experts, I am noticing one peculiar thing with one of the PySpark jobs in production (running in YARN cluster mode). After executing for around an hour plus (around 65-75 minutes), the job seems to get stuck without any error or stack trace.
Without any apparent stack trace, it's a good idea to think of the problem from two angles: it's either a code issue or a data issue.
In either case, start by giving the driver abundant memory so you can rule that out as a probable cause. Increase spark.driver.memory and spark.driver.memoryOverhead until you've diagnosed the problem.
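For example, in YARN cluster mode the driver settings have to be passed at submit time; the sizes and the script name below are just placeholders (on Spark versions before 2.3 the overhead key is spark.yarn.driver.memoryOverhead):
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.driver.memory=8g \
  --conf spark.driver.memoryOverhead=2g \
  your_job.py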
Common code issues:
Too many transformations cause the lineage to get too big. If there is any kind of iterative operation happening on the dataframe, it's a good idea to truncate the DAG by doing a checkpoint in between. In Spark 2.x you can call dataFrame.checkpoint() directly and don't have to access the RDD. Also, @Sagar's answer describes how to do this for Spark 1.6.
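A minimal Spark 2.x (2.1+) sketch; the checkpoint directory is an illustrative path, point it at reliable storage such as HDFS:
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
# Eager by default: materializes the dataframe and cuts the lineage
df = df.checkpoint()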
Trying to broadcast dataframes that are too big. This will usually result in an OOM exception, but can sometimes just cause the job to seem stuck. The resolution is to not call broadcast if you are explicitly doing so. Otherwise, check whether you've set spark.sql.autoBroadcastJoinThreshold to some custom value and try lowering that value, or disable broadcast altogether (by setting it to -1).
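For instance (the 10 MB threshold below is only an illustrative value):
# Lower the automatic broadcast threshold while debugging...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
# ...or disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)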
Not enough partitions can cause every task to run hot. The easiest way to diagnose this is to check the stages view on the Spark UI and see the size of data being read and written per task. This should ideally be in the 100 MB-500 MB range. Otherwise, increase spark.sql.shuffle.partitions and spark.default.parallelism to higher values than the default 200.
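For example (the 800 below is a placeholder; size it so each task handles roughly 100-500 MB):
# Applies to DataFrame/SQL shuffles and can be changed at runtime
spark.conf.set("spark.sql.shuffle.partitions", 800)
# spark.default.parallelism is read when the SparkContext starts,
# so set it at submit time, e.g. --conf spark.default.parallelism=800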
Common data issues:
Data skew. Since your job is failing for a specific workload, it could have data skew in that specific job. Diagnose this by checking whether the median time for task completion is comparable to the 75th percentile, which in turn should be comparable to the 90th percentile, on the stage view in the Spark UI. There are many ways to redress data skew, but the one I find best is to write a custom join function that salts the join keys prior to the join. This splits the skewed partition into several smaller partitions at the expense of a constant-factor increase in data size.
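A minimal salted-join sketch (the dataframe names, the join_key column, and the SALT_BUCKETS value are all illustrative assumptions, not from your job):
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # tune to how badly the hot keys are skewed

# Skewed side: tag every row with a random salt in [0, SALT_BUCKETS)
salted_big = skewed_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Other side: replicate each row once per salt value so every salted key still matches
salted_small = other_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))

# Join on the original key plus the salt, then drop the helper column
joined = salted_big.join(salted_small, ["join_key", "salt"]).drop("salt")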
Input file format or number of files. If your input file isn't partitioned and you're only doing narrow transforms (those that do not cause a data shuffle), then all of your data will run through a single executor and not really benefit from the distributed cluster setup. Diagnose this from the Spark UI by checking how many tasks are getting created in each stage of the pipeline; it should be of the order of your spark.default.parallelism value. If not, do a .repartition(<some value>) immediately after the data read step, prior to any transforms. If the file format is CSV (not ideal), verify that you have multiLine disabled unless it is required in your specific case, because otherwise it forces a single executor to read the entire CSV file.
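For example (the path and partition count below are placeholders):
df = (spark.read
      .option("header", "true")
      .option("multiLine", "false")  # the default; only enable if rows genuinely span lines
      .csv("hdfs:///path/to/input"))

# Spread the data out before any transforms; roughly match spark.default.parallelism
df = df.repartition(200)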
Happy debugging!
Are you breaking the lineage? If not, the issue might be with the lineage. Can you try breaking the lineage somewhere in the middle of the code and see if it helps?
# Spark 1.6 code
sc.setCheckpointDir('.')
# df is the original dataframe you are performing transformations on
dfrdd = df.rdd
dfrdd.checkpoint()
# Rebuild the dataframe from the checkpointed RDD; the count() action materializes it
df = sqlContext.createDataFrame(dfrdd)
print(df.count())
Let me know if it helps.