Spark Caused by: java.lang.StackOverflowError Window Function?

别那么骄傲 2021-01-20 16:37

Running into an error that I think is being caused by a Window function.

When I apply this script and persist just a few sample rows, it works fine; however, when I apply it to the full dataset, I run into a java.lang.StackOverflowError.

1 Answer
  •  清酒与你
    2021-01-20 17:13

    From the stack trace provided, I believe the error comes from the preparation of the execution plan, as it says:

    Caused by: java.lang.StackOverflowError
        at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:200)
    

    I believe the reason is that you call the method .withColumn twice in the loop. What .withColumn does in the Spark execution plan is basically a select of all columns, with one column changed as specified in the method. If you have 325 columns, then a single iteration will call select on 325 columns twice -> 650 columns passed into the planner. Doing this 325 times, you can see how it creates an overhead.
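
    For reference, the pattern that produces this behaviour presumably looks something like the sketch below. This is a hypothetical reconstruction based on your description, not your actual code; the loop body is assumed, and stringReplacer is taken to be the helper defined in your script:

    import sys
    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # hypothetical reconstruction of the original approach
    def forwardFillImputer(df, cols=[], partitioner="date", value="UNKNOWN"):
        window = Window \
            .partitionBy(F.month(partitioner)) \
            .orderBy(partitioner) \
            .rowsBetween(-sys.maxsize, 0)

        for i in cols:
            # each .withColumn call re-projects every column of df in the plan
            df = df.withColumn(i, stringReplacer(F.col(i), value))
            df = df.withColumn(i, F.last(F.col(i), ignorenulls=True).over(window))
        return df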

    It is very interesting, though, that you do not receive this error for a small sample; I would have expected otherwise.

    Anyway, you can try replacing your forwardFillImputer like this:

    import sys
    from pyspark.sql import Window
    import pyspark.sql.functions as F

    def forwardFillImputer(df, cols=[], partitioner="date", value="UNKNOWN"):
        # window from the start of each monthly partition up to the current row
        window = Window \
            .partitionBy(F.month(partitioner)) \
            .orderBy(partitioner) \
            .rowsBetween(-sys.maxsize, 0)

        # forward-fill: last non-null value (after stringReplacer) within the window
        imputed_cols = [F.last(stringReplacer(F.col(i), value), ignorenulls=True).over(window).alias(i)
                        for i in cols]

        # pass the columns that are not imputed through unchanged
        missing_cols = [F.col(i) for i in df.columns if i not in cols]

        # a single select -> a single projection in the execution plan
        return df.select(missing_cols + imputed_cols)
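
    For example, a call like the following (the column names "col_a" and "col_b" are placeholders, not from your question) replaces the whole loop with one projection:

    df_imputed = forwardFillImputer(df, cols=["col_a", "col_b"], partitioner="date")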
    

    This way you basically pass just a single select statement to the planner, which should be easier to handle.
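
    If you want to inspect what gets planned, you can print the logical and physical plans of the result (df_imputed refers to the call above):

    df_imputed.explain(True)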

    Just as a warning, Spark generally doesn't do well with a high number of columns, so you might see other strange issues along the way.
