Running into an error that I think is caused by the Window Function.
When I apply this script and persist just a few sample rows, it works fine; however, when I apply it to the whole dataset, it fails with the error below.
Judging by the stack trace provided, I believe the error comes from the preparation of the execution plan, since it says:
Caused by: java.lang.StackOverflowError
at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:200)
I believe the reason is that you call the method .withColumn twice inside the loop. In the Spark execution plan, .withColumn is essentially a select of all columns with one column replaced as specified in the method. With 325 columns, a single iteration therefore runs that select over all 325 columns twice, i.e. roughly 650 column expressions handed to the planner. Doing this for all 325 columns, you can see how much overhead this piles onto the plan.
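For illustration, here is a minimal sketch of the loop pattern I am assuming your current forwardFillImputer follows (the exact body is my guess; stringReplacer is your own helper and is assumed to be in scope):

import sys
from pyspark.sql import Window
import pyspark.sql.functions as F

def forwardFillImputer(df, cols=[], partitioner="date", value="UNKNOWN"):
    window = Window \
        .partitionBy(F.month(partitioner)) \
        .orderBy(partitioner) \
        .rowsBetween(-sys.maxsize, 0)
    for i in cols:
        # each withColumn adds another select over all columns to the plan,
        # so the plan keeps getting deeper with every loop iteration
        df = df.withColumn(i, stringReplacer(F.col(i), value))
        df = df.withColumn(i, F.last(F.col(i), ignorenulls=True).over(window))
    return df

By the end of such a loop the plan contains hundreds of nested projections, each over all 325 columns, which is what overflows the stack during SparkPlan.prepare.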
It is interesting, though, that you do not get this error on a small sample; I would have expected it there as well.
In any case, you can try replacing your forwardFillImputer with something like this:
import sys
from pyspark.sql import Window
import pyspark.sql.functions as F

def forwardFillImputer(df, cols=[], partitioner="date", value="UNKNOWN"):
    # one window per month, ordered by the partitioner column,
    # spanning all rows from the start of the partition up to the current row
    window = Window \
        .partitionBy(F.month(partitioner)) \
        .orderBy(partitioner) \
        .rowsBetween(-sys.maxsize, 0)
    # forward-fill each target column with the last non-null value in the window
    imputed_cols = [F.last(stringReplacer(F.col(i), value), ignorenulls=True).over(window).alias(i)
                    for i in cols]
    # pass the remaining columns through unchanged
    missing_cols = [F.col(i) for i in df.columns if i not in cols]
    return df.select(missing_cols + imputed_cols)
This way you hand the planner a single select statement, which should be much easier to handle.
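A minimal usage sketch, assuming a toy DataFrame with a date column and one categorical column to impute (the column names and values are made up, and the stringReplacer below is only my stand-in guess at what your helper does):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-in for your stringReplacer; I am assuming it maps the sentinel value to null
def stringReplacer(col, value):
    return F.when(col != value, col).otherwise(F.lit(None))

df = spark.createDataFrame(
    [("2019-01-01", "a"), ("2019-01-02", "UNKNOWN"), ("2019-01-03", "UNKNOWN")],
    ["date", "category"],
)

# 'category' should come back as a, a, a: the UNKNOWNs are turned into nulls
# and then forward-filled with the last non-null value within the month
forwardFillImputer(df, cols=["category"], partitioner="date").show()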
Just as a warning: Spark generally does not handle a very large number of columns well, so you may run into other strange issues along the way.