Using Python's reduce() to join multiple PySpark DataFrames


Question


Does anyone know why using Python 3's functools.reduce() leads to worse performance when joining multiple PySpark DataFrames than iteratively joining the same DataFrames with a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of_dataframes,
)

whereas this one doesn't:

joined_df = list_of_dataframes[0]
joined_df.cache()
for right_df in list_of_dataframes[1:]:
    joined_df = joined_df.join(right_df, on=list_of_join_columns)

Any ideas would be greatly appreciated. Thanks!


Answer 1:


One reason is that a reduce or a fold is usually functionally pure: the result of each accumulation operation is not written to the same part of memory, but rather to a new block of memory.

In principle the garbage collector could free the previous block after each accumulation, but if it doesn't you'll allocate memory for each updated version of the accumulator.
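Here is a minimal, non-Spark sketch of that point, using plain Python lists as the accumulator (PySpark DataFrames are lazy, so this only illustrates the general allocation argument, not Spark's actual behavior):

from functools import reduce

chunks = [list(range(1000)) for _ in range(1000)]

# Functional fold: every step builds a brand-new accumulator object,
# so intermediate results pile up until the garbage collector reclaims them.
flat = reduce(lambda acc, chunk: acc + chunk, chunks, [])

# Imperative loop mutating a single accumulator in place:
# no per-step intermediate copies are created.
flat = []
for chunk in chunks:
    flat.extend(chunk)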




Answer 2:


As long as you use CPython, it makes no difference (different implementations can, but realistically shouldn't, exhibit significantly different behavior in this specific case). If you take a look at the reduce implementation, you'll see it is just a for loop with minimal exception handling.

The core is exactly equivalent to the loop you use:

for element in it:
    value = function(value, element)

and there is no evidence supporting claims of any special behavior.
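For reference, the functools documentation gives a roughly equivalent pure-Python implementation, which is just that loop plus handling of the optional initial value:

def reduce(function, iterable, initializer=None):
    it = iter(iterable)
    if initializer is None:
        value = next(it)
    else:
        value = initializer
    for element in it:
        value = function(value, element)
    return value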

Additionally, simple tests with a number of frames close to the practical limits of Spark joins (joins are among the most expensive operations in Spark)

# 200 DataFrames of 10,000 rows each, all with an "id" column to join on
dfs = [
    spark.range(10000).selectExpr(
        "rand({}) AS id".format(i), "id AS value", "{} AS loop".format(i)
    )
    for i in range(200)
]

show no significant difference in timing between the direct for loop

def f(dfs):
    df1 = dfs[0]
    for df2 in dfs[1:]:
        df1 = df1.join(df2, ["id"])
    return df1

%timeit -n3 f(dfs)                 
## 6.25 s ± 257 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

and the reduce invocation

from functools import reduce

def g(dfs):
    return reduce(lambda x, y: x.join(y, ["id"]), dfs) 

%timeit -n3 g(dfs)
## 6.47 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

Similarly, overall JVM behavior patterns are comparable between the for loop

[Screenshot: for-loop CPU and memory usage in VisualVM]

and reduce

[Screenshot: reduce CPU and memory usage in VisualVM]

Finally, both generate identical execution plans:

g(dfs)._jdf.queryExecution().optimizedPlan().equals( 
    f(dfs)._jdf.queryExecution().optimizedPlan()
)
## True

which indicates there is no difference once the plans are evaluated, which is when OOMs are likely to occur.

In other words, correlation doesn't imply causation here, and the observed performance problems are unlikely to be related to the method you use to combine the DataFrames.
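If you want to run the same sanity check on your own pipeline without reaching into the JVM internals, DataFrame.explain(True) prints the parsed, analyzed, optimized, and physical plans for a quick side-by-side comparison (a rough check only; the textual output can differ cosmetically between Spark versions):

# Print extended plans for both variants and compare them by eye.
f(dfs).explain(True)
g(dfs).explain(True)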



Source: https://stackoverflow.com/questions/44977549/using-pythons-reduce-to-join-multiple-pyspark-dataframes
