You shouldn't iteratively merge distributed data structures without controlling the number of partitions: each pairwise union extends the RDD lineage, and you'll find complete explanations on Stack Overflow of the problems a long RDD lineage causes. Unfortunately DataFrames are slightly trickier:
dfs = ...  # A list of pyspark.sql.DataFrame

def unionAll(*dfs):
    if not dfs:
        raise ValueError("unionAll requires at least one DataFrame")
    first = dfs[0]
    # A single SparkContext.union keeps the plan flat, unlike
    # calling DataFrame.union repeatedly in a loop.
    return first.sql_ctx.createDataFrame(
        first._sc.union([df.rdd for df in dfs]),
        first.schema
    )

unionAll(*dfs)
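To see why the single n-ary union matters, here is a toy model (plain Python, not real Spark) of how the plan grows: reducing over a binary union yields a left-deep tree whose depth scales with the number of inputs, while one flat union node keeps the depth constant. All names here (`union2`, `depth`) are illustrative, not part of any Spark API.

```python
from functools import reduce

def union2(a, b):
    # Model of DataFrame.union: wraps both inputs in a new plan node.
    return ("Union", a, b)

def depth(plan):
    # Depth of the toy plan tree; leaves (strings) have depth 0.
    if isinstance(plan, tuple):
        return 1 + max(depth(child) for child in plan[1:])
    return 0

leaves = ["df%d" % i for i in range(100)]
chained = reduce(union2, leaves)   # pairwise union in a loop
flat = ("Union", *leaves)          # one n-ary union, like SparkContext.union

print(depth(chained))  # 99
print(depth(flat))     # 1
```

The chained plan is 99 levels deep for 100 inputs, which is the same linear growth that makes a long RDD lineage expensive; the flat plan stays at depth 1 no matter how many DataFrames you combine.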