unionAll resulting in StackOverflow

北恋 2021-01-13 22:29

I've made some progress with my own question (how to load a dataframe from a Python requests stream that is downloading a CSV file?) on StackOverflow, but I'm receiving a

1 answer
  • 2021-01-13 23:01

    You shouldn't iteratively merge distributed data structures without controlling the number of partitions. A complete explanation of what is going on is given in "Stackoverflow due to long RDD Lineage", but unfortunately DataFrames are slightly trickier:

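    dfs = ...  # A list of pyspark.sql.DataFrame

    def unionAll(*dfs):
        # Union every DataFrame in one step instead of chaining pairwise unions.
        if not dfs:
            raise ValueError("unionAll() needs at least one DataFrame")
        first = dfs[0]
        # Union the underlying RDDs once, then rebuild a DataFrame
        # using the schema of the first input.
        return first.sql_ctx.createDataFrame(
            first._sc.union([df.rdd for df in dfs]), first.schema
        )

    unionAll(*dfs)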
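
To see why the single union above helps, here is a toy model in plain Python. It is not Spark code: `PlanNode`, `chained_union`, `flat_union` and `depth` are hypothetical names made up for illustration. It only shows how the nesting depth of the plan grows when unions are chained one by one, compared with one flat union over all inputs. Deep nesting is the usual explanation for the stack overflow, since recursive traversal of the plan needs one stack frame per level.

```python
class PlanNode:
    """Hypothetical stand-in for one step of a lineage or query plan."""

    def __init__(self, children=()):
        self.children = tuple(children)


def chained_union(leaves):
    # Mimics df = df.unionAll(other) in a loop: each step wraps the previous plan.
    plan = leaves[0]
    for leaf in leaves[1:]:
        plan = PlanNode((plan, leaf))
    return plan


def flat_union(leaves):
    # Mimics a single union over all inputs: one parent, many children.
    return PlanNode(leaves)


def depth(node):
    # Measured with an explicit stack so the measurement itself cannot overflow.
    deepest = 0
    stack = [(node, 1)]
    while stack:
        current, level = stack.pop()
        deepest = max(deepest, level)
        stack.extend((child, level + 1) for child in current.children)
    return deepest


leaves = [PlanNode() for _ in range(1000)]
print(depth(chained_union(leaves)))  # 1000
print(depth(flat_union(leaves)))     # 2
```

In this model the chained plan is as deep as the number of inputs, while the flat plan stays at depth 2 no matter how many inputs there are.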
    