You shouldn't iteratively merge distributed data structures without controlling the number of partitions: each pairwise union extends the RDD lineage, and you'll find complete explanations on Stack Overflow of the problems a long RDD lineage causes. Unfortunately DataFrames are slightly trickier:
dfs = ...  # A list of pyspark.sql.DataFrame

def unionAll(*dfs):
    if not dfs:
        raise ValueError("unionAll requires at least one DataFrame")
    first = dfs[0]
    # A single SparkContext.union keeps the plan flat, unlike
    # calling DataFrame.union repeatedly in a loop.
    return first.sql_ctx.createDataFrame(
        first._sc.union([df.rdd for df in dfs]),
        first.schema
    )

unionAll(*dfs)
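To see why the single n-ary union matters, here is a toy model (plain Python, not real Spark) of how the plan grows: reducing over a binary union yields a left-deep tree whose depth scales with the number of inputs, while one flat union node keeps the depth constant. All names here (`union2`, `depth`) are illustrative, not part of any Spark API.

```python
from functools import reduce

def union2(a, b):
    # Model of DataFrame.union: wraps both inputs in a new plan node.
    return ("Union", a, b)

def depth(plan):
    # Depth of the toy plan tree; leaves (strings) have depth 0.
    if isinstance(plan, tuple):
        return 1 + max(depth(child) for child in plan[1:])
    return 0

leaves = ["df%d" % i for i in range(100)]
chained = reduce(union2, leaves)   # pairwise union in a loop
flat = ("Union", *leaves)          # one n-ary union, like SparkContext.union

print(depth(chained))  # 99
print(depth(flat))     # 1
```

The chained plan is 99 levels deep for 100 inputs, which is the same linear growth that makes a long RDD lineage expensive; the flat plan stays at depth 1 no matter how many DataFrames you combine.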