Efficient way of joining multiple tables in Spark - No space left on device

栀梦 2021-01-06 20:50

A similar question has been asked here, but it does not address my question properly. I have nearly 100 DataFrames, each with at least 200,000 rows a…

2 Answers
  • 2021-01-06 21:25

    First, try persisting your big df every N iterations with a for loop (which you probably already have).

    Second, try to control the shuffle partition number by setting sqlContext.sql("set spark.sql.shuffle.partitions=100") instead of the default of 200.

    Your code should look like:

    persist_interval = 10                    # persist every N joins
    big_df = spark.createDataFrame(...)      # empty df
    for i, df in enumerate(dfs, start=1):    # dfs: the list of DataFrames to join
        big_df = big_df.join(df, ....)

        if i % persist_interval == 0:
            big_df = big_df.persist()


    Here I call persist every 10 iterations; you can of course adjust that number according to the behavior of your job.

    EDIT: In your case you are persisting the local df_temp inside the rev function, but not the whole DataFrame that contains all the previous joins (df in your case). Since it is a local persist, it has no effect on the final execution plan. As for my suggestion: let's assume you need 100 joins in total; with the code above you would iterate through a loop [1..100] and persist the accumulated result every 10 iterations. After persisting the big DataFrame, the DAG contains fewer in-memory calculations, since the intermediate steps are stored and Spark knows how to restore them from storage instead of recalculating everything from scratch.
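    Worth noting: persist is lazy, so the cached intermediate is only materialized by the next action. A minimal sketch of how the loop could force that materialization and release older cached intermediates; the count() action, the unpersist call, and the join_all wrapper are my own additions, not part of the original answer:

    # Sketch (assumption, not the original answer's code): fold the DataFrames
    # together, materialize every Nth intermediate result with an action, and
    # release the previously cached one so executors don't hold stale copies.
    def join_all(big_df, dfs, join_cols, persist_interval=10):
        previous = None
        for i, df in enumerate(dfs, start=1):
            big_df = big_df.join(df, on=join_cols)

            if i % persist_interval == 0:
                big_df = big_df.persist()
                big_df.count()              # persist is lazy; an action forces the cached plan to compute
                if previous is not None:
                    previous.unpersist()    # drop the older cached intermediate
                previous = big_df

        return big_df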

  • 2021-01-06 21:29

    I've had a similar problem in the past, except I didn't have that many RDDs. The most efficient solution I could find was to use the low-level RDD API. First store all the RDDs so that they are (hash) partitioned and sorted within partitions by the join column(s): https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/OrderedRDDFunctions.html#repartitionAndSortWithinPartitions-org.apache.spark.Partitioner-
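    A minimal PySpark sketch of this preparation step, assuming the join is on a single column; the names dfs, num_parts, and the column name "id" are placeholders of mine, not from the original post:

    # Sketch: key every DataFrame's RDD by the join column, then hash-partition
    # all of them with the same partitioner and sort within each partition, so
    # matching keys end up in the same partition index across all RDDs.
    num_parts = 100

    pair_rdds = [df.rdd.keyBy(lambda row: row["id"]) for df in dfs]

    prepared = [
        rdd.repartitionAndSortWithinPartitions(numPartitions=num_parts)
        for rdd in pair_rdds
    ]

    The zipPartitions step described next is part of the Scala/Java RDD API (see the link below), so this sketch stops at the partitioning.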

    After this, the join can be implemented with zipPartitions, without shuffling or using much memory: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/RDD.html#zipPartitions-org.apache.spark.rdd.RDD-boolean-scala.Function2-scala.reflect.ClassTag-scala.reflect.ClassTag-
