Efficient pyspark join

慢半拍i 2020-11-29 11:35

I've read a lot about how to do efficient joins in PySpark. The ways I've found to achieve efficient joins are basically:

  • Use a broadcast join if you can (a minimal sketch is shown just below).
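
    As a minimal sketch of that first option (the table names and the emp_id key are made-up placeholders, not anything from this question), an explicit broadcast hint in PySpark can look like this:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast_join_demo").getOrCreate()

    large_df = spark.table("UDB.big_fact_table")   # hypothetical large table
    small_df = spark.table("UDB.small_dim_table")  # hypothetical small lookup table

    # Ship the small dataframe to every executor so the large one is not shuffled.
    joined = large_df.join(broadcast(small_df), on='emp_id', how='inner')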
2 Answers
  • 2020-11-29 12:20

    You can also use a two-pass approach, in case it suits your requirement. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table. This was nicely explained by Sim; see the link below.

    two pass approach to join big dataframes in pyspark

    Based on the case explained above, I was able to join the sub-partitions serially in a loop and then persist the joined data to a Hive table.

    Here is the code.

    from pyspark.sql.functions import col

    # Derive a partition key from emp_id (modulo 5) and persist both dataframes as ORC tables partitioned on that key.
    emp_df_1.withColumn("par_id", col('emp_id') % 5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
    emp_df_2.withColumn("par_id", col('emp_id') % 5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
    

    So, if you are joining on an integer emp_id, you can partition by the ID modulo some number. This redistributes the load across the Spark partitions, and records with similar keys are grouped together on the same partition. You can then loop through each sub-partition, read both sides, join the two dataframes, and persist the result.

    partition_count = 5  # matches the modulo used above (par_id ranges over 0..4)

    for counter in range(partition_count):
        query1 = "SELECT * FROM UDB.temptable_1 WHERE par_id={}".format(counter)
        query2 = "SELECT * FROM UDB.temptable_2 WHERE par_id={}".format(counter)
        df1 = spark.sql(query1).alias('df1')
        df2 = spark.sql(query2).alias('df2')
        # Join one sub-partition at a time and append it to the final table.
        innerjoin_EMP = df1.join(df2, df1.emp_id == df2.emp_id, 'inner').select('df1.*')
        innerjoin_EMP.show()
        # Note: UDB.temptable must already exist for insertInto to append to it.
        innerjoin_EMP.write.format('orc').insertInto("UDB.temptable")
    

    I have tried this and it works fine. This is just an example to demonstrate the two-pass approach; your join conditions may vary, and the number of partitions will also depend on your data size.

  • 2020-11-29 12:30

    Thank you @vikrantrana for your answer, I will try it if I ever need it. I say this because I found out the problem wasn't with the 'big' joins; the problem was the amount of calculation prior to the join. Imagine this scenario:

    I read a table and store it in a dataframe called df1. I read another table and store it in df2. Then I perform a huge amount of calculations and joins on both, and I end up with a join between df1 and df2. The problem here wasn't the size; the problem was that Spark's execution plan was huge, and it couldn't keep all the intermediate tables in memory, so it started writing to disk and it took a very long time.

    The solution that worked for me was to persist df1 and df2 to disk before the join (I also persisted other intermediate dataframes that were the result of big and complex calculations).
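
    As a hedged illustration of that fix (the dataframe names and the emp_id join key are placeholders, and heavy_pipeline_1/heavy_pipeline_2 stand in for the expensive upstream calculations), persisting both inputs to disk and forcing their materialization before the final join can look like this:

    from pyspark import StorageLevel

    # Placeholders for the expensive chains of calculations and joins.
    df1 = heavy_pipeline_1()
    df2 = heavy_pipeline_2()

    # Materialize both inputs on disk so the final join reads stored data
    # instead of re-evaluating the whole execution plan.
    df1 = df1.persist(StorageLevel.DISK_ONLY)
    df2 = df2.persist(StorageLevel.DISK_ONLY)
    df1.count()  # an action forces the persist to actually happen
    df2.count()

    result = df1.join(df2, on='emp_id', how='inner')

    An alternative with a similar effect is df.checkpoint(), which also cuts the lineage, provided a checkpoint directory has been set with spark.sparkContext.setCheckpointDir().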
