Spark: increase number of partitions without causing a shuffle?

夕颜 2021-02-07 03:22

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage). Is there a way to do the opposite and increase the number of partitions without causing a shuffle?
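
A minimal sketch of the contrast, with illustrative numbers and assuming spark is an existing SparkSession:

    # coalesce merges partitions locally (no shuffle); repartition always shuffles
    df = spark.range(0, 1_000_000, numPartitions=100)
    fewer = df.coalesce(10)      # narrows to 10 partitions without a shuffle
    more = df.repartition(200)   # redistributes into 200 partitions via a full shuffle
    print(fewer.rdd.getNumPartitions(), more.rdd.getNumPartitions())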

3 Answers
  •  情书的邮戳
    2021-02-07 03:47

    As you know, PySpark uses lazy evaluation: it only performs the computation when an action is triggered (for example, df.count() or df.show()). So what you can do is set the shuffle-partition count differently between those actions.

    You can write:

    from pyspark.sql import functions as F

    sparkSession.sql("set spark.sql.shuffle.partitions=100")
    # your Spark code here, with some transformations and at least one action
    df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
    df.count()  # your action

    df = df.filter(df.B < 10)
    df.count()  # another action

    sparkSession.sql("set spark.sql.shuffle.partitions=10")
    # you reduce the number of shuffle partitions because you know the next
    # step will work on a lot less data
    df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
    df.count()  # your action
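
    Equivalently, the same setting can be changed through the runtime config API instead of a SQL "set" statement (a sketch, assuming sparkSession is the active SparkSession):

    sparkSession.conf.set("spark.sql.shuffle.partitions", "100")
    # ... transformations over the larger data, then an action ...
    sparkSession.conf.set("spark.sql.shuffle.partitions", "10")
    # ... transformations over the smaller data, then an action ...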
    
