Spark: increase number of partitions without causing a shuffle?

夕颜 2021-02-07 03:22

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an a

3 Answers
  • 2021-02-07 03:36

    I do not exactly understand what your point is. Do you mean you now have 5 partitions, but after the next operation you want the data distributed across 10? Because having 10 but still using only 5 does not make much sense… The process of sending data to new partitions has to happen sometime.

    When doing coalesce, you can get rid of unused partitions. For example: if you initially had 100, but after reduceByKey you got 10 (as there were only 10 keys), you can apply coalesce.

    If you want the process to go the other way, you could just force some kind of partitioning:

    [RDD].partitionBy(new HashPartitioner(100))
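
To illustrate why merging partitions can avoid a shuffle while spreading data over more partitions cannot, here is a toy model in plain Python (not Spark code). The helpers coalesce and hash_partition are hypothetical names made up for this sketch, and the round-robin merge is a simplifying assumption; Spark's real coalesce picks its groupings differently. The hash routing mirrors the idea behind HashPartitioner (key hash modulo the number of partitions).

```python
def coalesce(partitions, n):
    """Toy shuffle-free coalesce: merge whole partitions into at most n groups."""
    n = min(n, len(partitions))  # without moving records, the count cannot grow
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each input partition is appended to exactly one output, as a unit.
        merged[i % n].extend(part)
    return merged


def hash_partition(partitions, n):
    """Toy hash partitioning: every record is routed individually by its key."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % n].append((key, value))
    return out


# Five partitions of (key, value) pairs, integer keys 0..19.
partitions = [[(k, k * k) for k in range(p, 20, 5)] for p in range(5)]

fewer = coalesce(partitions, 2)
more = hash_partition(partitions, 10)

print([len(p) for p in fewer])  # [12, 8]
print(len(more))                # 10
# Records of the first input partition now live in several output partitions:
print(sorted({hash(k) % 10 for k, _ in partitions[0]}))  # [0, 5]
```

In this model the first input partition ends up split across two output partitions, which is the per-record movement that a shuffle performs; the merge never splits an input partition.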
    

    I'm not sure that's what you're looking for, but hope so.

  • 2021-02-07 03:38

    Watch this space

    https://issues.apache.org/jira/browse/SPARK-5997

    This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.

  • 2021-02-07 03:47

    As you know, PySpark runs in a "lazy" way: it only does the computation when there is an action to perform (for example a "df.count()" or a "df.show()"). So what you can do is set the number of shuffle partitions between those actions.

    You can write:

    from pyspark.sql import functions as F

    # Use 100 partitions for the shuffles triggered by the code below
    sparkSession.sql("SET spark.sql.shuffle.partitions=100")
    # Your Spark code here, with some transformations and at least one action
    df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
    df.count()  # your action

    df = df.filter(df.B < 10)
    df.count()  # action; do not reassign df here, count() returns an int

    # Reduce the number of shuffle partitions because you know you will
    # have a lot less data from here on
    sparkSession.sql("SET spark.sql.shuffle.partitions=10")
    df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
    df.count()  # your action
    