How to repartition evenly in Spark?

广开言路 2021-01-12 05:05

To test how .repartition() works, I ran the following code:

rdd = sc.parallelize(range(100))
rdd.getNumPartitions()

rdd.

1 Answer
  • 2021-01-12 05:36

    The logic behind repartition() tries to pick an effective way to redistribute data across partitions. In this case, your range is very small, and Spark doesn't find it worthwhile to break the data down further. If you were to use a much bigger range, like 100000, you would find that it does in fact redistribute the data.
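    For intuition, the shuffle behind repartition() deals each input partition's elements out round-robin across the target partitions, starting from a random offset to avoid skew. Below is a pure-Python sketch of that idea, not Spark's actual code (PySpark additionally moves serialized batches of elements rather than single elements, which is one reason tiny datasets can end up unevenly spread):

```python
import random

def round_robin_repartition(input_partitions, num_output):
    # Sketch of repartition()'s shuffle: each input partition deals its
    # elements out round-robin, starting at a random output index.
    out = [[] for _ in range(num_output)]
    for part in input_partitions:
        pos = random.randrange(num_output)  # random start avoids skew
        for item in part:
            out[pos % num_output].append(item)
            pos += 1
    return out

# 100 elements held in 2 input partitions, reshuffled into 10 partitions:
result = round_robin_repartition([list(range(50)), list(range(50, 100))], 10)
print([len(p) for p in result])  # every output partition receives 10 elements
```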

    If you want to force a certain number of partitions, you can specify it upon the initial load of the data. At that point, Spark will try to distribute the data evenly across partitions, even if that is not necessarily optimal. The parallelize function takes a second argument for the number of partitions:

        rdd = sc.parallelize(range(100), 10)
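    By contrast with the shuffle, parallelize with an explicit partition count slices the collection into contiguous, near-equal chunks up front. A rough pure-Python re-implementation of that slicing, for illustration only (the real logic lives inside Spark, in ParallelCollectionRDD):

```python
def slice_positions(length, num_slices):
    # Contiguous, near-equal chunks, in the spirit of Spark's
    # ParallelCollectionRDD slicing (illustrative, not the real code)
    return [(i * length // num_slices, (i + 1) * length // num_slices)
            for i in range(num_slices)]

data = list(range(100))
partitions = [data[start:end] for start, end in slice_positions(len(data), 10)]
print([len(p) for p in partitions])  # -> [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```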
    

    The same applies if you were to, say, read from a text file:

        rdd = sc.textFile('path/to/file', numPartitions)

    Note that for textFile the second argument is a minimum number of partitions (minPartitions), so Spark may create more partitions than you ask for.