Number of Partitions of Spark Dataframe

青春惊慌失措 · 2021-02-06 08:06

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD we can specify the number of partitions while creating it.

1 Answer

  日久生厌 · 2021-02-06 08:22

    You cannot, or at least not in the general case, but it is not that different from an RDD. For example, the textFile example code you've provided sets only a lower bound on the number of partitions.
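    A minimal sketch of that behaviour (Scala; "data.txt" is a hypothetical path and sc an existing SparkContext):

        // minPartitions is a lower bound, not an exact count; the actual
        // number of partitions depends on the input splits.
        val rdd = sc.textFile("data.txt", minPartitions = 4)
        println(rdd.getNumPartitions) // >= 4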

    In general:

    • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism (see the sketch after this list).
    • Datasets created from an RDD inherit the number of partitions from the parent RDD.
    • Datasets created using the data source API:

      • In Spark 1.x it typically depends on the Hadoop configuration (min / max split size).
      • In Spark 2.x a Spark SQL specific configuration (e.g. spark.sql.files.maxPartitionBytes) is used instead.
    • Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, the range of values, and the desired number of partitions.
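    To make the cases above concrete, here is a sketch for Spark 2.x (Scala; the JDBC URL, table name, and partition column are hypothetical placeholders):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("partition-count-demo")
          .master("local[4]") // spark.default.parallelism = 4 here
          .getOrCreate()
        import spark.implicits._

        // Locally generated Dataset: follows spark.default.parallelism.
        println(spark.range(0, 1000).rdd.getNumPartitions) // 4

        // Dataset created from an RDD: inherits the parent's partition count.
        val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 8)
        println(rdd.toDF("value").rdd.getNumPartitions) // 8

        // JDBC source: partition column, value range, and partition count
        // are set explicitly through reader options.
        val jdbcDF = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://host/db") // hypothetical URL
          .option("dbtable", "some_table")            // hypothetical table
          .option("partitionColumn", "id")
          .option("lowerBound", "0")
          .option("upperBound", "100000")
          .option("numPartitions", "10")
          .load()
        println(jdbcDF.rdd.getNumPartitions) // 10

    Note that the JDBC options partitionColumn, lowerBound, upperBound, and numPartitions must be used together: they tell Spark how to split the table scan into parallel range queries.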
