Spark DataFrame partitioner is None

忘了有多久 2021-01-22 07:10

[New to Spark] After creating a DataFrame, I am trying to partition it based on a column in the DataFrame. When I check the partitioner using data_frame.rdd.partitioner, I get None.
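
A minimal sketch of the symptom in spark-shell (the DataFrame construction here is illustrative, not from the original question; printed types are approximate):

    scala> val data_frame = spark.range(100).repartition(42, $"id")
    data_frame: org.apache.spark.sql.Dataset[java.lang.Long] = [id: bigint]

    scala> data_frame.rdd.partitioner
    res0: Option[org.apache.spark.Partitioner] = None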

1 Answer
  • 2021-01-22 07:37

    That's to be expected. An RDD converted from a Dataset doesn't preserve the partitioner, only the data distribution.

    If you want to inspect the partitioner of the underlying RDD, you should retrieve it from the queryExecution:

    scala> val df = spark.range(100).select($"id" % 3 as "id").repartition(42, $"id")
    df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]
    
    scala> df.queryExecution.toRdd.partitioner
    res1: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@4be2340e)
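
    For contrast, going back through the public rdd accessor on the same Dataset drops that partitioner again, which is exactly the behavior observed in the question (expected output, continuing the session above):

    scala> df.rdd.partitioner
    res2: Option[org.apache.spark.Partitioner] = None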
    

    How can I change the partitioner?

    In general you cannot. There is the repartitionByRange method (see the linked thread), but otherwise the Dataset partitioner is not configurable.
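
    A sketch of repartitionByRange, continuing the session above (the partition count and column are illustrative): it requests a range-based data distribution, but there is still no user-supplied Partitioner object, and the rdd view still reports None:

    scala> val ranged = df.repartitionByRange(10, $"id")
    ranged: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]

    scala> ranged.rdd.partitioner
    res3: Option[org.apache.spark.Partitioner] = None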
