[New to Spark]
After creating a DataFrame, I am trying to partition it based on a column in the DataFrame. When I check the partitioner using data_frame.rdd.partitioner, I get None.
That's to be expected. An RDD converted from a Dataset doesn't preserve the partitioner, only the data distribution. If you want to inspect the partitioner of the underlying RDD, you should retrieve it from the queryExecution:
scala> val df = spark.range(100).select($"id" % 3 as "id").repartition(42, $"id")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]
scala> df.queryExecution.toRdd.partitioner
res1: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@4be2340e)
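Compare that with the plain .rdd conversion on the same df, which reports no partitioner at all (continuing the same session; the res numbering is illustrative):

scala> df.rdd.partitioner
res2: Option[org.apache.spark.Partitioner] = None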
How can I change the partitioner?
In general you cannot. There is the repartitionByRange method (see the linked thread), but otherwise the Dataset Partitioner is not configurable.
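For completeness, a minimal sketch of repartitionByRange (available since Spark 2.3); the partition count and column here are illustrative, and the echoed output is what a typical shell session would print:

scala> val byRange = spark.range(100).repartitionByRange(10, $"id")
byRange: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> byRange.rdd.getNumPartitions
res3: Int = 10

This gives you range-based placement of rows across partitions, but it still doesn't let you plug in an arbitrary custom Partitioner; for that level of control you would have to drop down to the RDD API.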