Question
I am hoping to generate an explain/execution plan in Spark 2.2 with some actions on a dataframe. The goal here is to ensure that partition pruning is occurring as expected before I kick off the job and consume cluster resources. I tried a Spark documentation search and a SO search here but couldn't find a syntax that worked for my situation.
Here is a simple example that works as expected:
scala> List(1, 2, 3, 4).toDF.explain
== Physical Plan ==
LocalTableScan [value#42]
Here's an example that does not work, but that I am hoping to get working:
scala> List(1, 2, 3, 4).toDF.count.explain
<console>:24: error: value explain is not a member of Long
List(1, 2, 3, 4).toDF.count.explain
^
And here's a more detailed example showing the end goal: the partition pruning that I am hoping to confirm via the explain plan.
val newDf = spark.read.parquet(df).filter(s"start >= ${startDt}").filter(s"start <= ${endDt}")
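Because filter is a lazy transformation, explain can be called on a DataFrame like newDf above before any job runs. A minimal sketch, assuming a hypothetical partitioned parquet path and a start partition column (the path, dates, and column name are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession

// Assumes a local session for illustration; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("pruning-check").getOrCreate()

val startDt = "2018-01-01"
val endDt   = "2018-01-31"

// filter is lazy, so building newDf triggers no job and consumes no cluster resources
val newDf = spark.read
  .parquet("/data/events")               // hypothetical path partitioned by `start`
  .filter(s"start >= '${startDt}'")
  .filter(s"start <= '${endDt}'")

// Prints the parsed, analyzed, optimized, and physical plans without executing anything;
// in the FileScan line, check the PartitionFilters entry to confirm pruning
newDf.explain(true)
```

If pruning is applied, the physical plan's FileScan node lists the predicates under PartitionFilters rather than PushedFilters.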
Thanks in advance for any thoughts/feedback.
Answer 1:
count is an action: it is eagerly evaluated and, as the error shows, returns a Long, so there is no execution plan to inspect. You have to use a lazy transformation instead, either:
import org.apache.spark.sql.functions.count
df.select(count($"*"))
or
df.groupBy().agg(count($"*"))
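Either lazy form returns a DataFrame rather than a Long, so explain can be chained directly onto it. A short sketch of the intended usage in spark-shell (the exact plan text varies by environment):

```scala
import org.apache.spark.sql.functions.count

// As in the question; spark.implicits._ is in scope in spark-shell
val df = List(1, 2, 3, 4).toDF

// Lazy equivalent of count: prints the plan without running a job.
// The physical plan typically shows HashAggregate stages over the LocalTableScan.
df.select(count($"*")).explain

// To actually materialize the count afterwards, trigger an action:
df.select(count($"*")).show()
```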
Source: https://stackoverflow.com/questions/50575490/spark-2-x-how-to-generate-simple-explain-execution-plan