In particular, if I say rdd3 = rdd1.join(rdd2), then when I call rdd3.collect, depending on the Partitioner used, eit
I think toDebugString will satisfy your curiosity.
scala> val data = sc.parallelize(List((1,2)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21
scala> val joinedData = data join data
joinedData: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[11] at join at <console>:23
scala> joinedData.toDebugString
res4: String =
(8) MapPartitionsRDD[11] at join at <console>:23 []
 |  MapPartitionsRDD[10] at join at <console>:23 []
 |  CoGroupedRDD[9] at join at <console>:23 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
Each indentation is a stage, so this should run as two stages.
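If you would rather inspect the lineage programmatically than read the indentation, a rough sketch along these lines walks rdd.dependencies and counts the shuffle edges. The helper countShuffleDeps and the object name are made up for this illustration, and a local[2] master is assumed. It also shows the Partitioner effect the question asks about: when both sides already share a partitioner, join should be able to reuse it instead of shuffling again.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

object LineageSketch {
  // Hypothetical helper: counts the ShuffleDependency edges reachable from `rdd`,
  // expanding each RDD only once even if it appears several times in the lineage.
  def countShuffleDeps(rdd: RDD[_]): Int = {
    val seen = mutable.Set.empty[Int]
    def visit(r: RDD[_]): Int =
      if (!seen.add(r.id)) 0
      else r.dependencies.map { dep =>
        val edge = dep match {
          case _: ShuffleDependency[_, _, _] => 1
          case _                             => 0
        }
        edge + visit(dep.rdd)
      }.sum
    visit(rdd)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local context, just for looking at lineages.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(List((1, 2)))

      // Neither input has a partitioner, so the cogroup behind join shuffles both.
      val joined = data.join(data)
      println(joined.toDebugString)
      println(s"shuffle dependencies: ${countShuffleDeps(joined)}")

      // Both inputs share one partitioner, so join can reuse it; the only
      // shuffle left in the lineage is the partitionBy itself.
      val partitioned = data.partitionBy(new HashPartitioner(4))
      val coJoined = partitioned.join(partitioned)
      println(coJoined.toDebugString)
      println(s"shuffle dependencies: ${countShuffleDeps(coJoined)}")
    } finally {
      sc.stop()
    }
  }
}
```

Counting dependencies is not the same as counting stages (the scheduler decides those when a job is submitted), so treat the number as a hint about where the shuffle boundaries are.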
Also, the optimizer is fairly decent; however, I would suggest using DataFrames if you are on Spark 1.3+, as the optimizer there is EVEN better in many cases :)
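For the DataFrame route, the counterpart of toDebugString is explain. The sketch below assumes Spark 2.0 or later, where SparkSession is the entry point (on 1.3 you would go through SQLContext instead); the column names k, a and b and the helper buildJoin are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainSketch {
  // Hypothetical helper: joins two tiny DataFrames on a shared key column.
  def buildJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val left = Seq((1, 2)).toDF("k", "a")
    val right = Seq((1, 3)).toDF("k", "b")
    left.join(right, "k")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only used to print the plans.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("explain-sketch")
      .getOrCreate()
    try {
      // extended = true prints the parsed, analyzed and optimized logical plans
      // as well as the physical plan, so you can see what the optimizer changed.
      buildJoin(spark).explain(true)
    } finally {
      spark.stop()
    }
  }
}
```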
I would use the Spark UI (the web page served by the Spark context) instead of toDebugString whenever I can. It is much easier to comprehend and gives a bit more information (and fewer glitches, according to my very limited experience). Also, the Spark UI shows the number of tasks and their input and output sizes for each stage, which helps in figuring out what it does.
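In case it helps to find the page, here is a small sketch (assuming Spark 2.0+ and a local master; the object name and the description text are made up) that prints the UI address and labels the job so it is easier to spot on the Jobs page.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UiSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ui-sketch")
    val sc = new SparkContext(conf)
    try {
      // uiWebUrl is None when the UI is switched off (spark.ui.enabled=false).
      println(sc.uiWebUrl.getOrElse("Spark UI is disabled"))

      val data = sc.parallelize(List((1, 2)))
      // The description is attached to jobs submitted from this thread afterwards.
      sc.setJobDescription("self-join of a one-element pair RDD")
      println(data.join(data).collect().toList)
    } finally {
      sc.stop()
    }
  }
}
```

The page is only served while the application is running, so in a short script like this you would have to keep the context alive (or run the same lines in spark-shell) to actually browse it.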
Besides, both of them show very little information: mostly just a graph of boxes labelled MapPartitionsRDD [12] and the like, which doesn't tell you much about what that step actually does. (For WholeStageCodegen boxes, the DEBUG log under org.apache.spark.sql.execution at least contains the generated code. But no ID of any kind is logged to pair it with what you see in the Spark UI.)
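One way around digging through the DEBUG log, as far as I can tell, is the debug helpers in org.apache.spark.sql.execution.debug, which print each WholeStageCodegen subtree together with the Java source generated for it. A minimal sketch, assuming Spark 2.0+; the query and the helper buildQuery are made up, and spark.range is used because it takes part in whole-stage codegen.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.debug._

object CodegenSketch {
  // Hypothetical query: a filter and a projection over a generated range.
  def buildQuery(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.range(0, 10).filter($"id" > 2).selectExpr("id + 1 AS next")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only used to dump the generated code.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("codegen-sketch")
      .getOrCreate()
    try {
      // Prints every WholeStageCodegen subtree of the physical plan,
      // each followed by the generated source for that subtree.
      buildQuery(spark).debugCodegen()
    } finally {
      spark.stop()
    }
  }
}
```

Because the plan subtree and its code are printed next to each other, you can at least match the code to the operators by name, even without an ID.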