> In particular, if I say `rdd3 = rdd1.join(rdd2)` then when I call `rdd3.collect`, depending on the Partitioner used, either …
I think `toDebugString` will appease your curiosity.
```scala
scala> val data = sc.parallelize(List((1,2)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21

scala> val joinedData = data join data
joinedData: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[11] at join at <console>:23

scala> joinedData.toDebugString
res4: String =
(8) MapPartitionsRDD[11] at join at <console>:23 []
 |  MapPartitionsRDD[10] at join at <console>:23 []
 |  CoGroupedRDD[9] at join at <console>:23 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
```
Each indentation level in that output is a separate stage (the `+-` marks a shuffle boundary), so this should run as two stages.
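To see how the Partitioner comes into play, here is a minimal sketch (the variable names and the `HashPartitioner(8)` are mine, just for illustration): if both inputs are already partitioned with the same partitioner, the join itself has only narrow dependencies, so no additional shuffle stage should show up in `toDebugString` for the join.

```scala
import org.apache.spark.HashPartitioner

// Pre-partition both RDDs with the same partitioner and cache them,
// so the shuffle for partitioning happens once, up front.
val left  = sc.parallelize(List((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(8)).cache()
val right = sc.parallelize(List((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(8)).cache()

// Because both inputs share a partitioner, keys are already co-located,
// and the join adds no further shuffle of its own.
val joined = left.join(right)
println(joined.toDebugString)
```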
Also, the optimizer is fairly decent; however, if you are on 1.3+ I would suggest using DataFrames, since the Catalyst optimizer there is even better in many cases :)
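As a rough sketch of the DataFrame route (1.3+ syntax; the column names here are made up for illustration), the same join goes through Catalyst, and `explain()` plays roughly the role `toDebugString` does for RDDs:

```scala
// Assumes a spark-shell style setup where sqlContext is available (Spark 1.3+).
import sqlContext.implicits._

val df1 = sc.parallelize(List((1, "a"), (2, "b"))).toDF("id", "left_val")
val df2 = sc.parallelize(List((1, "x"), (2, "y"))).toDF("id", "right_val")

// Catalyst plans this join; explain() prints the physical plan it chose.
val joinedDF = df1.join(df2, df1("id") === df2("id"))
joinedDF.explain()
```

The "even better" part is that Catalyst can, for example, push filters below the join or pick a broadcast join for small inputs without you restructuring the code.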