Is there an “Explain RDD” in Spark?

情书的邮戳 2021-02-19 02:55

In particular, if I say

rdd3 = rdd1.join(rdd2)

then when I call rdd3.collect, depending on the Partitioner used, eit

2 Answers
  •  萌比男神i
    2021-02-19 03:34

    I think toDebugString will appease your curiosity.

    scala> val data = sc.parallelize(List((1,2)))
    data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21
    
    scala> val joinedData = data join data
    joinedData: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[11] at join at <console>:23
    
    scala> joinedData.toDebugString
    res4: String =
    (8) MapPartitionsRDD[11] at join at <console>:23 []
     |  MapPartitionsRDD[10] at join at <console>:23 []
     |  CoGroupedRDD[9] at join at <console>:23 []
     +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
     +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
    

    Each indentation level (the `+-` prefix) marks a shuffle boundary, i.e. a new stage, so this should run as two stages.
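    Since the question asks how the plan depends on the Partitioner: as a rough sketch (assuming a spark-shell session, so `sc` is in scope; the data here is made up), pre-partitioning both sides with the same `HashPartitioner` and caching them lets `join` reuse that partitioning, which you can verify in the `toDebugString` output:

    ```scala
    import org.apache.spark.HashPartitioner

    // Both RDDs hash-partitioned the same way, then cached so the
    // partitioned layout is actually reused across actions.
    val left  = sc.parallelize(List((1, "a"), (2, "b")))
                  .partitionBy(new HashPartitioner(8)).cache()
    val right = sc.parallelize(List((1, "x"), (2, "y")))
                  .partitionBy(new HashPartitioner(8)).cache()

    // Because both parents share a partitioner, the join has narrow
    // dependencies and needs no extra shuffle of its own.
    val joined = left join right
    println(joined.toDebugString)
    ```

    Without the `partitionBy` calls, the join would instead shuffle both inputs, which shows up as additional `+-` branches in the debug string.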

    Also, the optimizer is fairly decent; however, if you are on 1.3+ I would suggest using DataFrames, as the Catalyst optimizer there is even better in many cases :)
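    On the DataFrame side there is an actual `explain()` method, which is the closest thing to an "EXPLAIN" in Spark. A minimal sketch (assuming a Spark 1.3+ spark-shell, so `sqlContext` is in scope; the column names are made up):

    ```scala
    // Two tiny DataFrames built from local Seqs of tuples.
    val df1 = sqlContext.createDataFrame(Seq((1, 2))).toDF("k", "v")
    val df2 = sqlContext.createDataFrame(Seq((1, 3))).toDF("k", "w")

    val joinedDF = df1.join(df2, df1("k") === df2("k"))

    // explain(true) prints the parsed, analyzed, optimized, and
    // physical plans; explain() prints only the physical plan.
    joinedDF.explain(true)
    ```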
