How to get data from a specific partition in a Spark RDD?

耶瑟儿~ 2021-02-04 17:32

I want to access the data in a particular partition of a Spark RDD. I can get the address of a partition as follows:

myRDD.partitions(0)

But I want to g

1 answer
  •  灰色年华
    2021-02-04 18:03

    You can use mapPartitionsWithIndex as follows:

    // Create (1, 1), (2, 2), ..., (100, 100) dataset
    // and partition by key so we know what to expect
    val rdd = sc.parallelize((1 to 100) map (i => (i, i)), 16)
      .partitionBy(new org.apache.spark.HashPartitioner(8))
    
    val zeroth = rdd
      // If partition number is not zero ignore data
      .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator())
    
    // Check if we get expected results 8, 16, ..., 96
    assert(zeroth.keys.map(_ % 8 == 0).reduce(_ && _) && zeroth.count == 12)
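
If the goal is to pull one partition's contents back to the driver, the same filter can be wrapped in a small helper. This is only a sketch: `PartitionData` and `collectPartition` are hypothetical names that do not exist in Spark, and the local master and app name are assumptions for the sake of a runnable example.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionData {
  // Hypothetical helper (not part of Spark): returns the contents of
  // partition `index` as a local array on the driver.
  def collectPartition[T: ClassTag](rdd: RDD[T], index: Int): Array[T] = {
    require(index >= 0 && index < rdd.getNumPartitions,
      s"partition index $index is out of range")
    rdd
      // Keep the iterator of the requested partition, drop all others
      .mapPartitionsWithIndex(
        (idx, iter) => if (idx == index) iter else Iterator.empty[T],
        preservesPartitioning = true)
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only
    val conf = new SparkConf().setAppName("partition-data").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Same dataset as above: (1, 1), ..., (100, 100) hashed into 8 partitions
      val rdd = sc.parallelize((1 to 100).map(i => (i, i)), 16)
        .partitionBy(new HashPartitioner(8))

      val zeroth = collectPartition(rdd, 0)

      // Partition 0 should hold exactly the keys divisible by 8
      assert(zeroth.map(_._1).sorted.toList == (8 to 96 by 8).toList)
    } finally {
      sc.stop()
    }
  }
}
```

Note that this still schedules a task for every partition; the tasks for the other partitions simply return an empty iterator. Because the result is collected to the driver, it is only suitable when a single partition fits in driver memory.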
    
