How to flatten a list inside an RDD?

Asked 2020-12-30 05:07 by 你的背包

Is it possible to flatten a list inside an RDD? For example, convert:

 val xxx: org.apache.spark.rdd.RDD[List[Foo]]

to:

 val yyy: org.apache.spark.rdd.RDD[Foo]

3 Answers
  • 2020-12-30 05:23
    val rdd = sc.parallelize(Array(List(1,2,3), List(4,5,6), List(7,8,9), List(10, 11, 12)))
    // org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD ...
    
    val rddi = rdd.flatMap(list => list)
    // rddi: org.apache.spark.rdd.RDD[Int] = FlatMappedRDD ...
    
    // which is the same as rdd.flatMap(identity)
    // identity is a method defined in Predef object.
    //    def identity[A](x: A): A
    
    rddi.collect()
    // res2: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
    
  • 2020-12-30 05:26

    You just need to flatten it, but as there's no explicit 'flatten' method on RDD, you can do this:

    rdd.flatMap(identity)
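
    For reference, a minimal sketch of the same one-liner applied to the types from the question (the case class Foo, the sample values, and the SparkContext sc are assumptions used only for illustration):

    case class Foo(id: Int)

    val xxx: org.apache.spark.rdd.RDD[List[Foo]] =
      sc.parallelize(List(List(Foo(1), Foo(2)), List(Foo(3))))

    // flatMap with identity unwraps each inner List into the resulting RDD
    val yyy: org.apache.spark.rdd.RDD[Foo] = xxx.flatMap(identity)
    // yyy.collect()  => Array(Foo(1), Foo(2), Foo(3))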
    
  • 2020-12-30 05:26

    You could pimp the RDD class to attach a .flatten method (in order to follow the List API):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    object SparkHelper {
      // Note: RDD is invariant, so this implicit applies to RDDs typed as RDD[Seq[T]]
      implicit class SeqRDDExtensions[T: ClassTag](val rdd: RDD[Seq[T]]) {
        def flatten: RDD[T] = rdd.flatMap(identity)
      }
    }
    

    which, after bringing the implicit into scope, can then simply be used as:

    import SparkHelper._

    rdd.flatten
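
    For context, a minimal end-to-end sketch (the nested RDD, its sample values, and the SparkContext sc here are assumptions, not part of the original answer). Because RDD is invariant, the element type must be declared as Seq[T] for the implicit to apply:

    import org.apache.spark.rdd.RDD
    import SparkHelper._

    // Declare the element type explicitly as Seq[Int]: an RDD[List[Int]]
    // would not match RDD[Seq[T]] on its own, since RDD is invariant.
    val nested: RDD[Seq[Int]] = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6)))

    val flat: RDD[Int] = nested.flatten
    // flat.collect()  => Array(1, 2, 3, 4, 5, 6)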
    