How to access Spark RDD Array of elements based on index

情歌与酒 2020-12-07 04:29

I have an RDD holding an array of elements like the one below; each element can be treated as a tuple. I want to access only the 4th element of the first two tuples, and loop thr

1 answer
  • 2020-12-07 04:46

    How to access Spark RDD Array of elements based on index

    The answer is simply: don't try. RDDs are not indexed, and depending on the context, the order of values can be nondeterministic.
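If positional access is still needed, one option is `zipWithIndex`, which assigns indices by partition order and then by position within each partition. The following is only a sketch under stated assumptions: the object name `IndexSketch`, the helper `fourthOfFirstN`, and the local `SparkSession` setup are hypothetical and not part of the original answer, and the result is only meaningful when the order of the RDD is well defined.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object IndexSketch {
  // Sample rows copied from the question; the field names are unknown, so tuples are used as-is.
  val rows = Seq(
    (1, "Tom", "AAA", 2000),
    (2, "Tim", "AAA", 3000),
    (3, "Mark", "BBB", 6000),
    (4, "Jim", "BBB", 6000),
    (5, "James", "CCC", 4000))

  // Hypothetical helper: attach an index to every row, keep the first n rows,
  // and return only their 4th field.
  def fourthOfFirstN(rdd: RDD[(Int, String, String, Int)], n: Long): Array[Int] =
    rdd.zipWithIndex()
      .filter { case (_, idx) => idx < n }
      .map { case (row, _) => row._4 }
      .collect()

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-JVM session is enough for the illustration.
    val spark = SparkSession.builder().master("local[1]").appName("index-sketch").getOrCreate()
    val rdd = spark.sparkContext.parallelize(rows, numSlices = 2)
    println(fourthOfFirstN(rdd, 2).mkString(", "))
    spark.stop()
  }
}
```

For a small prefix, `rdd.take(2).map(_._4)` would do the same on the driver without indexing the whole RDD.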

    As far as I understand, what you want is simply a map followed by a sliding window:

    import org.apache.spark.mllib.rdd.RDDFunctions._
    
    // A dummy function: returns the smaller of the two values in each window
    def doSomething(xs: Array[Int]): Int = xs match {
      case Array(x1, x2) => if (x1 <= x2) x1 else x2
    }
    
    val rdd = sc.parallelize(Array(
        (1, "Tom", "AAA", 2000),
        (2, "Tim", "AAA", 3000),
        (3, "Mark", "BBB", 6000),
        (4, "Jim", "BBB", 6000),
        (5, "James", "CCC", 4000)))
    
    rdd.map(_._4).sliding(2).map(doSomething)
    

    The above of course assumes that the order of values is defined, or in other words, that the ancestor lineage doesn't include shuffled RDDs.
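To see what the map-and-sliding pipeline computes, it can help to run the same steps on an ordinary Scala collection, where the order is always defined. This is a minimal sketch, not the Spark code itself: the object name `SlidingSketch` is hypothetical, and `doSomething` is rewritten to take a `Seq` because local `sliding` yields sequences rather than arrays.

```scala
object SlidingSketch {
  // Same dummy logic as in the answer: keep the smaller value of each pair.
  def doSomething(xs: Seq[Int]): Int = xs match {
    case Seq(x1, x2) => if (x1 <= x2) x1 else x2
  }

  // Sample rows copied from the answer.
  val rows = List(
    (1, "Tom", "AAA", 2000),
    (2, "Tim", "AAA", 3000),
    (3, "Mark", "BBB", 6000),
    (4, "Jim", "BBB", 6000),
    (5, "James", "CCC", 4000))

  // Take the 4th field, pair each value with its successor, reduce each pair.
  val result: List[Int] = rows.map(_._4).sliding(2).map(doSomething).toList

  def main(args: Array[String]): Unit =
    println(result) // prints List(2000, 3000, 6000, 4000)
}
```

Five input values produce four overlapping windows, so the result has one element fewer than the input.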
