I have an RDD containing an array of elements like the one below; each element can be treated as a tuple. I want to access only the 4th element from the first two tuples.. and loop thr
How to access Spark RDD Array of elements based on index
The answer is simply: don't try. RDDs are not indexed, and depending on the context the order of values can be nondeterministic.
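If a position really is needed, one option is to attach it explicitly rather than expect the RDD to be indexable. The following is only a sketch under assumptions: the local `SparkSession` setup, the shortened data set, and the names `people` and `firstTwoSalaries` are made up for illustration, and the indices are only meaningful when the order of the RDD is itself well defined.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `sc` already exists.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("index-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical, shortened copy of the data from the question.
val people = sc.parallelize(Seq(
  (1, "Tom", "AAA", 2000),
  (2, "Tim", "AAA", 3000),
  (3, "Mark", "BBB", 6000)))

// zipWithIndex pairs every element with a Long position derived from
// partition order; the RDD itself still offers no random access.
val firstTwoSalaries = people
  .zipWithIndex()
  .filter { case (_, idx) => idx < 2 }   // keep positions 0 and 1
  .map { case (row, _) => row._4 }       // 4th field of each tuple
  .collect()
```

This still scans the whole RDD to keep two elements, so it is a workaround rather than real indexed access.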
As far as I understand, what you want is simply a map followed by a sliding window:
import org.apache.spark.mllib.rdd.RDDFunctions._
// A dummy function: returns the smaller of two adjacent values
def doSomething(xs: Array[Int]) = xs match {
  case Array(x1, x2) => if (x1 <= x2) x1 else x2
}

val rdd = sc.parallelize(Array(
  (1, "Tom", "AAA", 2000),
  (2, "Tim", "AAA", 3000),
  (3, "Mark", "BBB", 6000),
  (4, "Jim", "BBB", 6000),
  (5, "James", "CCC", 4000)))

// Take the 4th field of each tuple, then process each pair of neighbours
rdd.map(_._4).sliding(2).map(doSomething)
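To see what the windows look like without starting Spark, here is the same computation on a plain Scala collection. This is a stand-in, not the MLlib `sliding` itself: it uses the standard library's `sliding`, which for this input should yield the same pairs, and the names `salaries`, `windows` and `smaller` are made up for the example.

```scala
// Plain Scala collections stand in for the RDD here; no Spark needed.
val salaries = Seq(2000, 3000, 6000, 6000, 4000)

// Each window holds two neighbouring values, like the Array[Int]
// that doSomething receives.
val windows = salaries.sliding(2).toList
// List(List(2000, 3000), List(3000, 6000), List(6000, 6000), List(6000, 4000))

// Same logic as doSomething: keep the smaller value of each pair.
val smaller = windows.map { case Seq(x1, x2) => if (x1 <= x2) x1 else x2 }
// List(2000, 3000, 6000, 4000)
```

Five input values give four windows, so the result is always one element shorter than the input for a window of size 2.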
This approach of course assumes that the order of values is defined, or in other words that the ancestor lineage doesn't include shuffled RDDs.
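If the order cannot be taken for granted, one way to pin it down is to sort on a key before windowing. A sort is itself a shuffle, but its output order is defined by the key. The sketch below assumes the first tuple field is a unique id that expresses the intended order; the local session setup, the deliberately shuffled three-row input, and the names `unordered` and `result` are illustrative only.

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `sc` already exists.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("ordered-sliding")
  .getOrCreate()
val sc = spark.sparkContext

// Rows deliberately given out of order (hypothetical input).
val unordered = sc.parallelize(Seq(
  (3, "Mark", "BBB", 6000),
  (1, "Tom", "AAA", 2000),
  (2, "Tim", "AAA", 3000)))

val result = unordered
  .sortBy(_._1)   // assumed ordering key: the id in the first field
  .map(_._4)      // 4th field of each tuple
  .sliding(2)     // pairs of neighbours in sorted order
  .map { case Array(x1, x2) => if (x1 <= x2) x1 else x2 }
  .collect()
```

With unique ids the windows no longer depend on how the data happened to be partitioned upstream.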