I am able to print the data in two RDDs with the code below.
usersRDD.foreach(println)
empRDD.foreach(println)
I need to compare the data in the two RDDs.
Of course the above solutions are complete and correct! Just one proposal, which applies if and only if the RDDs are synchronized (the same rows have the same keys): you can use a distributed solution and exploit parallelism by using only Spark transformations and a final action, via the following solution:
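import org.apache.spark.rdd.RDD

// Returns true when every key present in both RDDs maps to the same value.
// Note: join keeps only keys found on both sides, so a key missing from one RDD goes undetected.
def distrCompare(left: RDD[(Int, Int)], right: RDD[(Int, Int)]): Boolean = {
  // Pair up the values by key and compute the per-key difference.
  val diffs = left.join(right).map { case (k, (lv, rv)) => (k, lv - rv) }
  // Keep only the keys whose values differ.
  val mismatches = diffs.filter { case (_, v) => v != 0 }
  // isEmpty() is an action: it runs the job and returns the result to the driver.
  // Setting a driver-side var inside map would not work (map is lazy and runs on executors).
  mismatches.isEmpty()
}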
You can choose the number of partitions by passing a numPartitions argument to "join".
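One caveat with a join-based comparison is that an inner join only keeps keys present in both RDDs, so a row that exists on one side only is silently ignored. If that matters, a possible variant is sketched below. It is only a sketch: the name distrCompareStrict and its numPartitions parameter are made up for illustration, and it assumes keys are unique within each RDD. It uses fullOuterJoin, which wraps each side's value in an Option, so a missing key shows up as None and counts as a mismatch.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not from the original answer): true only when both RDDs
// contain the same keys with the same values. Assumes keys are unique per RDD.
def distrCompareStrict(left: RDD[(Int, Int)],
                       right: RDD[(Int, Int)],
                       numPartitions: Int): Boolean = {
  left
    // Each value becomes an Option; a key missing on one side yields None there.
    .fullOuterJoin(right, numPartitions)
    // Keep rows whose two sides differ, including Some(x) versus None.
    .filter { case (_, (lv, rv)) => lv != rv }
    // Action: no mismatching rows means the RDDs are equal.
    .isEmpty()
}
```

For example, distrCompareStrict(usersRDD, empRDD, 8) would run the comparison over 8 partitions, provided both RDDs are of type RDD[(Int, Int)].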