Compare data in two RDDs in Spark

悲哀的现实 2021-02-10 05:47

I am able to print the data in two RDDs with the code below.

usersRDD.foreach(println)
empRDD.foreach(println)

I need to compare data in two RDDs. H

2 answers
  •  后悔当初
    2021-02-10 06:17

    Of course, the solutions above are complete and correct! Just one proposal, which applies if and only if the RDDs are synchronized (the same rows have the same keys): you can use a distributed solution and exploit parallelism by relying on Spark's own operations, as in the following solution:

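    import org.apache.spark.rdd.RDD

    def distrCompare(left: RDD[(Int, Int)], right: RDD[(Int, Int)]): Boolean = {
      // Pair up the values that share a key (keys present on only one side are dropped by join).
      val joined = left.join(right)
      // Keep only the keys whose values differ.
      val mismatches = joined.filter { case (_, (lv, rv)) => lv != rv }
      // isEmpty() is an action, so the job actually runs and the result reaches the driver.
      // Setting a local var inside map would not work: map is lazy and the closure runs on executors.
      mismatches.isEmpty()
    }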
    

    You can choose the number of partitions in "join" by passing it as the second argument, e.g. left.join(right, numPartitions).
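
    If the keys are not guaranteed to match on both sides, a full outer join is one way to also catch rows that are missing from either RDD. The following is only a minimal sketch, assuming keys are unique within each RDD and a local Spark session is acceptable; the object name CompareRdds, the helper sameContents, and the sample data are made up for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CompareRdds {
  // Hypothetical helper: true when both RDDs hold the same keys with the same values.
  // Assumption: each key appears at most once per RDD.
  def sameContents(left: RDD[(Int, Int)], right: RDD[(Int, Int)], numPartitions: Int): Boolean =
    left.fullOuterJoin(right, numPartitions)
      // A key missing on one side shows up as None, so it never equals Some(value).
      .filter { case (_, (lv, rv)) => lv != rv }
      // isEmpty() is an action: it triggers the job and returns the result to the driver.
      .isEmpty()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("compare-rdds").getOrCreate()
    val sc = spark.sparkContext

    // Illustrative sample data only.
    val users = sc.parallelize(Seq(1 -> 10, 2 -> 20, 3 -> 30))
    val reordered = sc.parallelize(Seq(3 -> 30, 1 -> 10, 2 -> 20))
    val missingKey = sc.parallelize(Seq(1 -> 10, 2 -> 20))

    println(sameContents(users, reordered, numPartitions = 4))  // true: same pairs, different order
    println(sameContents(users, missingKey, numPartitions = 4)) // false: key 3 exists on one side only

    spark.stop()
  }
}
```

    With duplicate keys the join yields every pairing of the values that share a key, so this sketch would need a different strategy in that case.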
