How do I calculate the average salary per location in Spark Scala with the two data sets below?
File1.csv (column 4 is the salary):
Ram, 30, Engineer, 40000
You could do something like this (it assumes File2.csv has the employee name in column 1 and the location in column 2, so the two files can be joined on the name):
// split each line into trimmed fields
val salary = sc.textFile("File1.csv").map(_.split(",").map(_.trim))
val location = sc.textFile("File2.csv").map(_.split(",").map(_.trim))
// key both RDDs by name and join: (name, (salary, location))
val joined = salary.map(e => (e(0), e(3).toInt)).join(location.map(e => (e(0), e(1))))
// re-key by location: (location, salary)
val locSalary = joined.map(v => (v._2._2, v._2._1))
// accumulate (count, sum) per location, then divide to get the (integer) average
val averages = locSalary.aggregateByKey((0, 0))((t, e) => (t._1 + 1, t._2 + e),
  (t1, t2) => (t1._1 + t2._1, t1._2 + t2._2)).mapValues(t => t._2 / t._1)
Then averages.take(10) will give:
res5: Array[(String, Int)] = Array((Chennai,50000), (Bangalore,40000))
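Note that t._2 / t._1 is integer division, so the averages are truncated to whole numbers; divide by t._1.toDouble if you want fractional averages.

If you are on Spark 2.x or later, the same result can also be computed with the DataFrame API. This is only a sketch under the assumption that File1.csv holds name, age, designation, salary and File2.csv holds name, location; the column names below are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, trim}

// spark-shell already provides a SparkSession named `spark`;
// in a standalone application you would build one like this
val spark = SparkSession.builder.appName("AvgSalaryByLocation").getOrCreate()

// read both files without headers and name the columns ourselves
// (the column names are illustrative assumptions, not from the question)
val salaryDF = spark.read.csv("File1.csv").toDF("name", "age", "designation", "salary")
val locationDF = spark.read.csv("File2.csv").toDF("name", "location")

// join on the name, trim/cast the salary to a number, and average it per location
val averagesDF = salaryDF.join(locationDF, "name")
  .groupBy("location")
  .agg(avg(trim(col("salary")).cast("int")).as("avg_salary"))

averagesDF.show()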