How do I calculate the average salary per location in Spark Scala with the two data sets below?
File1.csv (column 4 is the salary):
Ram, 30, Engineer, 40000
File2.csv (column 2 is the location)
I would use DataFrames. First, read each file into a DataFrame:
val salary = spark.read.option("header", "true").csv("File1.csv")
val location = spark.read.option("header", "true").csv("File2.csv")
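As a side note, if you want Spark to infer column types up front (so that salary comes back numeric instead of as a string), you can add the inferSchema option. A small sketch; salaryTyped is just an illustrative name, and schema inference costs an extra pass over the file:

val salaryTyped = spark.read
  .option("header", "true")
  .option("inferSchema", "true")  // infer column types; salary becomes numeric
  .csv("File1.csv")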
If you don't have headers, you need to set the option to "false" and use toDF (or withColumnRenamed) to replace the default column names:
val salary = spark.read.option("header", "false").csv("File1.csv").toDF("name", "age", "job", "salary")
val location = spark.read.option("header", "false").csv("File2.csv").toDF("name", "location")
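To check that the columns came through as intended, it can help to inspect the schemas and a few rows before joining (purely a sanity check, not required for the computation):

salary.printSchema()    // should list name, age, job, salary
salary.show(5)          // print the first five rows
location.printSchema()  // should list name, location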
Now join the two DataFrames on the shared name column:
val joined = salary.join(location, "name")
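Passing the column name as a string gives an equi-join that keeps a single copy of name in the result. If the key columns were named differently in the two files (say a hypothetical emp_name in File2), you would spell the condition out instead:

// Hypothetical variant: join keys with different names in the two files
val joinedExplicit = salary.join(location, salary("name") === location("emp_name"), "inner")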
Lastly, do the average calculation. Note that the method is groupBy (not groupby), the $"..." syntax needs import spark.implicits._, avg comes from org.apache.spark.sql.functions, and the salary column is read from CSV as a string, so cast it before averaging. Also, naming the result avg would shadow the avg function, so use a different name:

import org.apache.spark.sql.functions.avg  // plus import spark.implicits._ for the $ syntax
val avgSalary = joined.groupBy("location").agg(avg($"salary".cast("double")).as("avg_salary"))
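Equivalently, if you are more comfortable in SQL, you can register the joined DataFrame as a temporary view and express the same aggregation there (a sketch; emp and avg_salary are just illustrative names):

joined.createOrReplaceTempView("emp")
val avgSql = spark.sql(
  "SELECT location, AVG(CAST(salary AS DOUBLE)) AS avg_salary FROM emp GROUP BY location")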
To save, do:

avgSalary.repartition(1).write.option("header", "true").csv("output.csv")

Keep in mind that Spark writes a directory named output.csv containing a single part file, not a bare CSV file, and that repartition(1) funnels everything through one task, which is fine for a small aggregate like this.
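Putting it all together, here is a minimal self-contained sketch, assuming headerless files with the column layout above (the object name, local master, and output path are all illustrative choices):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object AvgSalaryPerLocation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AvgSalaryPerLocation")
      .master("local[*]")  // for local testing; omit when submitting to a cluster
      .getOrCreate()
    import spark.implicits._

    // Read both files and assign column names, as above
    val salary = spark.read.option("header", "false")
      .csv("File1.csv").toDF("name", "age", "job", "salary")
    val location = spark.read.option("header", "false")
      .csv("File2.csv").toDF("name", "location")

    // Join on name, then average the (cast) salary per location
    val avgSalary = salary.join(location, "name")
      .groupBy("location")
      .agg(avg($"salary".cast("double")).as("avg_salary"))

    avgSalary.repartition(1).write
      .option("header", "true")
      .mode("overwrite")  // overwrite any previous run's output
      .csv("output.csv")  // Spark creates a directory with this name

    spark.stop()
  }
}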