Calculate average using Spark Scala

爱一瞬间的悲伤 2021-01-28 09:29

How do I calculate the average salary per location in Spark Scala with the two data sets below?

File1.csv (column 4 is the salary)

Ram, 30, Engineer, 40000
B…


        
4 Answers
  •  旧时难觅i
    2021-01-28 10:17

    I would use DataFrames. First, read the two files into DataFrames:

    val salary = spark.read.option("header", "true").csv("File1.csv")
    val location = spark.read.option("header", "true").csv("File2.csv")
    

    If your files don't have headers, set the option to "false" and name the columns with toDF (or rename the defaults one at a time with withColumnRenamed):

    val salary = spark.read.option("header", "false").csv("File1.csv").toDF("name", "age", "job", "salary")
    val location = spark.read.option("header", "false").csv("File2.csv").toDF("name", "location")
    
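    By default every CSV column is read as a string. If you would rather have salary parsed as a number at read time, Spark's inferSchema option does that; the snippet below is my addition, not part of the original answer:

    val salary = spark.read
      .option("header", "false")
      .option("inferSchema", "true") // salary is inferred as a numeric column instead of a string
      .csv("File1.csv")
      .toDF("name", "age", "job", "salary")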

    Now join the two DataFrames on name:

    val joined = salary.join(location, "name")
    
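    The join above is an inner join, so rows that appear in only one file are dropped. If you want to keep every employee from File1.csv even without a matching location, a left join is a minimal variant (my addition, not in the original answer):

    val joinedLeft = salary.join(location, Seq("name"), "left")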

    Lastly, group by location and compute the average. Note that the method is groupBy (camel case), the $"..." syntax requires spark.implicits._ outside the shell, and naming the result avg would shadow the avg function:

    import org.apache.spark.sql.functions.avg
    import spark.implicits._ // provides the $"..." column syntax

    val avgSalary = joined.groupBy("location").agg(avg($"salary"))
    
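    To sanity-check the result before writing it out:

    avgSalary.show() // one row per location with the computed average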

    To save, note that Spark writes "output.csv" as a directory of part files; repartition(1) collapses the output to a single part first:

    avgSalary.repartition(1).write.csv("output.csv")
    
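    Putting the steps together, a minimal self-contained sketch, assuming headerless input files and a local run (the master setting, app name, and column names are my assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    object AverageSalaryByLocation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("AverageSalaryByLocation")
          .master("local[*]") // assumption: local execution; drop when submitting to a cluster
          .getOrCreate()

        // Headerless CSVs, so name the columns explicitly
        val salary = spark.read
          .option("header", "false")
          .option("inferSchema", "true")
          .csv("File1.csv")
          .toDF("name", "age", "job", "salary")

        val location = spark.read
          .option("header", "false")
          .csv("File2.csv")
          .toDF("name", "location")

        // Inner join on name, then average salary per location
        val avgSalary = salary.join(location, "name")
          .groupBy("location")
          .agg(avg("salary").as("avg_salary"))

        avgSalary.repartition(1).write.option("header", "true").csv("output.csv")

        spark.stop()
      }
    }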
