Calculate average using Spark Scala

爱一瞬间的悲伤 2021-01-28 09:29

How do I calculate the average salary per location in Spark Scala with the two data sets below?

File1.csv (column 4 is salary):

Ram, 30, Engineer, 40000
Bala, 27, Doctor, 30000
Hari, 33, Engineer, 50000
Siva, 35, Doctor, 60000

File2.csv (column 2 is location):

Hari, Bangalore
Ram, Chennai
Bala, Bangalore
Siva, Chennai

4 Answers
  • 2021-01-28 10:11

    I would use the DataFrame API; this should work:

    import org.apache.spark.sql.functions._
    import spark.implicits._ // for toDF on RDDs and the $ column syntax

    // Parse the raw lines, trimming the whitespace around each field
    val salary = sc.textFile("File1.csv")
                   .map(_.split(",").map(_.trim))
                   .map { case Array(name, _, _, salary) => (name, salary.toDouble) }
                   .toDF("name", "salary")

    val location = sc.textFile("File2.csv")
                     .map(_.split(",").map(_.trim))
                     .map { case Array(name, location) => (name, location) }
                     .toDF("name", "location")

    salary
      .join(location, Seq("name"))
      .groupBy($"location")
      .agg(
        avg($"salary").as("avg_salary")
      )
      .repartition(1)
      .write.csv("output.csv")
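
    If you also want a header row in the output, the same pipeline can set the header option before writing; coalesce(1) additionally avoids the full shuffle that repartition(1) triggers (a minor variation on the code above):

    salary
      .join(location, Seq("name"))
      .groupBy($"location")
      .agg(avg($"salary").as("avg_salary"))
      .coalesce(1) // one output partition without a full shuffle
      .write.option("header", "true").csv("output.csv")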
    
  • 2021-01-28 10:15

    You can read the CSV files as DataFrames, then join and group them to get the averages:

    val df1 = spark.read.csv("/path/to/file1.csv").toDF(
      "name", "age", "title", "salary"
    )
    
    val df2 = spark.read.csv("/path/to/file2.csv").toDF(
      "name", "location"
    )
    
    import org.apache.spark.sql.functions._
    
    val dfAverage = df1.join(df2, Seq("name")).
      groupBy(df2("location")).agg(avg(df1("salary")).as("average")).
      select("location", "average")
    
    dfAverage.show
    +---------+-------+
    | location|average|
    +---------+-------+
    |Bangalore|40000.0|
    |  Chennai|50000.0|
    +---------+-------+
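
    If you prefer SQL, the same join and aggregation can be written against temporary views (an equivalent sketch using the df1/df2 DataFrames above; dfAverageSql is just an illustrative name):

    // Register the DataFrames as temp views and express the aggregation in SQL
    df1.createOrReplaceTempView("salary")
    df2.createOrReplaceTempView("location")

    val dfAverageSql = spark.sql("""
      SELECT l.location, AVG(s.salary) AS average
      FROM salary s
      JOIN location l ON s.name = l.name
      GROUP BY l.location
    """)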
    

    [UPDATE] For calculating average dimensions (a fifth column of the form length*width):

    // file1.csv:
    Ram,30,Engineer,40000,600*200
    Bala,27,Doctor,30000,800*400
    Hari,33,Engineer,50000,700*300
    Siva,35,Doctor,60000,600*200
    
    // file2.csv
    Hari,Bangalore
    Ram,Chennai
    Bala,Bangalore
    Siva,Chennai
    
    val df1 = spark.read.csv("/path/to/file1.csv").toDF(
      "name", "age", "title", "salary", "dimensions"
    )
    
    val df2 = spark.read.csv("/path/to/file2.csv").toDF(
      "name", "location"
    )
    
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.IntegerType
    import spark.implicits._ // for the $ column syntax
    
    val dfAverage = df1.join(df2, Seq("name")).
      groupBy(df2("location")).
      agg(
        avg(split(df1("dimensions"), "\\*").getItem(0).cast(IntegerType)).as("avg_length"),
        avg(split(df1("dimensions"), "\\*").getItem(1).cast(IntegerType)).as("avg_width")
      ).
      select(
        $"location", $"avg_length", $"avg_width",
        concat($"avg_length", lit("*"), $"avg_width").as("avg_dimensions")
      )
    
    dfAverage.show
    +---------+----------+---------+--------------+
    | location|avg_length|avg_width|avg_dimensions|
    +---------+----------+---------+--------------+
    |Bangalore|     750.0|    350.0|   750.0*350.0|
    |  Chennai|     600.0|    200.0|   600.0*200.0|
    +---------+----------+---------+--------------+
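
    If you would rather see avg_dimensions without the trailing .0 (for example 750*350), you can cast the averages to integers before concatenating (a small variation on the select above; dfAvgInt is an illustrative name):

    // Cast truncates toward zero; wrap in round(...) first if you need rounding
    val dfAvgInt = dfAverage.select(
      $"location",
      concat(
        $"avg_length".cast(IntegerType), lit("*"),
        $"avg_width".cast(IntegerType)
      ).as("avg_dimensions")
    )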
    
  • 2021-01-28 10:17

    I would use DataFrames. First, read the files:

    val salary = spark.read.option("header", "true").csv("File1.csv")
    val location = spark.read.option("header", "true").csv("File2.csv")
    

    If you don't have headers, set the option to "false" and use toDF to replace the default column names:

    val salary = spark.read.option("header", "false").csv("File1.csv").toDF("name", "age", "job", "salary")
    val location = spark.read.option("header", "false").csv("File2.csv").toDF("name", "location")
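
    Alternatively, you can pass an explicit schema when reading, so salary arrives as a numeric type instead of a string (a sketch; the schema and column names here are assumptions matching the toDF calls above):

    import org.apache.spark.sql.types._

    // Explicit schema for File1.csv's four columns
    val salarySchema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType),
      StructField("job", StringType),
      StructField("salary", DoubleType)
    ))

    val salary = spark.read
      .schema(salarySchema)
      .option("ignoreLeadingWhiteSpace", "true") // the sample data has spaces after commas
      .csv("File1.csv")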
    

    Now do the join:

    val joined = salary.join(location, "name")
    

    Lastly, do the average calculation:

    import org.apache.spark.sql.functions.avg
    import spark.implicits._ // for the $ column syntax

    val avgDf = joined.groupBy("location").agg(avg($"salary").as("avg_salary"))
    

    To save, do:

    avgDf.repartition(1).write.csv("output.csv")
    
  • 2021-01-28 10:17

    You could do something like this:

    // Parse both files, trimming whitespace around each field
    val salary = sc.textFile("File1.csv").map(_.split(",").map(_.trim))
    val location = sc.textFile("File2.csv").map(_.split(",").map(_.trim))
    // Join on name, giving (name, (salary, location)), then re-key by location
    val joined = salary.map(e => (e(0), e(3).toInt)).join(location.map(e => (e(0), e(1))))
    val locSalary = joined.map(v => (v._2._2, v._2._1))
    // Accumulate (count, sum) per location, then divide for the (integer) average
    val averages = locSalary.aggregateByKey((0, 0))((t, e) => (t._1 + 1, t._2 + e),
            (t1, t2) => (t1._1 + t2._1, t1._2 + t2._2)).mapValues(t => t._2 / t._1)
    

    Then averages.take(10) will give:

    res5: Array[(String, Int)] = Array((Chennai,50000), (Bangalore,40000))
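
    Note that t._2 / t._1 is integer division, so each average is truncated to an Int (which happens to be exact for this data). If you need fractional averages, divide as Double in the final step (same pipeline, one line changed; exactAverages is an illustrative name):

    // Same aggregation, but divide as Double so the average is not truncated
    val exactAverages = locSalary.aggregateByKey((0, 0))(
        (t, e) => (t._1 + 1, t._2 + e),            // (count, sum) per partition
        (t1, t2) => (t1._1 + t2._1, t1._2 + t2._2) // merge partial results
      ).mapValues { case (count, sum) => sum.toDouble / count }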
    