Pyspark - Aggregation on multiple columns

Asked by 鱼传尺愫 on 2021-02-13 03:15

I have data like below. Filename: babynames.csv.

year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James            


        
1 Answer

Answered by 鱼传尺愫, 2021-02-13 03:37

    1. Follow the instructions from the README to include the spark-csv package
    2. Load data

      df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(inferSchema="true", delimiter=";", header="true")
          .load("babynames.csv"))
      
    3. Import required functions

      from pyspark.sql.functions import count, avg
      
    4. Group by and aggregate (optionally use Column.alias):

      df.groupBy("year", "sex").agg(avg("percent"), count("*"))
      
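    Taken together, the steps above can be sketched end to end. This is a minimal sketch assuming Spark 2+, where the CSV reader is built in (so the spark-csv package is no longer needed) and SparkSession replaces sqlContext; a small in-memory stand-in for babynames.csv keeps the snippet self-contained:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, count

    spark = SparkSession.builder.master("local[*]").appName("babynames").getOrCreate()

    # In-memory stand-in for babynames.csv (column names taken from the question).
    df = spark.createDataFrame(
        [(1880, "John", 0.081541, "boy"),
         (1880, "William", 0.080511, "boy"),
         (1880, "Mary", 0.072381, "girl")],
        ["year", "name", "percent", "sex"])

    # Group by year and sex, then compute the mean percent and a row count,
    # naming the result columns with alias().
    agg_df = (df.groupBy("year", "sex")
                .agg(avg("percent").alias("avg_percent"),
                     count("*").alias("n")))
    agg_df.show()
    ```

    With a real file, the createDataFrame call would be replaced by spark.read.options(inferSchema="true", header="true").csv("babynames.csv").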

    Alternatively:

    • cast percent to numeric
    • reshape to a format ((year, sex), percent)
    • aggregateByKey using pyspark.statcounter.StatCounter
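    The alternative route can be sketched as below. This is also a hedged sketch with an in-memory stand-in for the data; StatCounter accumulates count, mean, min, max, and variance in a single pass over the RDD:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.statcounter import StatCounter

    spark = SparkSession.builder.master("local[*]").appName("babynames-rdd").getOrCreate()

    # In-memory stand-in for babynames.csv (column names taken from the question).
    df = spark.createDataFrame(
        [(1880, "John", 0.081541, "boy"),
         (1880, "William", 0.080511, "boy"),
         (1880, "Mary", 0.072381, "girl")],
        ["year", "name", "percent", "sex"])

    # Reshape each row to a ((year, sex), percent) pair, casting percent to float.
    pairs = df.rdd.map(lambda r: ((r["year"], r["sex"]), float(r["percent"])))

    # aggregateByKey: start from an empty StatCounter per key, fold values in
    # with merge(), and combine per-partition accumulators with mergeStats().
    stats = pairs.aggregateByKey(
        StatCounter(),
        lambda acc, x: acc.merge(x),
        lambda a, b: a.mergeStats(b))

    # Pull out just the count and mean per (year, sex) key.
    result = stats.mapValues(lambda s: (s.count(), s.mean())).collectAsMap()
    ```

    Compared with the DataFrame route, this pays the cost of leaving the optimized SQL engine, but one pass yields several statistics at once (stdev(), min(), max() are also available on each StatCounter).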
