I have data like the following. Filename: babynames.csv.
year name percent sex
1880 John 0.081541 boy
1880 William 0.080511 boy
1880 James
Load data:
df = (sqlContext.read
    .format("com.databricks.spark.csv")
    .options(inferSchema="true", delimiter=";", header="true")
    .load("babynames.csv"))
Import required functions:
from pyspark.sql.functions import count, avg
Group by and aggregate (optionally use Column.alias):
df.groupBy("year", "sex").agg(avg("percent"), count("*"))
Alternatively:
- cast percent to numeric
- reshape to a format ((year, sex), percent)
- aggregateByKey using pyspark.statcounter.StatCounter
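To make the alternative route more concrete, here is a rough plain-Python sketch of what it computes: cast percent to a float, key each row by (year, sex), fold values into a per-key accumulator, then merge accumulators across partitions. It does not use Spark at all. RunningStats is a hypothetical, simplified stand-in for pyspark.statcounter.StatCounter (it tracks only count and mean), the helper names are made up for illustration, and the sample rows are invented.

```python
class RunningStats:
    """Hypothetical stand-in for StatCounter: tracks count and sum only."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def merge(self, value):
        # Per-record step: fold one value into the accumulator.
        self.n += 1
        self.total += value
        return self

    def merge_stats(self, other):
        # Cross-partition step: combine two accumulators.
        self.n += other.n
        self.total += other.total
        return self

    def mean(self):
        return self.total / self.n


def aggregate_partition(rows):
    """Cast percent, key by (year, sex), and fold within one partition."""
    acc = {}
    for year, name, percent, sex in rows:
        key = (year, sex)
        acc.setdefault(key, RunningStats()).merge(float(percent))
    return acc


def combine(left, right):
    """Merge the per-key accumulators of two partitions into `left`."""
    for key, stats in right.items():
        if key in left:
            left[key].merge_stats(stats)
        else:
            left[key] = stats
    return left


# Invented rows, split into two "partitions" for illustration.
part_a = [(1880, "A", "0.08", "boy"), (1880, "B", "0.06", "girl")]
part_b = [(1880, "C", "0.04", "boy")]

result = combine(aggregate_partition(part_a), aggregate_partition(part_b))
for key in sorted(result):
    print(key, result[key].n, result[key].mean())
```

In PySpark itself, the per-record and cross-partition steps would correspond to the two functions passed to aggregateByKey, with StatCounter.merge and StatCounter.mergeStats playing the roles of merge and merge_stats here.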