If we have a Pandas data frame consisting of a column of categories and a column of values, we can remove the mean in each category by doing the following:
df[\"
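For reference, a minimal sketch of the usual pandas way to subtract per-category means, using `groupby(...).transform("mean")`. The column names `Category` and `Values` are taken from the SQL query further down; the sample data and the `Demeaned` column are made up for illustration, and this is not necessarily the code the snippet above was going to show.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the post.
df = pd.DataFrame({
    "Category": ["a", "a", "b", "b", "b"],
    "Values": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# transform("mean") returns one value per original row (the mean of that
# row's category), aligned with df's index, so it can be subtracted directly.
category_mean = df.groupby("Category")["Values"].transform("mean")

# "Demeaned" is an illustrative column name, not one from the post.
df["Demeaned"] = df["Values"] - category_mean
```

Using `transform` rather than `agg` keeps the result the same length as `df`, which avoids a separate merge back onto the original rows.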
Actually, there is an idiomatic way to do this in Spark, using the Hive OVER clause, i.e.:
df.registerTempTable('df')
with_category_means = sqlContext.sql('select *, mean(Values) OVER (PARTITION BY Category) as category_mean from df')
Under the hood, this uses a window function. I'm not sure whether it is faster than your solution, though.