Question
I have the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3],[1,2,1],[1,2,2],[2,2,2],[2,3,2],[2,4,2]],columns=["a","b","c"])
df = df.set_index("a")
df.groupby("a").mean()
df.groupby("a").std()
I want to standardize the data frame within each group key, NOT over the whole column vector.
So for this example the output would be:
a = 1:
Column: b
(2 - 2) / 0.0
(2 - 2) / 0.0
(2 - 2) / 0.0
Column: c
(3 - 2) / 1.0
(1 - 2) / 1.0
(2 - 2) / 1.0
And then I would get each value standardized per group.
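For reference, here is a minimal pandas sketch of what I mean (my own check, using groupby().transform(); groups with zero standard deviation, such as column b for a = 1, come out as NaN):
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [1, 2, 1], [1, 2, 2], [2, 2, 2], [2, 3, 2], [2, 4, 2]],
    columns=["a", "b", "c"],
)

# Per-group z-score: transform broadcasts each group's mean/std back to the
# original row order, so the result lines up with df row by row.
expected = df.groupby("a")[["b", "c"]].transform(lambda s: (s - s.mean()) / s.std())
print(expected)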
How can I do that in Spark?
Thanks
Answer 1:
With a Spark DataFrame:
sdf = spark.createDataFrame(df.reset_index())  # reset the index so 'a' comes through as a column
Imports:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
def z_score(c, w):
    # z-score of column c against its window's (group's) mean and stddev
    return (col(c) - mean(c).over(w)) / stddev(c).over(w)
Window:
w = Window.partitionBy("a")
Solution:
sdf.select("a", z_score("b", w).alias("b"), z_score("c", w).alias("c")).show()
+---+----+----+
|  a|   b|   c|
+---+----+----+
|  1|null| 1.0|
|  1|null|-1.0|
|  1|null| 0.0|
|  2|-1.0|null|
|  2| 0.0|null|
|  2| 1.0|null|
+---+----+----+
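The nulls appear wherever a group's standard deviation is 0 (dividing by zero yields null in Spark SQL). If 0.0 is preferred there, one possible variant (a sketch of my own, not part of the original answer, using a helper I call z_score_safe) guards the division with when/otherwise:
from pyspark.sql.functions import col, mean, stddev, when
from pyspark.sql.window import Window

def z_score_safe(c, w):
    # Hypothetical helper: fall back to 0.0 when the group's stddev is 0,
    # instead of returning null from a division by zero.
    sd = stddev(c).over(w)
    return when(sd != 0, (col(c) - mean(c).over(w)) / sd).otherwise(0.0)

w = Window.partitionBy("a")
sdf.select("a", z_score_safe("b", w).alias("b"), z_score_safe("c", w).alias("c")).show()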
Source: https://stackoverflow.com/questions/47250721/group-by-and-standardize-in-spark