Pandas-style transform of grouped data on PySpark DataFrame

悲&欢浪女 2021-02-07 10:24

If we have a Pandas data frame consisting of a column of categories and a column of values, we can remove the mean in each category by doing the following:

df[\"         


        
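The snippet above is cut off, so here is a minimal sketch of the pandas pattern being described, assuming a DataFrame with columns named "Category" and "Values" (both names are illustrative, not taken from the original post):

    import pandas as pd

    # Hypothetical example data; the column names are assumptions.
    df = pd.DataFrame({
        "Category": ["a", "a", "b", "b"],
        "Values": [1.0, 3.0, 10.0, 20.0],
    })

    # groupby().transform() returns a Series aligned with the original rows,
    # so the per-category mean can be subtracted element-wise.
    df["DemeanedValues"] = df["Values"] - df.groupby("Category")["Values"].transform("mean")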
3 Answers
  •  孤独总比滥情好
    2021-02-07 11:00

    You can use a Window to do this, i.e.:

    import pyspark.sql.functions as F
    from pyspark.sql.window import Window

    # Partition by the category column so the mean is computed per category
    # without collapsing the rows.
    window_var = Window().partitionBy('Category')

    # Subtract each category's mean from its values, keeping every original row.
    df = df.withColumn('DemeanedValues', F.col('Values') - F.mean('Values').over(window_var))
    

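    The window function is the Spark analogue of pandas groupby().transform(): it computes an aggregate over each partition while preserving row-level granularity, so no separate groupBy plus join is needed. For context, a self-contained sketch of how this could run end to end, assuming the same illustrative column names as above:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical example data; column names are assumptions.
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)],
        ["Category", "Values"],
    )

    window_var = Window().partitionBy("Category")
    df = df.withColumn("DemeanedValues", F.col("Values") - F.mean("Values").over(window_var))

    # Every input row is preserved, with its category mean subtracted.
    df.show()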