How to calculate mean and standard deviation given a PySpark DataFrame?

礼貌的吻别 2021-02-07 14:29

I have a PySpark DataFrame (not pandas) called df that is too large to call collect() on, so a collect()-based approach like the one below is not efficient.
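
For illustration (the original snippet was not preserved in this copy), here is a minimal sketch of the kind of collect()-based approach being referred to; the column name Sales and the use of Python's statistics module are assumptions:

import statistics

# Hypothetical reconstruction: pulls every row to the driver just to
# compute summary statistics, which does not scale to large DataFrames
sales = [row['Sales'] for row in df.select('Sales').collect()]
mean_sales = statistics.mean(sales)
stdev_sales = statistics.stdev(sales)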

3 Answers
  •  悲&欢浪女
    2021-02-07 15:19

    For the standard deviation, a better way to write this is shown below, using format_number to round the result to 2 decimal places and alias to give the output column a readable name:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import format_number, stddev

    # Build the session and read the CSV, inferring column types
    spark = SparkSession.builder.appName('Sales_fun').getOrCreate()
    data = spark.read.csv('sales_info.csv', inferSchema=True, header=True)

    # Sample standard deviation of Sales, rounded to 2 decimals
    data.select(format_number(stddev('Sales'), 2).alias('Sales_Stdev')).show()
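
    Since the question asks for the mean as well, both statistics can be computed in a single pass with agg. A short sketch, assuming the same Sales column:

    from pyspark.sql.functions import format_number, mean, stddev

    # Mean and (sample) standard deviation in one job,
    # without pulling any rows to the driver
    data.agg(
        format_number(mean('Sales'), 2).alias('Sales_Mean'),
        format_number(stddev('Sales'), 2).alias('Sales_Stdev')
    ).show()

    Alternatively, data.describe('Sales').show() returns count, mean, stddev, min, and max for the column in a single call.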
    
