How to calculate mean and standard deviation given a PySpark DataFrame?

礼貌的吻别 2021-02-07 14:29

I have a PySpark DataFrame (not pandas) called df that is too large to call collect() on, so a collect()-based approach like the one below is not efficient.
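
For illustration (the original snippet was not preserved in this copy), here is a minimal sketch of the kind of collect()-based approach being referred to; the column name Sales and the use of Python's statistics module are assumptions:

import statistics

# Hypothetical reconstruction: pulls every row to the driver just to
# compute summary statistics, which does not scale to large DataFrames
sales = [row['Sales'] for row in df.select('Sales').collect()]
mean_sales = statistics.mean(sales)
stdev_sales = statistics.stdev(sales)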

3 Answers
  •  悲&欢浪女
    2021-02-07 15:19

    For the standard deviation, a better way to write this is shown below, using format_number to round the result to 2 decimal places and alias to give the output column a readable name:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import format_number, stddev

    # Build the session and read the CSV, inferring column types
    spark = SparkSession.builder.appName('Sales_fun').getOrCreate()
    data = spark.read.csv('sales_info.csv', inferSchema=True, header=True)

    # Sample standard deviation of Sales, rounded to 2 decimals
    data.select(format_number(stddev('Sales'), 2).alias('Sales_Stdev')).show()
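
    Since the question asks for the mean as well, both statistics can be computed in a single pass with agg. A short sketch, assuming the same Sales column:

    from pyspark.sql.functions import format_number, mean, stddev

    # Mean and (sample) standard deviation in one job,
    # without pulling any rows to the driver
    data.agg(
        format_number(mean('Sales'), 2).alias('Sales_Mean'),
        format_number(stddev('Sales'), 2).alias('Sales_Stdev')
    ).show()

    Alternatively, data.describe('Sales').show() returns count, mean, stddev, min, and max for the column in a single call.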
    
