I have a PySpark DataFrame (not pandas) called df that is too large to use collect() on, so the code given below is not efficient.
For the standard deviation, a better way to write it is shown below. We can use format_number (to 2 decimal places) together with a column alias:
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev, format_number

data_agg = SparkSession.builder.appName('Sales_fun').getOrCreate()
data = data_agg.read.csv('sales_info.csv', inferSchema=True, header=True)

# The aggregation runs on the executors; only the single formatted row is shown
data.select(format_number(stddev('Sales'), 2).alias('Sales_Stdev')).show()
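If you also need the value back on the driver as a Python number (rather than just displaying it), a minimal sketch along the same lines, assuming the same data DataFrame as above, is to aggregate first and then read the single-row result with first(); only that one row is transferred, never the full DataFrame:

from pyspark.sql.functions import stddev

# Aggregate on the cluster, then pull back just the one-row result
row = data.select(stddev('Sales').alias('Sales_Stdev')).first()
sales_stdev = row['Sales_Stdev']
print(round(sales_stdev, 2))

This keeps the heavy computation distributed and avoids the cost of collect() that made the original approach inefficient.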