I have a PySpark DataFrame with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.
df = spark.createDataFrame(...)
If you want the sum of a specific column:
import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]  # first Row of the result, first column of that Row
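For example, to land the value in a Python variable (a minimal sketch; the column name my_column is carried over from above):
import pyspark.sql.functions as F
total = df.agg(F.sum("my_column")).collect()[0][0]  # a Python int for integer columns (None if every value is NULL)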
The following should work:
df.groupBy().sum().rdd.map(lambda x: x[0]).collect()[0]  # collect() gives a one-element list; [0] unwraps the value
The simplest way, really:
df.groupBy().sum().collect()
But it is a very slow operation. Avoid groupByKey; use an RDD with reduceByKey instead:
df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]  # every row gets the same dummy key 1; assumes the numeric column is at index 1
I tried it on a bigger dataset and measured the processing time:
RDD and reduceByKey: 2.23 s
groupByKey: 30.5 s
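For what it's worth, RDD.sum() does the same reduction without the dummy-key trick (an alternative sketch, not from the answer above; it still assumes the numeric column sits at index 1):
total = df.rdd.map(lambda row: row[1]).sum()  # folds the values directly into a single number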
I think the simplest way:
df.groupBy().sum().collect()
will return a list. In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
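Equivalently, first() saves indexing into the collected list (a small variation on the above):
df.groupBy().sum().first()[0]  # first() returns the single aggregate Row; [0] is the sum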
This is another way you can do it, using agg and collect:
sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]  # the key is the auto-generated column name
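If you would rather not hard-code the generated "sum(Number)" key, an alias works too (a sketch using the same Number column):
import pyspark.sql.functions as F
result = df.agg(F.sum("Number").alias("total")).collect()[0]["total"]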
Sometimes when you read a CSV file into a PySpark DataFrame, a numeric column comes through as string type (e.g. '23'). In that case you should use pyspark.sql.functions.sum to get the result as an int, not the built-in sum():
import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()  # .show() only prints the result; use .collect() to capture it in a variable
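If the column really did load as strings, an explicit cast before aggregating is safer (a sketch; the column name Number is an assumption):
import pyspark.sql.functions as F
df_int = df.withColumn("Number", F.col("Number").cast("int"))  # cast string values like '23' to int
total = df_int.agg(F.sum("Number")).collect()[0][0]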