PySpark - Sum a column in dataframe and return results as int

Asked by 执念已碎 on 2020-12-24 08:13 · 6 answers · 1390 views

I have a PySpark DataFrame with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "Number"])

6 Answers
  • 2020-12-24 08:23

    If you want a specific column:

    import pyspark.sql.functions as F     
    
    df.agg(F.sum("my_column")).collect()[0][0]
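
    For a tidier lookup, you can also alias the aggregate; a minimal sketch, assuming the question's Number column (the "total" alias is just an illustrative name):

    import pyspark.sql.functions as F

    total = df.agg(F.sum("Number").alias("total")).collect()[0]["total"]
    print(total, type(total))  # 130 <class 'int'> for an integer column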
    
  • 2020-12-24 08:25

    The following should work:

    df.groupBy().sum().rdd.map(lambda x: x[0]).collect()
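
    Since groupBy() with no keys aggregates the whole DataFrame, sum() yields a single row with one sum per numeric column, and the map pulls out the first of those. A sketch of what comes back, assuming the question's single numeric column:

    df.groupBy().sum().rdd.map(lambda x: x[0]).collect()             # [130] -- a one-element list
    total = df.groupBy().sum().rdd.map(lambda x: x[0]).collect()[0]  # 130 as a Python int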
    
  • 2020-12-24 08:27

    The simplest way, really:

    df.groupBy().sum().collect()
    

    But it is a very slow operation. Avoid groupByKey; instead, drop down to the RDD and use reduceByKey:

    df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]  # x[1] is the Number column (index 1 in the question's schema)
    

    I tried both on a bigger dataset and measured the processing time:

    RDD and reduceByKey: 2.23 s

    GroupByKey: 30.5 s
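
    For reference, a hypothetical harness for reproducing this kind of comparison (the synthetic DataFrame, its size, and the column name are illustrative, not the answerer's actual setup):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    big = spark.range(10_000_000).withColumnRenamed("id", "Number")  # one numeric column

    t0 = time.time()
    big.rdd.map(lambda x: (1, x[0])).reduceByKey(lambda a, b: a + b).collect()  # x[0]: only column
    print("reduceByKey:", time.time() - t0)

    t0 = time.time()
    big.groupBy().sum().collect()
    print("groupBy().sum():", time.time() - t0)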

  • 2020-12-24 08:28

    I think the simplest way:

    df.groupBy().sum().collect()
    

    will return a list. In your example:

    In [9]: df.groupBy().sum().collect()[0][0]
    Out[9]: 130
    
  • 2020-12-24 08:38

    Another way to do this, using agg and collect:

    sum_number = df.agg({"Number":"sum"}).collect()[0]
    
    result = sum_number["sum(Number)"]
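
    The dictionary form auto-names the result column "sum(Number)". An equivalent one-liner using first() instead of collect()[0] (a sketch; the same Number column is assumed):

    result = df.agg({"Number": "sum"}).first()["sum(Number)"]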
    
  • 2020-12-24 08:40

    Sometimes when you read a CSV file into a PySpark DataFrame, a numeric column comes back as string type (e.g. '23'). In that case you should use pyspark.sql.functions.sum to get the result as a number, not the built-in sum():

    import pyspark.sql.functions as F                                                    
    df.groupBy().agg(F.sum('Number')).show()
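
    If the column really arrives as strings, you can also cast it explicitly before summing (a sketch; the "Number" name follows the question, and "int" could be "double" for decimal data):

    import pyspark.sql.functions as F

    total = df.agg(F.sum(F.col("Number").cast("int"))).collect()[0][0]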
    