Spark: Is “count” on Grouped Data a Transformation or an Action?

你的背包 2021-02-13 20:06

I know that count called on an RDD or a DataFrame is an action. But while experimenting in the spark shell, I observed the following:

scala> val empDF = Seq((1,\"         


        
3 Answers
  •  一向 (original poster)
    2021-02-13 20:49

    The .count() that you used in your code is called on a RelationalGroupedDataset. It returns a new DataFrame with a column holding the number of elements in each group. This is a transformation. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset

    The .count() that you normally call on an RDD/DataFrame/Dataset is a completely different method, and that .count() is an action. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD
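
    The difference shows up in the return types. Below is a minimal sketch, assuming a local SparkSession and a hypothetical two-column sample (the empDF from the question is truncated, so the data, the object name GroupedCountDemo and its helper methods are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupedCountDemo {
  // Hypothetical sample data; not the empDF from the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, "sales"), (2, "sales"), (3, "hr")).toDF("id", "department")
  }

  // count() on RelationalGroupedDataset: a transformation.
  // It only returns another (lazy) DataFrame with columns (department, count).
  def perDepartment(df: DataFrame): DataFrame =
    df.groupBy("department").count()

  // count() on a DataFrame: an action.
  // It runs a job and returns a Long to the driver.
  def total(df: DataFrame): Long =
    df.count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("grouped-count-demo")
      .getOrCreate()

    val empDF = sample(spark)
    val grouped = perDepartment(empDF) // nothing is computed yet
    grouped.show()                     // show() is the action that runs the job
    println(total(empDF))              // computed immediately

    spark.stop()
  }
}
```

    The grouped count only runs once an action such as show() or collect() is called on the resulting DataFrame.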

    EDIT:

    Always use count() inside .agg() when operating on a grouped Dataset, to avoid confusion in the future:

    import org.apache.spark.sql.functions.count  // already in scope in spark-shell
    empDF.groupBy($"department").agg(count($"department") as "countDepartment").show()
    
