Spark: Is “count” on Grouped Data a Transformation or an Action?

你的背包 2021-02-13 20:06

I know that count called on an RDD or a DataFrame is an action. But while fiddling with the Spark shell, I observed the following:

scala> val empDF = Seq((1,\"         


        
3 Answers
  • 2021-02-13 20:44

    As you've already figured out, if a method returns a distributed object (Dataset or RDD), it can be classified as a transformation.

    However, these distinctions suit RDDs much better than Datasets. Datasets go through an optimizer, including the recently added cost-based optimizer, and can be much less lazy than the old API, which blurs the difference between transformation and action in some cases.

    Here, however, it is safe to say that count is a transformation.
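
    The return-type rule above can be checked by the compiler. Below is a minimal sketch, assuming a local SparkSession and a hypothetical `employees` DataFrame (its name, columns, and rows are invented for illustration and are not the `empDF` from the question). It can be pasted into the Spark shell with `:paste`; the explicit type annotations only compile if each call returns what the answer says it does.

    ```scala
    import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}

    // Assumption: a throwaway local session; in spark-shell, getOrCreate() reuses the existing one.
    val spark = SparkSession.builder().master("local[*]").appName("count-return-types").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, not the data from the question.
    val employees: DataFrame = Seq(
      (1, "Ann", "HR"),
      (2, "Bob", "IT"),
      (3, "Cid", "IT")
    ).toDF("id", "name", "department")

    // groupBy only describes a grouping; it returns a RelationalGroupedDataset.
    val grouped: RelationalGroupedDataset = employees.groupBy("department")

    // count on the grouped data returns another DataFrame, so no result reaches the driver here.
    val perDepartment: DataFrame = grouped.count()

    // count on the DataFrame itself returns a Long, so the result is on the driver.
    val totalRows: Long = employees.count()
    ```

    The static types do not prove laziness by themselves, but they show why the two calls fall on different sides of the rule: one hands back a distributed object, the other a local value.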

  • 2021-02-13 20:49

    The .count() you used in your code is called on a RelationalGroupedDataset; it creates a new column holding the number of elements in each group. This is a transformation. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset

    The .count() that you normally call on an RDD/DataFrame/Dataset is a completely different method, and that .count() is an action. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD

    EDIT:

    To avoid this confusion in the future, use the count function inside .agg() whenever you operate on a grouped dataset:

    import org.apache.spark.sql.functions.count
    empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
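
    To see how the `.agg()` form relates to the grouped `.count()`, here is a minimal sketch comparing the two side by side. It assumes a local SparkSession and a hypothetical `employees` DataFrame (invented sample data, not the `empDF` from the question). Both expressions build a DataFrame; in this sketch only the `collect()` calls bring results to the driver.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.count

    // Assumption: a throwaway local session; in spark-shell, getOrCreate() reuses the existing one.
    val spark = SparkSession.builder().master("local[*]").appName("grouped-count-vs-agg").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, not the data from the question.
    val employees = Seq((1, "Ann", "HR"), (2, "Bob", "IT"), (3, "Cid", "IT"))
      .toDF("id", "name", "department")

    // Grouped count: the result column is named "count".
    val viaCount = employees.groupBy($"department").count()

    // Explicit aggregation: the result column gets the alias we choose.
    val viaAgg = employees.groupBy($"department").agg(count($"department") as "countDepartment")

    // collect() is the action; it materializes (department, count) pairs on the driver.
    val countRows = viaCount.collect().map(r => (r.getString(0), r.getLong(1))).toSet
    val aggRows = viaAgg.collect().map(r => (r.getString(0), r.getLong(1))).toSet
    ```

    The two results agree here because no department is null. In general, count applied to a column skips null values, whereas the grouped .count() counts rows, so the two can differ for a null group.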
    
  • 2021-02-13 21:00

    Case 1:

    You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the result to the driver, it's an action for an RDD.

    For example: rdd.count // returns a Long value

    Case 2:

    If you call count on a DataFrame, it initiates the DAG execution and returns the result to the driver, so it's an action for a DataFrame.

    For example: df.count // returns a Long value

    Case 3:

    In your case, you call groupBy on a DataFrame, which returns a RelationalGroupedDataset object, and then call count on that grouped dataset, which returns a DataFrame. It is therefore a transformation: it neither brings data to the driver nor initiates the DAG execution.

    For example:

     df.groupBy("department") // returns a RelationalGroupedDataset
              .count // returns a DataFrame, so a transformation
              .count // returns a Long, since it is called on a DataFrame, so an action
    