Spark: Is “count” on Grouped Data a Transformation or an Action?

你的背包 2021-02-13 20:06

I know that count called on an RDD or a DataFrame is an action. But while experimenting in the Spark shell, I observed the following:

scala> val empDF = Seq((1,\"         


        
3 answers
  •  情歌与酒
    2021-02-13 21:00

    Case 1:

    You use rdd.count() to count the number of rows. Since it triggers execution of the DAG and returns the result to the driver, it is an action on an RDD.

    For example: rdd.count // returns a Long value

    Case 2:

    If you call count on a DataFrame, it likewise triggers execution of the DAG and returns the result to the driver, so it is an action on a DataFrame.

    For example: df.count // returns a Long value

    Case 3:

    In your case you are calling groupBy on a DataFrame, which returns a RelationalGroupedDataset object, and then calling count on that grouped dataset, which returns a DataFrame. That count is a transformation: it neither triggers execution of the DAG nor brings any data to the driver.

    For example:

     df.groupBy("department") // returns a RelationalGroupedDataset
       .count                 // returns a DataFrame, so a transformation
       .count                 // called on a DataFrame, returns a Long, so an action
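
To make the three return types visible, here is a minimal sketch with explicit type annotations. The sample rows, the `id` and `name` columns, and the names `GroupedCountDemo` and `run` are hypothetical (the data from the question is not shown in full); only the `department` column comes from the example above. It assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}

object GroupedCountDemo {
  // Hypothetical helper: builds sample data and returns the number of groups.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    // Hypothetical sample data, not the data from the question.
    val df: DataFrame = Seq(
      (1, "Alice", "HR"),
      (2, "Bob", "IT"),
      (3, "Carol", "IT")
    ).toDF("id", "name", "department")

    // groupBy only describes the grouping; no job is started.
    val grouped: RelationalGroupedDataset = df.groupBy("department")

    // count on the grouped dataset is still lazy: it yields a DataFrame
    // with the columns "department" and "count".
    val perDepartment: DataFrame = grouped.count()

    // count on a DataFrame is an action: it runs a job and returns a Long.
    // The sample data has two distinct departments, so this is 2.
    perDepartment.count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouped-count-demo")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

If you paste the body of `run` into the Spark shell line by line, the first two assignments should return immediately, and only the final `count` should show up as a job in the Spark UI.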
    
