In Spark, is counting the records in an RDD an expensive task?

深忆病人 2021-02-08 14:20

In Hadoop, when I use an InputFormat reader, the logs at the job level report how many records were read; they also display the byte count, etc.

In Spark, when I use the same input format, those metrics are not reported, so I am considering calling count() on the RDD before operating on it.

1 Answer
  • 2021-02-08 14:51

    Is it a distributed function? Will each partition report its count, with the counts summed and reported, or is the entire RDD brought into the driver and counted?

    count is distributed. In Spark nomenclature, count is an "action", and like all actions it executes on the executors: each partition counts its own records, and only the per-partition totals travel to the driver, where they are summed. Only a handful of operations bring all of the data to the driver node, and they are generally well documented (e.g. take, collect, etc.).
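    For illustration, here is a minimal sketch (Scala, local mode just for the sketch; the input path is hypothetical) contrasting the distributed count() with the actions that do pull data into the driver:

        import org.apache.spark.sql.SparkSession

        object CountIsDistributed {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("count-is-distributed")
              .master("local[*]")                       // assumption: local mode
              .getOrCreate()
            val sc = spark.sparkContext

            val rdd = sc.textFile("hdfs:///data/input") // hypothetical input path

            // count() runs on the executors: each partition counts its own
            // records, and only one Long per partition travels to the driver,
            // where the partial counts are summed.
            val n: Long = rdd.count()

            // take(10) fetches just enough partitions to return 10 records;
            // collect() would ship every record to the driver, which is the
            // pattern to avoid on a large RDD.
            val sample = rdd.take(10)

            println(s"records: $n, sampled: ${sample.length}")
            spark.stop()
          }
        }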

    After executing count(), will the RDD still remain in memory, or do I have to explicitly cache it?

    No, the data will not be in memory. If you want it to be, you need to explicitly cache the RDD before counting. Spark's lazy evaluation means no computation happens until an action is triggered, and no data is kept in memory after the action completes unless there was a cache (or persist) call beforehand.
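    A minimal sketch of those caching semantics, reusing the hypothetical sc and input path from the sketch above:

        // Transformations are lazy: nothing executes yet.
        val lines  = sc.textFile("hdfs:///data/input") // hypothetical path
        val parsed = lines.map(_.split('\t'))

        parsed.count()  // runs the job, but the computed partitions are discarded

        parsed.cache()  // marks the RDD for caching (MEMORY_ONLY); cache() is itself lazy
        parsed.count()  // the next action both computes and caches the partitions
        parsed.count()  // served from the cache; the input is not re-read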

    Is there a better way to do what I want to do, namely count the records before operating on them?

    Cache, then count, then operate seems like a solid plan; a sketch of that workflow follows.
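    A sketch of that plan, with hypothetical parsing and filtering standing in for the real processing:

        val records = sc.textFile("hdfs:///data/input") // hypothetical input
          .map(_.split('\t'))                           // hypothetical parsing
        records.cache()                                 // keep partitions in executor memory

        val total = records.count()                     // materializes and caches the RDD
        println(s"processing $total records")

        records                                         // later operations reuse the cache
          .filter(_.length > 1)                         // hypothetical predicate
          .map(_.mkString(","))
          .saveAsTextFile("hdfs:///data/output")        // hypothetical output path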
