In Hadoop, when I use an InputFormat reader, the job-level logs report how many records were read, along with the byte count, etc.
In Spark, when I use the same kind of reader, I don't see those counts in the logs, so I call count() on the RDD myself.
Is count() a distributed function? Does each partition report its own count, with the counts then summed and reported? Or is the entire RDD brought to the driver and counted there?
count() is distributed. In Spark nomenclature, count() is an "action", and all actions execute in a distributed fashion. Really, only a handful of operations additionally bring all of the data back to the driver node, and those are generally well documented (e.g. take, collect, etc.).
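A minimal sketch of the difference, assuming a spark-shell session (so sc already exists) and a placeholder input path:

```scala
// count() is an action: each partition counts its own records in parallel,
// and only the per-partition subtotals (one Long each) are sent back to
// the driver, which sums them.
val rdd = sc.textFile("hdfs:///path/to/records")  // placeholder path
val n = rdd.count()

// collect(), by contrast, ships every record to the driver -- one of the
// few operations that does, so avoid it on large datasets.
// val everything = rdd.collect()
```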
After executing count(), will the RDD still be in memory, or do I have to explicitly cache it?
No, the data will not be in memory. If you want it to be, you need to explicitly cache the RDD before counting. Under Spark's lazy evaluation, no computation happens until an action is invoked, and no data is kept in memory after an action unless cache() (or persist()) was called.
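A sketch of what that means in practice, again assuming a spark-shell session and a placeholder path (the split on commas is just a stand-in for real parsing):

```scala
val parsed = sc.textFile("hdfs:///path/to/records")  // placeholder path
  .map(_.split(","))     // transformation: builds lineage, computes nothing yet

parsed.count()           // action: reads the file and runs the map now
parsed.count()           // without caching, the whole lineage runs again

parsed.cache()           // cache() is itself lazy: it only marks the RDD
parsed.count()           // this action materializes the in-memory copy
parsed.count()           // now answered from the cached partitions
```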
Is there a better way to do what I want to do, namely count the records before operating on them?
Cache, count, then operate seems like a solid plan.
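Sketched out, with placeholder paths and a trivial stand-in for the real processing:

```scala
val records = sc.textFile("hdfs:///path/to/records").cache()  // mark for caching (lazy)

val n = records.count()  // first action: computes AND caches the partitions
println(s"about to process $n records")

records                  // subsequent work is served from memory
  .map(_.toUpperCase)    // stand-in for the real per-record operation
  .saveAsTextFile("hdfs:///path/to/output")  // placeholder output path
```

One caveat: with the default storage level, if the cached RDD doesn't fit in memory, Spark drops partitions and recomputes them from lineage when needed, so the plan still works, just slower.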