In Hadoop, when I use an InputFormat reader, the job-level logs report how many records were read, along with the byte count, etc.
In Spark, when I use the same kind of reader, I don't see those counts in the logs, so I call count() on the RDD myself.
Is count() a distributed function? Does each partition report its own count, with the counts then summed and reported? Or is the entire RDD brought to the driver and counted there?
count() is distributed. In Spark nomenclature, count() is an "action", and all actions execute in a distributed fashion. Really, only a handful of operations additionally bring all of the data back to the driver node, and those are generally well documented (e.g. take, collect, etc.).
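A minimal sketch of the difference, assuming a spark-shell session (so sc already exists) and a placeholder input path:

```scala
// count() is an action: each partition counts its own records in parallel,
// and only the per-partition subtotals (one Long each) are sent back to
// the driver, which sums them.
val rdd = sc.textFile("hdfs:///path/to/records")  // placeholder path
val n = rdd.count()

// collect(), by contrast, ships every record to the driver -- one of the
// few operations that does, so avoid it on large datasets.
// val everything = rdd.collect()
```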
After executing count(), will the RDD still be in memory, or do I have to explicitly cache it?
No, the data will not be in memory. If you want it to be, you need to explicitly cache the RDD before counting. Under Spark's lazy evaluation, no computation happens until an action is invoked, and no data is kept in memory after an action unless cache() (or persist()) was called.
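A sketch of what that means in practice, again assuming a spark-shell session and a placeholder path (the split on commas is just a stand-in for real parsing):

```scala
val parsed = sc.textFile("hdfs:///path/to/records")  // placeholder path
  .map(_.split(","))     // transformation: builds lineage, computes nothing yet

parsed.count()           // action: reads the file and runs the map now
parsed.count()           // without caching, the whole lineage runs again

parsed.cache()           // cache() is itself lazy: it only marks the RDD
parsed.count()           // this action materializes the in-memory copy
parsed.count()           // now answered from the cached partitions
```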
Is there a better way to do what I want to do, namely count the records before operating on them?
Cache, count, then operate seems like a solid plan.
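Sketched out, with placeholder paths and a trivial stand-in for the real processing:

```scala
val records = sc.textFile("hdfs:///path/to/records").cache()  // mark for caching (lazy)

val n = records.count()  // first action: computes AND caches the partitions
println(s"about to process $n records")

records                  // subsequent work is served from memory
  .map(_.toUpperCase)    // stand-in for the real per-record operation
  .saveAsTextFile("hdfs:///path/to/output")  // placeholder output path
```

One caveat: with the default storage level, if the cached RDD doesn't fit in memory, Spark drops partitions and recomputes them from lineage when needed, so the plan still works, just slower.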