Commonly I see `Dataset.count` throughout codebases in three scenarios:

1. Logging: `log.info(s"this ds has ${dataset.count} rows")`
2. Control flow, e.g. branching on whether a dataset is empty
3. Forcing evaluation of a `cache`
TL;DR: 1) and 2) can usually be avoided but shouldn't harm you (ignoring the cost of evaluation); 3) is typically a harmful cargo cult programming practice.
Without cache
Calling `count` alone is mostly wasteful. While not always straightforward, logging can be replaced with information retrieved from listeners (here is an example for RDDs), and control flow requirements can usually (though not always) be addressed with a better pipeline design.
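As a rough sketch of the listener approach, assuming Spark 2.x+ and a job that actually writes output (`recordsWritten` is only populated by output tasks, so this is not a drop-in replacement for every `count`; the output path is a placeholder):

```scala
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulate records written by output tasks instead of running a separate count job.
val recordsWritten = new LongAdder
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    // taskMetrics can be null for some failed tasks
    Option(taskEnd.taskMetrics).foreach { m =>
      recordsWritten.add(m.outputMetrics.recordsWritten)
    }
})

dataset.write.parquet("/tmp/out")   // placeholder sink
println(s"this ds wrote ${recordsWritten.sum} rows")
```

For the control-flow case, `dataset.isEmpty` (Spark 2.4+) or `dataset.take(1).isEmpty` answers the question without scanning the whole input.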
On its own, `count` won't have any impact on the execution plan (the execution plan for `count` is normally different from the execution plan of its parent anyway; in general Spark does as little work as possible, so it removes the parts of the execution plan that are not required to compute `count`).
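You can see the pruning by comparing the optimized plans. `Dataset.count` executes a plan equivalent to `groupBy().count()`, so a sketch like this (the column expressions are placeholders) shows the projection disappearing:

```scala
val ds = spark.range(1000).selectExpr("id", "id * 2 AS doubled")

ds.explain()                    // parent plan: computes both columns
ds.groupBy().count().explain()  // count-equivalent plan: the projection is
                                // pruned away, only row counting remains
```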
With cache:
`count` with `cache` is a bad practice naively copied from patterns used with the RDD API. It is already disputable with RDDs, but with `DataFrame` it can break a lot of internal optimizations (selection and predicate pushdown) and, technically speaking, is not even guaranteed to work.
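A sketch of why this hurts (the path and column names are placeholders): once `count` has materialized the full dataset into the cache, later queries run against the in-memory relation, so filters are no longer pushed down to the source scan:

```scala
import spark.implicits._

// Anti-pattern: count used only to force the cache
val events = spark.read.parquet("/data/events").cache()   // placeholder path
events.count()   // materializes *all* rows and columns into the cache

// This filter is evaluated against the cached InMemoryRelation instead of
// being pushed down to the Parquet reader.
events.where($"date" > "2024-01-01").explain()
```

And because cached blocks can be evicted under memory pressure and recomputed later, the `count` gives no hard guarantee that the data actually stays materialized.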