Any performance issues forcing eager evaluation using count in spark?


Commonly I see Dataset.count used throughout codebases in three scenarios:

  1. logging: log.info(s"this ds has ${dataset.count} rows")
  2. branching: if (dataset.count > 0) do x else do y
  3. forcing a cache: dataset.cache.count
1 Answer
  • 2020-11-29 11:15

    TL;DR 1) and 2) can usually be avoided but shouldn't harm you (ignoring the cost of evaluation); 3) is typically a harmful cargo-cult programming practice.

    Without cache:

    Calling count alone is mostly wasteful. While not always straightforward, logging can be replaced with information retrieved from listeners (here is an example for RDDs), and control flow requirements can usually (though not always) be addressed with a better pipeline design.
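
    As a rough sketch of the listener idea (assuming a SparkSession named spark, a Dataset named dataset and a logger named log, as in the question, plus a made-up output path), the row count can be collected from task output metrics when the data is written, instead of running a separate count job just for logging:

        import java.util.concurrent.atomic.LongAdder
        import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

        // Sums records written across all tasks; avoids an extra count job.
        class RecordsWrittenListener extends SparkListener {
          val written = new LongAdder
          override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
            // taskMetrics may be null for failed tasks
            Option(taskEnd.taskMetrics).foreach(m => written.add(m.outputMetrics.recordsWritten))
          }
        }

        val listener = new RecordsWrittenListener
        spark.sparkContext.addSparkListener(listener)

        dataset.write.parquet("/tmp/out")  // hypothetical output path
        log.info(s"this ds has ${listener.written.sum} rows")

    This only helps when the pipeline writes the data out anyway; the point is that the number of rows is then a by-product of work Spark already does.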

    On its own it won't have any impact on the execution plan (the execution plan for count is normally different from the execution plan of the parent anyway; in general Spark does as little work as possible, so it will remove the parts of the execution plan that are not required to compute count).
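
    A minimal sketch of that pruning (the input path and column names are made up): count executes roughly the plan of groupBy().count(), which lets Spark drop column reads that the parent query would otherwise need:

        import spark.implicits._

        val parent = spark.read.parquet("/tmp/data")  // hypothetical input path
          .filter($"a" > 0)
          .select($"a", $"b", $"c")

        parent.explain()                    // scan reads columns a, b and c
        parent.groupBy().count().explain()  // roughly the plan count executes:
                                            // b and c are pruned, only a is read for the filter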

    With cache:

    count with cache is a bad practice naively copied from patterns used with the RDD API. It is already disputable with RDDs, but with DataFrames it can break a lot of internal optimizations (selection and predicate pushdown) and, technically speaking, is not even guaranteed to work.
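
    A minimal sketch of the pushdown problem (path and columns are made up): once count has eagerly materialized the full dataset in the cache, later queries read from the cached, unpruned data instead of pushing the projection and the predicate down to the source:

        import spark.implicits._

        val df = spark.read.parquet("/tmp/data")  // hypothetical input path

        // Force eager, full materialization of the cache:
        df.cache()
        df.count()

        // explain now shows an InMemoryTableScan: the filter and projection are
        // applied on top of the cached data, rather than being pushed into the
        // Parquet scan (which, uncached, would read only id and name).
        df.where($"id" === 1).select($"name").explain()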
