Commonly I see `Dataset.count` throughout codebases in three scenarios:

1. Logging: `log.info(s"this ds has ${dataset.count} rows")`
2. Control flow, e.g. branching on whether a dataset is empty
3. Forcing evaluation of a `cache`
TL;DR: 1) and 2) can usually be avoided but shouldn't harm you (ignoring the cost of evaluation); 3) is typically a harmful cargo cult programming practice.
Without cache
Calling `count` alone is mostly wasteful. While not always straightforward, logging can be replaced with information retrieved from listeners (here is an example for RDDs), and control flow requirements can usually (though not always) be addressed with a better pipeline design.
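As a rough sketch of the listener approach, assuming Spark 2.x+ and a job that actually writes output (`recordsWritten` is only populated by output tasks, so this is not a drop-in replacement for every `count`; the output path is a placeholder):

```scala
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulate records written by output tasks instead of running a separate count job.
val recordsWritten = new LongAdder
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    // taskMetrics can be null for some failed tasks
    Option(taskEnd.taskMetrics).foreach { m =>
      recordsWritten.add(m.outputMetrics.recordsWritten)
    }
})

dataset.write.parquet("/tmp/out")   // placeholder sink
println(s"this ds wrote ${recordsWritten.sum} rows")
```

For the control-flow case, `dataset.isEmpty` (Spark 2.4+) or `dataset.take(1).isEmpty` answers the question without scanning the whole input.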
On its own, `count` won't have any impact on the execution plan (the execution plan for `count` is normally different from the execution plan of its parent anyway; in general Spark does as little work as possible, so it removes the parts of the execution plan that are not required to compute `count`).
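You can see the pruning by comparing the optimized plans. `Dataset.count` executes a plan equivalent to `groupBy().count()`, so a sketch like this (the column expressions are placeholders) shows the projection disappearing:

```scala
val ds = spark.range(1000).selectExpr("id", "id * 2 AS doubled")

ds.explain()                    // parent plan: computes both columns
ds.groupBy().count().explain()  // count-equivalent plan: the projection is
                                // pruned away, only row counting remains
```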
With cache:
`count` with `cache` is a bad practice naively copied from patterns used with the RDD API. It is already disputable with RDDs, but with `DataFrame` it can break a lot of internal optimizations (selection and predicate pushdown) and, technically speaking, is not even guaranteed to work.
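A sketch of why this hurts (the path and column names are placeholders): once `count` has materialized the full dataset into the cache, later queries run against the in-memory relation, so filters are no longer pushed down to the source scan:

```scala
import spark.implicits._

// Anti-pattern: count used only to force the cache
val events = spark.read.parquet("/data/events").cache()   // placeholder path
events.count()   // materializes *all* rows and columns into the cache

// This filter is evaluated against the cached InMemoryRelation instead of
// being pushed down to the Parquet reader.
events.where($"date" > "2024-01-01").explain()
```

And because cached blocks can be evicted under memory pressure and recomputed later, the `count` gives no hard guarantee that the data actually stays materialized.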