Apache Spark: foreach vs. foreachPartition, when to use which?

别那么骄傲 2020-11-28 06:59

I would like to know if foreachPartition will result in better performance, due to a higher level of parallelism, compared to the foreach m

5 Answers
  •  有刺的猬
    2020-11-28 07:23

    foreachPartition is only helpful when you're iterating through data that you are aggregating by partition.

    A good example is processing clickstreams per user. You'd want to clear your calculation cache every time you finish a user's stream of events, but keep it between records of the same user in order to calculate some user behavior insights.
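The clickstream case above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the record shape `(user_id, page)` and the names `summarize_users` and `process_partition` are hypothetical, and the sketch assumes that all events of one user sit next to each other inside a partition. It runs on a plain Python iterator, which is the same kind of argument that PySpark's `RDD.foreachPartition` passes to its function.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical record shape: (user_id, page)
Event = Tuple[str, str]


def summarize_users(events: Iterable[Event]) -> Iterator[Tuple[str, int, int]]:
    """Yield (user_id, event_count, distinct_pages) per run of one user's events.

    Assumption: a user's events are contiguous within the partition.
    """
    current_user: Optional[str] = None
    cache: List[str] = []  # per-user calculation cache
    for user_id, page in events:
        if user_id != current_user:
            # A new user's stream starts: emit the finished user, then clear the cache.
            if current_user is not None:
                yield current_user, len(cache), len(set(cache))
            current_user = user_id
            cache = []
        # The cache is kept between records of the same user.
        cache.append(page)
    # Emit the last user in the partition, if any.
    if current_user is not None:
        yield current_user, len(cache), len(set(cache))


def process_partition(events: Iterable[Event]) -> None:
    """Shaped like a foreachPartition callback: takes an iterator, returns nothing."""
    for user_id, n_events, n_pages in summarize_users(events):
        print(user_id, n_events, n_pages)  # stand-in for a real sink


if __name__ == "__main__":
    partition = [("u1", "/home"), ("u1", "/cart"), ("u1", "/home"), ("u2", "/home")]
    process_partition(iter(partition))
```

With a PySpark RDD of such tuples, the callback would be passed as `rdd.foreachPartition(process_partition)`. Getting each user's events adjacent inside one partition is a separate step; `repartitionAndSortWithinPartitions` on a key-value RDD is one possible way to arrange that. With plain `foreach`, the function is called once per record, so carrying a cache from one record to the next is less natural.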
