I would like to know if foreachPartition will result in better performance, due to a higher level of parallelism, compared to the foreach method
foreachPartition is only helpful when you're iterating through data that you are aggregating by partition.
A good example is processing clickstreams per user. You'd want to clear your calculation cache every time you finish a user's stream of events, but keep it between records of the same user in order to calculate some user behavior insights.
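To make the clickstream case concrete, here is a minimal sketch in plain Python of the kind of function you might hand to foreachPartition. It is only an illustration: the name summarize_partition, the (user_id, event) record shape, and the "click" event type are all made up for the example, and it assumes each user's events sit next to each other within a partition (for instance after partitioning and sorting by user). Plain lists stand in for Spark partitions so that it runs without a cluster.

```python
from itertools import groupby
from operator import itemgetter


def summarize_partition(records):
    """Process one partition, given as an iterator of (user_id, event) pairs.

    Assumption: all events of a given user are contiguous in the partition.
    """
    summaries = []
    # groupby starts a new group each time the user_id changes
    for user_id, user_events in groupby(records, key=itemgetter(0)):
        # Fresh cache for each user; it persists across that user's records only
        cache = {"events": 0, "clicks": 0}
        for _, event in user_events:
            cache["events"] += 1
            if event == "click":
                cache["clicks"] += 1
        summaries.append((user_id, cache["events"], cache["clicks"]))
    return summaries


# Hypothetical data: two lists standing in for two partitions
partitions = [
    [("alice", "view"), ("alice", "click"), ("bob", "click")],
    [("carol", "view"), ("carol", "view")],
]
results = [summarize_partition(iter(p)) for p in partitions]
```

In PySpark, rdd.foreachPartition(f) calls f once per partition with an iterator over that partition's records and discards the return value, so a real version would write its summaries somewhere (a database, an accumulator) instead of returning them. With foreach, the function is called once per record, so there is no natural place to keep the cache between records of the same user.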