I've come across the `glom()` method on RDD. As per the documentation:
> Return an RDD created by coalescing all elements within each partition into an array
> Does `glom` shuffle the data across partitions, or does it only convert the partition data into an array?

No, it doesn't.

> If it is the second case, I believe that the same can be achieved using `mapPartitions`.
It can:

```scala
rdd.mapPartitions(iter => Iterator(iter.toArray))
```
but the same thing applies to any non-shuffling transformation like `map`, `flatMap` or `filter`.
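For illustration, here is a minimal sketch (assuming a `SparkContext` named `sc`; the example RDD and its 3-partition split are made up) showing that `glom` and the `mapPartitions` one-liner produce the same per-partition arrays:

```scala
val rdd = sc.parallelize(1 to 10, 3) // small RDD spread over 3 partitions

// glom: one Array per partition, partitioning untouched
rdd.glom().collect()
// e.g. Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))

// the mapPartitions equivalent yields the same result
rdd.mapPartitions(iter => Iterator(iter.toArray)).collect()
```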
> if there are any use cases which benefit from glom.
Any situation where you need to access partition data in a form that is traversable more than once.
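For example (a hypothetical sketch, again assuming a `SparkContext` named `sc`): computing a per-partition range needs two passes over the same data, once for the minimum and once for the maximum. The one-shot `Iterator` handed to `mapPartitions` cannot be replayed, but the `Array` produced by `glom` can:

```scala
val data = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0), 2)

// arr is a plain Array, so it can be traversed twice: once by max, once by min
val perPartitionRange = data.glom().map { arr =>
  if (arr.isEmpty) 0.0 else arr.max - arr.min
}

perPartitionRange.collect() // one range value per partition
```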
"... Glom()
In general, spark does not allow the worker to refer to specific elements of the RDD.
Keeps the language clean, but can be a major limitation.
glom() transforms each partition into a tuple (immutabe list) of elements.
Creates an RDD of tules. One tuple per partition.
workers can refer to elements of the partition by index.
but you cannot assign values to the elements, the RDD is still immutable.
> Now we can understand the command used above to count the number of elements in each partition.
>
> * We use `glom()` to make each partition into a tuple.
> * We use `len` on each partition to get the length of the tuple, i.e. the size of the partition.
> * We `collect` the results to print them out.
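The quoted notes are about PySpark, where the command described is presumably something like `rdd.glom().map(len).collect()`. A Scala equivalent, sketched under the same assumption of a `SparkContext` named `sc`, looks like this:

```scala
val rdd = sc.parallelize(1 to 100, 4)

// glom turns each partition into an Array; length is the partition size;
// collect brings one count per partition back to the driver
rdd.glom().map(_.length).collect() // e.g. Array(25, 25, 25, 25)
```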