I've come across the glom() method on RDD. As per the documentation:
> Return an RDD created by coalescing all elements within each partition into a list.
"... Glom()
In general, spark does not allow the worker to refer to specific elements of the RDD.
Keeps the language clean, but can be a major limitation.
glom() transforms each partition into a tuple (immutabe list) of elements.
Creates an RDD of tules. One tuple per partition.
workers can refer to elements of the partition by index.
but you cannot assign values to the elements, the RDD is still immutable.
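To make the quoted description concrete, here is a minimal sketch of what glom() produces. The SparkContext setup and the example RDD are my own assumptions, not part of the original; note that in PySpark, glom() yields each partition as a Python list, matching the docstring quoted above.

```python
from pyspark import SparkContext

# Assumed setup for illustration; any existing SparkContext works.
sc = SparkContext("local[2]", "glom-demo")

# Six elements explicitly split across 2 partitions.
rdd = sc.parallelize(range(6), numSlices=2)

print(rdd.getNumPartitions())  # 2
print(rdd.glom().collect())    # [[0, 1, 2], [3, 4, 5]] -- one list per partition
```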
Now we can understand the command used above to count the number of elements in each partition:

* We use glom() to make each partition into a tuple.
* We use len on each partition to get the length of the tuple, i.e. the size of the partition.
* We collect the results to print them out.
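The command itself isn't reproduced above, but assuming it follows the three steps just listed, a sketch continuing from the previous snippet would be:

```python
# glom() -> one list per partition; len -> size of each list;
# collect() -> bring the per-partition sizes back to the driver.
print(rdd.glom().map(len).collect())  # [3, 3]
```

Only the per-partition counts are collected to the driver, so the full data set never has to be brought back to inspect how evenly it is distributed.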