Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex
Suggestions for improving this comparison are welcome.
map(func) What does it do? It passes each element of the RDD through the supplied function, func, and returns a new RDD of the results, with exactly one output element per input element.
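To make the one-in, one-out behavior concrete, here is a minimal sketch in plain Python that models map on an ordinary list. It does not use Spark; the helper name rdd_map and the sample data are hypothetical, chosen only for illustration.

```python
# Plain-Python model of RDD.map; no Spark required.
# With PySpark, the corresponding call would look like:
#   sc.parallelize(words).map(len).collect()

def rdd_map(func, elements):
    """Apply func to every element: exactly one output per input."""
    return [func(x) for x in elements]

words = ["spark", "map", "rdd"]   # hypothetical sample data
lengths = rdd_map(len, words)

print(lengths)  # [5, 3, 3]
```

The result always has the same number of elements as the input, which is the property to keep in mind when comparing map with flatMap.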
flatMap(func) “Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).”
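Compare flatMap to map: when func returns a sequence, map yields one sequence per input element (a nested result), while flatMap flattens those sequences into a single RDD of individual items.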
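The difference is easiest to see side by side. The sketch below models both operations on plain Python lists rather than on a real RDD; rdd_map, rdd_flat_map, and with_double are hypothetical names used only for this illustration.

```python
from itertools import chain

# Plain-Python models of RDD.map and RDD.flatMap; no Spark required.
# With PySpark, the corresponding calls would be rdd.map(f) and rdd.flatMap(f).

def rdd_map(func, elements):
    """One output per input, whatever func returns."""
    return [func(x) for x in elements]

def rdd_flat_map(func, elements):
    """func returns a sequence per input; the sequences are flattened."""
    return list(chain.from_iterable(func(x) for x in elements))

def with_double(x):
    # Returns a sequence of two items for each input item.
    return [x, x * 2]

data = [1, 2, 3]
mapped = rdd_map(with_double, data)
flat_mapped = rdd_flat_map(with_double, data)

print(mapped)       # [[1, 2], [2, 4], [3, 6]]
print(flat_mapped)  # [1, 2, 2, 4, 3, 6]

# "0 or more output items": returning an empty sequence drops the element.
evens_only = rdd_flat_map(lambda x: [x] if x % 2 == 0 else [], data)
print(evens_only)   # [2]
```

With map the result stays nested, one list per input; with flatMap the nesting disappears, and an input can contribute nothing at all.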
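mapPartitions(func) Consider mapPartitions a tool for performance optimization. It won’t do much for you when running examples on your local machine compared to running across a cluster. It is similar to map, but func is called once per RDD partition rather than once per element: it receives an iterator over the partition’s elements and returns an iterator of results. This helps when func needs costly setup that every element in a partition can share. Remember the first D in RDD is “Distributed” (Resilient Distributed Datasets). Put another way, the data is distributed over partitions.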
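A rough sketch of the per-partition calling convention follows. It models an RDD as a list of lists (one inner list per partition) in plain Python, so it shows the semantics only, not any cluster behavior. The names map_partitions, add_offset, and setup_log are hypothetical, and the "setup" step merely stands in for something expensive such as opening a connection.

```python
# Plain-Python model of RDD.mapPartitions; no Spark required.
# With PySpark, the corresponding call would be rdd.mapPartitions(add_offset).

def map_partitions(func, partitions):
    """Call func once per partition, passing an iterator over its elements."""
    return [list(func(iter(part))) for part in partitions]

setup_log = []  # records how many times the per-partition setup ran

def add_offset(partition_iter):
    setup_log.append("setup")   # stand-in for costly one-time setup
    offset = 100
    for x in partition_iter:
        yield x + offset

# Assumed layout: nine elements spread over three partitions.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result = map_partitions(add_offset, partitions)

print(result)          # [[101, 102, 103], [104, 105, 106], [107, 108, 109]]
print(len(setup_log))  # 3
```

The setup runs three times, once per partition, whereas the same work inside a map function would run nine times, once per element.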
mapPartitionsWithIndex(func) Similar to mapPartitions, but func also receives an Int value giving the index of the partition it is processing.
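As a baseline, here is a minimal plain-Python sketch of the calling convention when all the data sits in a single partition. It is a model rather than Spark code; map_partitions_with_index and tag_with_index are hypothetical names, and the single-partition layout is an assumption made for the illustration.

```python
# Plain-Python model of RDD.mapPartitionsWithIndex; no Spark required.
# With PySpark, the corresponding call would be
#   rdd.mapPartitionsWithIndex(tag_with_index)

def map_partitions_with_index(func, partitions):
    """Call func(index, iterator) once per partition; flatten like collect()."""
    out = []
    for index, part in enumerate(partitions):
        out.extend(func(index, iter(part)))
    return out

def tag_with_index(index, partition_iter):
    # Pair every element with the index of the partition that holds it.
    for x in partition_iter:
        yield (index, x)

single = [[1, 2, 3, 4, 5, 6, 7, 8, 9]]  # assumed: everything in one partition
tagged_single = map_partitions_with_index(tag_with_index, single)

print(tagged_single[:3])  # [(0, 1), (0, 2), (0, 3)]
```

With only one partition, every element is tagged with index 0.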
If we change the above example to use a parallelize’d list with 3 slices, our output changes significantly:
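The following plain-Python sketch models what happens with three slices. It does not run Spark: split_into_slices is a hypothetical helper that cuts the list into contiguous, near-equal chunks, and Spark's own slicing in parallelize may place elements differently, particularly when the length is not divisible by the slice count.

```python
# Plain-Python model of parallelize(data, 3) followed by
# mapPartitionsWithIndex; no Spark required. All helper names are hypothetical.

def split_into_slices(data, num_slices):
    """Cut data into num_slices contiguous, near-equal chunks (assumed scheme)."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def map_partitions_with_index(func, partitions):
    """Call func(index, iterator) once per partition; flatten like collect()."""
    out = []
    for index, part in enumerate(partitions):
        out.extend(func(index, iter(part)))
    return out

def tag_with_index(index, partition_iter):
    for x in partition_iter:
        yield (index, x)

slices = split_into_slices(list(range(1, 10)), 3)
tagged = map_partitions_with_index(tag_with_index, slices)

print(slices)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(tagged)
# [(0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (1, 6), (2, 7), (2, 8), (2, 9)]
```

The function is now invoked three times, and the index it receives (0, 1, or 2) tells it which slice of the list it is working on.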