I would like to do a computation on many partitions, to benefit from the parallelism, and then write my results to a single file, probably a parquet file. The workflow I tried is to apply a map, call coalesce(1), and then write, but then the whole job, including the map, seems to run on a single partition.
Spark does lazy evaluation, meaning that it won't execute anything until there is a call to an action. `write` and `count` are both actions that will trigger execution. Transformations like `map` and `filter` are simply executed as part of an action, not before the action runs.
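For illustration, here is a minimal PySpark sketch (the data, lambdas, and partition count are made up for the example) showing that `map` and `filter` only build a lazy plan, which executes when an action such as `count` is called:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000), numSlices=8)   # 8 partitions
mapped = rdd.map(lambda x: x * 2)                     # transformation: nothing runs yet
filtered = mapped.filter(lambda x: x % 4 == 0)        # still lazy

print(filtered.count())                               # action: map and filter execute here
```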
Now, your pipeline is extremely simple and you have only one action (`write`), so the `map` is performed while writing the file. With the call to `coalesce(1)`, however, you have also told Spark to gather all the data into one partition before performing the `write` action, and since the `map` is part of what is executed in the `write` action, the `map` will also run in one partition.
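To make that concrete, here is a hypothetical reconstruction of the kind of pipeline in the question (the DataFrame, column expression, and output path are placeholders, not your actual code); with `coalesce(1)`, the map and the write are planned into the same single-partition stage:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 10_000_000, numPartitions=200)          # placeholder input with 200 partitions
mapped_df = df.withColumn("result", F.sqrt(F.col("id")))    # stands in for the expensive map

# coalesce(1) merges partitions without a shuffle, so Spark plans the map and the
# write into the same single-partition stage: the map loses all parallelism.
mapped_df.coalesce(1).write.mode("overwrite").parquet("/tmp/out_single_partition")
```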
I hope this makes sense. I suggest you also have a read through some of the blog posts on how Spark works. This one from Cloudera should give you some insight :)
You want to use "repartition(1)" instead of "coalesce(1)". The issue is that "repartition" will happily do shuffling to accomplish its ends, while "coalesce" will not.
"Coalesce" is much more efficient than "repartition", but has to be used carefully, or parallelism will end up being severely constrained as you have experienced. All the partitions "coalesce" merges into a particular result partition have to reside on the same node. The "coalesce(1)" call demands a single result partition, so all partitions of "mapped_df" need to reside on a single node. To make that true, Spark shoehorns "mapped_df" into a single partition.