Spark - How to do computation on N partitions and then write to 1 file


I would like to do a computation on many partitions, to benefit from the parallelism, and then write my results to a single file, probably a parquet file. The workflow I tried is essentially a map over the DataFrame, followed by coalesce(1), then a write, but the map ends up running in a single partition.
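
A minimal sketch of that workflow (the input/output paths and the per-row computation are placeholders, not from the original question):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("map-then-single-file").getOrCreate()

    df = spark.read.parquet("input/")  # placeholder path; read into many partitions
    mapped_df = df.withColumn("result", F.col("value") * 2)  # stand-in for the expensive computation
    mapped_df.coalesce(1).write.parquet("output/")  # intended: a single output file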

2 Answers
  • 2021-01-21 16:15

    Spark does lazy evaluation, meaning that it won't execute anything until there is a call to an action. write and count are both actions that will trigger execution. Transformations like map and filter are only executed as part of an action, not before.
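
    As a small illustration (a sketch; the numbers are made up), nothing runs until the action at the end:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        df = spark.range(1_000_000)                      # transformation: builds a plan, runs nothing
        doubled = df.withColumn("x2", F.col("id") * 2)   # still nothing has executed
        evens = doubled.filter(F.col("x2") % 4 == 0)     # still lazy

        print(evens.count())  # count is an action: the whole pipeline executes here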

    Now, your pipeline is extremely simple and has only one action (write), so the map is performed while writing the file. With the call to coalesce(1), however, you have also told Spark to gather all the data into one partition before performing the write, and since the map is part of what gets executed inside that write action, the map also runs in one partition.
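
    One way to see this is to inspect partition counts (a sketch, reusing mapped_df from the question):

        print(mapped_df.rdd.getNumPartitions())  # many, e.g. one per input split
        single = mapped_df.coalesce(1)
        print(single.rdd.getNumPartitions())     # 1 -- and since coalesce adds no
        # shuffle, the upstream map is folded into the same single-partition stage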

    I hope this makes sense. I suggest you also have a read through some of the blog posts on how Spark works. This one from Cloudera should give you some insight :)

  • 2021-01-21 16:22

    You want to use "repartition(1)" instead of "coalesce(1)". The issue is that "repartition" will happily do shuffling to accomplish its ends, while "coalesce" will not.

    "Coalesce" is much more efficient than "repartition", but has to be used carefully, or parallelism will end up being severely constrained as you have experienced. All the partitions "coalesce" merges into a particular result partition have to reside on the same node. The "coalesce(1)" call demands a single result partition, so all partitions of "mapped_df" need to reside on a single node. To make that true, Spark shoehorns "mapped_df" into a single partition.
