I'm running Spark 2 and am trying to shuffle around 5 terabytes of JSON. I'm running into very long garbage collection pauses during shuffling of a Dataset<
Adding the following flags got rid of the GC pauses.
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
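In case it helps, here is a minimal sketch of how those three flags combine into the single space-separated value that the property expects. The class and method names (`GcOptions`, `executorJavaOptions`) are made up for illustration and are not part of Spark. The sketch only builds and prints the string, so it runs without Spark on the classpath.

```java
// Hypothetical helper: the class and method names are illustrative, not part of Spark.
class GcOptions {
    // Builds the value for spark.executor.extraJavaOptions from the three flags above.
    // All JVM flags go into one space-separated string.
    static String executorJavaOptions(int initiatingHeapOccupancyPercent, int concGcThreads) {
        return String.join(" ",
            "-XX:+UseG1GC",
            "-XX:InitiatingHeapOccupancyPercent=" + initiatingHeapOccupancyPercent,
            "-XX:ConcGCThreads=" + concGcThreads);
    }

    public static void main(String[] args) {
        String value = executorJavaOptions(35, 12);
        // With Spark available, the same value could be set before the context starts, e.g.
        //   new SparkConf().set("spark.executor.extraJavaOptions", value)
        // Here we only print it in the form accepted by spark-submit's --conf option.
        System.out.println("--conf \"spark.executor.extraJavaOptions=" + value + "\"");
    }
}
```

The values 35 and 12 are just the ones that worked for this job. The executor JVMs read these options at launch, so the property has to be in place before the application starts rather than changed on a running session.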
I think it does take a fair amount of tweaking, though. This Databricks post was very helpful.