I have a dataframe of the following form:
import scala.util.Random
val localData = (1 to 100).map(i => (i, Seq.fill(Math.abs(Random.nextGaussian() * 100).toInt)(Random.nextDouble()))) // the snippet was cut off here; Random.nextDouble() is an assumed payload
As you said, you can increase the number of partitions. I usually use a multiple of the number of cores: the Spark context's default parallelism * 2 or 3.
In your case, you could use a bigger multiplier.
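For example, a minimal sketch (assuming a SparkSession named spark, that the localData above is turned into a DataFrame df with columns id and data, and a multiplier of 3 picked arbitrarily; the later snippets reuse these names):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skew-example").getOrCreate()
import spark.implicits._

// Hypothetical reconstruction: turn the question's localData into a DataFrame named df
val df = localData.toDF("id", "data")

// Repartition to a multiple of the default parallelism (the factor 3 is just an example)
val repartitioned = df.repartition(spark.sparkContext.defaultParallelism * 3)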
Another solution would be to split your df in two with a filter.
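For instance, something like this (continuing from the snippet above; the 100-element cutoff is an arbitrary threshold you would tune to your actual skew):

import org.apache.spark.sql.functions.{col, size}

// size() gives the array length per row; 100 is an arbitrary cutoff, tune it to your data
val heavy = df.filter(size(col("data")) > 100)
val light = df.filter(size(col("data")) <= 100)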
You could then repartition each of them, perform the computation, and union them back.
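Roughly like this (again a sketch; computeSomething stands in for your real transformation, and the partition counts 400 and 50 are made up):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, size}

// Placeholder for whatever you actually compute per row; here it just records the payload length
def computeSomething(in: DataFrame): DataFrame = in.withColumn("len", size(col("data")))

// Give the skewed slice far more partitions than the light one, then stitch the results back together
val result = computeSomething(heavy.repartition(400))
  .union(computeSomething(light.repartition(50)))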
Beware that repartitioning may be expensive, since you have large rows to shuffle around.
You could have a look at these slides (27+): https://www.slideshare.net/SparkSummit/custom-applications-with-sparks-rdd-spark-summit-east-talk-by-tejas-patil
They were experiencing very bad data skew and had to handle it in an interesting way.