Spark optimization - joins - very low number of tasks - OOM

Submitted by 蓝咒 on 2021-01-07 03:59:30

Question


My Spark application fails with this error: Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
This is what I get when I inspect the container log: java.lang.OutOfMemoryError: Java heap space

My application mainly gets a table and then joins it with different tables that I read from AWS S3:

// readParquet(...) loads a table from S3 as a DataFrame
var result = readParquet(table1)
val table2 = readParquet(table2)

// each join chains onto the accumulated result on the same primary key
result = result.join(table2, result(primaryKey) === table2(foreignKey))

val table3 = readParquet(table3)

result = result.join(table3, result(primaryKey) === table3(foreignKey))

val table4 = readParquet(table4)

result = result.join(table4, result(primaryKey) === table4(foreignKey))

and so on

My application fails when I try to save my result DataFrame to PostgreSQL using:

// lowercase the column names, then write the result to PostgreSQL over JDBC
// (the JDBC connection url/credentials options are not shown here)
result.toDF(result.columns.map(x => x.toLowerCase()): _*).write
  .mode("overwrite")
  .format("jdbc")
  .option(JDBCOptions.JDBC_TABLE_NAME, table)
  .save()
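
For reference, a JDBC write to PostgreSQL normally also carries the connection options; a minimal sketch, with a hypothetical URL, credentials and driver (the "dbtable" option is what JDBCOptions.JDBC_TABLE_NAME resolves to):

result.toDF(result.columns.map(_.toLowerCase): _*).write
  .mode("overwrite")
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")  // hypothetical connection URL
  .option("dbtable", table)                           // equivalent to JDBCOptions.JDBC_TABLE_NAME
  .option("user", "username")                         // hypothetical credentials
  .option("password", "password")
  .option("driver", "org.postgresql.Driver")
  .save()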

On my failed join stage I have a very low number of tasks: 6 tasks for 4 executors.

Why does my stage generate 2 jobs?

The first one completes with 426 tasks, while the second one fails.

My spark-submit configuration:

dynamicAllocation = true  
num core = 2
driver memory = 6g
executor memory = 6g
max num executor = 10
min num executor = 1
spark.default.parallelism = 400
spark.sql.shuffle.partitions = 400
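
For concreteness, the dynamic-allocation and parallelism settings above correspond to the following Spark configuration keys; a minimal sketch (the app name is hypothetical, and the memory/core settings are normally passed on the spark-submit command line rather than set in code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-pipeline")                             // hypothetical app name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.default.parallelism", "400")
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()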

I tried with more resources, but I get the same problem:

 num core = 5
 driver memory = 16g
 executor memory = 16g
 num executor = 20

I think that all the data goes to the same partition/executor even with a default of 400 partitions, and this causes the OOM error.
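
One way to check that hypothesis is to count how many rows land in each partition of the joined DataFrame; a minimal sketch using spark_partition_id(), assuming result is the joined DataFrame from above:

import org.apache.spark.sql.functions.{spark_partition_id, desc}

// rows per partition: a single huge partition would confirm the hypothesis
result.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy(desc("count"))
  .show(20)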

I tried (without success):
- persisting the data
- a broadcast join, but my table is not small enough to broadcast at the end
- repartitioning to a higher number (4000) and doing a count between each join to force an action (see the sketch below)
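
A minimal sketch of that last attempt, reusing the readParquet helper and the table/column names from the question:

import org.apache.spark.sql.functions.col

// repartition on the join key and force an action after each join
var result = readParquet(table1).repartition(4000, col(primaryKey))
val table2 = readParquet(table2)

result = result.join(table2, result(primaryKey) === table2(foreignKey))
println(s"rows after joining table2: ${result.count()}")  // the count forces the join to run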

My main table seems to grow very fast:
(number of rows) 40 -> 68 -> 7304 -> 946 832 -> 123 032 864 -> 246 064 864 -> (takes too long after that)
However, the data size seems very low.

If I look at the task metrics, an interesting thing is that my data seems skewed (I am really not sure).
In the last count action, I can see that ~120 tasks perform the action, with ~10 MB of input data for 100 records in 12 seconds, while the other 3880 tasks do absolutely nothing (3 ms, 0 records, 16 B of input (metadata?)).
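
One way to confirm the skew is to look at the distribution of the join key itself; a minimal sketch, again assuming result and primaryKey from the question:

import org.apache.spark.sql.functions.desc

// rows per join-key value: a few very large counts means the key is skewed
result.groupBy(primaryKey)
  .count()
  .orderBy(desc("count"))
  .show(20)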


Answer 1:


driver memory = 16g is too much memory and is not needed. Use that much only when you have a huge amount of data to pull back to the driver with actions like collect(); make sure to increase spark.driver.maxResultSize if that is the case.

You can do the following things:

-- Repartition while reading the files: readParquet(table1).repartition(x). If one of the tables is small, you can broadcast it and remove the join; instead use mapPartitions with a broadcast variable as a lookup cache (see the sketch after this list).

(OR)

-- Select a column that is uniformly distributed and repartition your table on that particular column.
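
A minimal sketch of the first suggestion, assuming spark is the active SparkSession, table2 is the small table, and that it exposes hypothetical "key" and "value" columns for the lookup-cache variant:

import org.apache.spark.sql.functions.{broadcast, col}
import spark.implicits._

// repartition right after reading, before any join
val big = readParquet(table1).repartition(400, col(primaryKey))
val small = readParquet(table2)

// if the small table fits in memory, a broadcast join avoids shuffling the big side
val joined = big.join(broadcast(small), big(primaryKey) === small(foreignKey))

// alternatively, collect the small table into a broadcast variable and use it
// as a lookup cache inside mapPartitions instead of a join
val lookup = spark.sparkContext.broadcast(
  small.select($"key", $"value").as[(String, String)].collect().toMap)

val enriched = big.mapPartitions { rows =>
  rows.map { r =>
    val k = r.getAs[String](primaryKey)
    (k, lookup.value.getOrElse(k, null))
  }
}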

Two points stand out from the stats above: your job has a high scheduling delay, which is caused by too many tasks, and in your task stats a few tasks are launched with input data of about 10 bytes while a few are launched with ~9 MB... obviously, there is data skew here. As you said, the first job completed with 426 tasks, but with 4000 as the repartition count it should launch more tasks.

Please look at https://towardsdatascience.com/the-art-of-joining-in-spark-dcbd33d693c for more insights.



Source: https://stackoverflow.com/questions/63558408/spark-optimization-joins-very-low-number-of-task-oom
