I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to the partitioned table, it produces very small files (in the KB range), so the output partitioned table folder now has 5000+ small files. I want to merge these into a few larger files, say around 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?
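In case it helps to show what I mean, here is a rough sketch of the kind of workaround I have been considering, not something I have confirmed works: reduce the number of tasks that write into each partition, either with DISTRIBUTE BY in the SQL or by repartitioning the DataFrame (repartition by column needs Spark 1.6+). Table and column names are the ones from my query above; whether this actually yields ~200 MB files would depend on the data.

// Possible workaround (unverified): shuffle rows so each Hive partition is
// written by a single task, producing fewer, larger files.

// Variant 1: SQL only, using DISTRIBUTE BY on the partition column.
val resultA = hiveContext.sql(
  """INSERT INTO TABLE partition_table PARTITION (date)
    |SELECT a, b, c, date FROM partition_json_table
    |DISTRIBUTE BY date""".stripMargin)

// Variant 2: DataFrame API (Spark 1.6+), repartition by the partition column
// and insert; relies on the dynamic-partition settings shown earlier.
val src = hiveContext.table("partition_json_table").select("a", "b", "c", "date")
src.repartition(src("date"))
   .write
   .mode("append")
   .insertInto("partition_table")

Is something along these lines the recommended way, or is there a setting on the Spark side equivalent to the Hive merge options?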