Spark SQL optimization techniques: loading CSV into Hive ORC format

Submitted by 独自空忆成欢 on 2020-04-30 07:15:04

Question


Hi, I have 90 GB of data in a CSV file. I load this data into a temp table, and then from the temp table into an ORC table using an INSERT ... SELECT statement, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I'm not using any optimization technique: I'm just using Spark SQL to load the data from the CSV file into a table (text format), and then from this temp table into the ORC table (using INSERT ... SELECT), submitted with spark-submit as:

    spark-submit \
    --class class-name \
    file.jar

Or can I add any extra parameters to spark-submit to improve performance?

Scala code (sample):

    // all imports
    import org.apache.spark.sql.SparkSession

    object demo {
      def main(args: Array[String]): Unit = {
        // SparkSession with Hive support enabled
        val sparksession = SparkSession.builder()
          .enableHiveSupport()
          .getOrCreate()

        val a1 = sparksession.sql("load data inpath 'filepath' overwrite into table table_name")

        val b1 = sparksession.sql("insert into tablename (all_column) select all_column from source_table")
      }
    }

Answer 1:


I'm just using Spark SQL and loading data from the CSV file to a table (text format) and then from this temp table to the ORC table (using INSERT ... SELECT)


A two-step process is not needed here.

  • Read the CSV into a DataFrame, like the sample below:

        val DFCsv = spark.read.format("csv")
          .option("sep", ",")
          .option("inferSchema", "true")
          .option("header", "true")
          .load("yourcsv")
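One thing worth knowing about the read step above: with inferSchema enabled, Spark makes an extra pass over the file just to guess column types, which is expensive on 90 GB. A hedged sketch of reading with an explicit schema instead (the column names and types are hypothetical placeholders, and a tiny temp CSV stands in for the real file so the sketch is self-contained):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("csv-explicit-schema")
  .master("local[*]") // local master only for this demo; drop on a cluster
  .getOrCreate()

// Hypothetical columns -- replace with the real file's layout.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("amount", DoubleType)
))

// Tiny stand-in CSV so the example runs anywhere.
val tmp = Files.createTempFile("demo", ".csv")
Files.write(tmp, "id,name,amount\n1,a,2.5\n2,b,3.5\n".getBytes)

// No inferSchema option: Spark skips the type-guessing pass
// and reads the data exactly once.
val dfCsv = spark.read
  .schema(schema)
  .option("sep", ",")
  .option("header", "true")
  .csv(tmp.toString)

val rowCount = dfCsv.count()
spark.stop()
```

For a file this large, defining the schema up front saves one full scan of the input before the real work even starts.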

  • If needed, repartition the DataFrame (this may well be the cause of the actual 4-hour delay, since you have not done it), because it is a large file. For example, DFCsv.repartition(90) will (or may) repartition the CSV data into roughly 90 equal parts. Here 90 is a sample number; use whatever fits your data.

      DFCsv.write.format("orc")
        .partitionBy("yourpartitioncolumns")
        .saveAsTable("yourtable")

OR

      DFCsv.write.format("orc")
        .partitionBy("yourpartitioncolumns")
        .insertInto("yourtable")

(Note: newer Spark versions reject partitionBy combined with insertInto; with insertInto, the target table's existing partitioning is used, so drop partitionBy in that case.)

Note: 1) For large data you need to repartition so the data is distributed uniformly; this increases parallelism and hence performance.

2) If you don't have partition columns and the table is non-partitioned, then partitionBy is not needed in the samples above.
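Putting the answer's steps together, here is a hedged end-to-end sketch of the single-step pipeline: read the CSV once, repartition, and write straight to ORC with no intermediate text-format table. Paths and the partition count are illustrative; a tiny temp CSV again stands in for the 90 GB file, and the demo writes ORC to a temp directory instead of a Hive table so it runs without Hive:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-orc")
  .master("local[*]") // demo-only; on a cluster, configure via spark-submit
  .getOrCreate()

// Stand-in for the 90 GB file.
val csvPath = Files.createTempFile("input", ".csv")
Files.write(csvPath, "id,name\n1,a\n2,b\n3,c\n".getBytes)

// Step 1: read the CSV once (no intermediate text-format Hive table).
val dfCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toString)

// Step 2: repartition so the write runs in parallel over roughly equal
// chunks; something like 90 would suit the real file, 2 is enough here.
val dfBalanced = dfCsv.repartition(2)

// Step 3: write directly as ORC. On a real cluster with a Hive-enabled
// session, use .saveAsTable("yourtable") here instead of .save(...).
val orcDir = Files.createTempDirectory("orc_out").toString + "/data"
dfBalanced.write.format("orc").mode("overwrite").save(orcDir)

// Read the ORC output back to confirm the round trip.
val orcRows = spark.read.orc(orcDir).count()
spark.stop()
```

On the spark-submit side, the standard resource flags (--num-executors, --executor-cores, --executor-memory, --driver-memory) and --conf spark.sql.shuffle.partitions=... control how much parallelism is actually available; no special flag is needed for ORC output itself.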



Source: https://stackoverflow.com/questions/60984121/spark-sql-optimization-techniques-loading-csv-to-orc-format-of-hive
