Spark SQL optimization techniques: loading CSV into Hive ORC format

Submitted by 独自空忆成欢 on 2020-04-30 07:15:04

Question


Hi, I have 90 GB of data in a CSV file. I load this data into a temp table, and then from the temp table into an ORC table using an INSERT ... SELECT statement, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I'm not using any optimization technique: I'm just using Spark SQL to load the data from the CSV file into a table (text format), and then from this temp table into the ORC table (using INSERT ... SELECT), submitted with spark-submit as:

    spark-submit \
    --class class-name \
    file.jar

Or can I add any extra parameters to spark-submit to improve performance?

Scala code (sample):

    // all imports
    import org.apache.spark.sql.SparkSession

    object demo {
      def main(args: Array[String]): Unit = {
        // SparkSession with Hive support enabled
        val sparksession = SparkSession.builder()
          .enableHiveSupport()
          .getOrCreate()

        val a1 = sparksession.sql("load data inpath 'filepath' overwrite into table table_name")

        val b1 = sparksession.sql("insert into tablename (all_column) select all_column from source_table")
      }
    }

Answer 1:


I'm just using Spark SQL and loading data from the CSV file to a table (text format) and then from this temp table to the ORC table (using INSERT ... SELECT)


A two-step process is not needed here.

  • Read the CSV into a DataFrame, like the sample below:

        val DFCsv = spark.read.format("csv")
          .option("sep", ",")
          .option("inferSchema", "true")
          .option("header", "true")
          .load("yourcsv")
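One thing worth knowing about the read step above: with inferSchema enabled, Spark makes an extra pass over the file just to guess column types, which is expensive on 90 GB. A hedged sketch of reading with an explicit schema instead (the column names and types are hypothetical placeholders, and a tiny temp CSV stands in for the real file so the sketch is self-contained):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("csv-explicit-schema")
  .master("local[*]") // local master only for this demo; drop on a cluster
  .getOrCreate()

// Hypothetical columns -- replace with the real file's layout.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("amount", DoubleType)
))

// Tiny stand-in CSV so the example runs anywhere.
val tmp = Files.createTempFile("demo", ".csv")
Files.write(tmp, "id,name,amount\n1,a,2.5\n2,b,3.5\n".getBytes)

// No inferSchema option: Spark skips the type-guessing pass
// and reads the data exactly once.
val dfCsv = spark.read
  .schema(schema)
  .option("sep", ",")
  .option("header", "true")
  .csv(tmp.toString)

val rowCount = dfCsv.count()
spark.stop()
```

For a file this large, defining the schema up front saves one full scan of the input before the real work even starts.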

  • If needed, repartition the DataFrame (this may well be the cause of the actual 4-hour delay, since you have not done it), because it is a large file. For example, DFCsv.repartition(90) will (or may) repartition the CSV data into roughly 90 equal parts. Here 90 is a sample number; use whatever fits your data.

      DFCsv.write.format("orc")
        .partitionBy("yourpartitioncolumns")
        .saveAsTable("yourtable")

OR

      DFCsv.write.format("orc")
        .partitionBy("yourpartitioncolumns")
        .insertInto("yourtable")

(Note: newer Spark versions reject partitionBy combined with insertInto; with insertInto, the target table's existing partitioning is used, so drop partitionBy in that case.)

Note: 1) For large data you need to repartition so the data is distributed uniformly; this increases parallelism and hence performance.

2) If you don't have partition columns and the table is non-partitioned, then partitionBy is not needed in the samples above.
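Putting the answer's steps together, here is a hedged end-to-end sketch of the single-step pipeline: read the CSV once, repartition, and write straight to ORC with no intermediate text-format table. Paths and the partition count are illustrative; a tiny temp CSV again stands in for the 90 GB file, and the demo writes ORC to a temp directory instead of a Hive table so it runs without Hive:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-orc")
  .master("local[*]") // demo-only; on a cluster, configure via spark-submit
  .getOrCreate()

// Stand-in for the 90 GB file.
val csvPath = Files.createTempFile("input", ".csv")
Files.write(csvPath, "id,name\n1,a\n2,b\n3,c\n".getBytes)

// Step 1: read the CSV once (no intermediate text-format Hive table).
val dfCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toString)

// Step 2: repartition so the write runs in parallel over roughly equal
// chunks; something like 90 would suit the real file, 2 is enough here.
val dfBalanced = dfCsv.repartition(2)

// Step 3: write directly as ORC. On a real cluster with a Hive-enabled
// session, use .saveAsTable("yourtable") here instead of .save(...).
val orcDir = Files.createTempDirectory("orc_out").toString + "/data"
dfBalanced.write.format("orc").mode("overwrite").save(orcDir)

// Read the ORC output back to confirm the round trip.
val orcRows = spark.read.orc(orcDir).count()
spark.stop()
```

On the spark-submit side, the standard resource flags (--num-executors, --executor-cores, --executor-memory, --driver-memory) and --conf spark.sql.shuffle.partitions=... control how much parallelism is actually available; no special flag is needed for ORC output itself.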



Source: https://stackoverflow.com/questions/60984121/spark-sql-optimization-techniques-loading-csv-to-orc-format-of-hive
