Question
Hi, I have 90 GB of data in a CSV file. I load this data into a temp table and then from the temp table into an ORC table using an insert ... select statement, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I am not using any optimization: I just use Spark SQL to load the data from the CSV file into a table (text format) and then from this temp table into the ORC table (via insert ... select), submitting the job with spark-submit as:
spark-submit \
  --class <class-name> \
  <application-jar>
Alternatively, can I pass any extra parameters to spark-submit to improve performance?
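(For reference, spark-submit does accept resource tuning flags; the values below are illustrative placeholders that need tuning to your cluster and data, not recommendations:)

spark-submit \
  --class <class-name> \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=200 \
  <application-jar>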
Scala code (sample):

import org.apache.spark.sql.SparkSession

object Demo {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support enabled
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .getOrCreate()

    // Step 1: load the CSV file into the temp (text-format) table
    spark.sql("load data inpath 'filepath' overwrite into table table_name")
    // Step 2: copy from the temp table into the ORC table
    spark.sql("insert into tablename select * from source_table")
  }
}
Answer 1:
"I'm just using Spark SQL and loading data from the CSV file into a table (text format) and then from this temp table into the ORC table (using insert ... select)"

The two-step process is not needed here:
- Read the CSV into a DataFrame, as in the sample below:
val DFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("yourcsv")
- If needed, repartition the DataFrame (this may well be the cause of the actual 4-hour delay, since it is a large file and you have not done it):

val repartitionedDF = DFCsv.repartition(90)

This splits the CSV data into roughly 90 equal partitions; 90 is just a sample number, use whatever suits your data and cluster. Note that repartition returns a new DataFrame rather than modifying DFCsv in place, so write repartitionedDF instead of DFCsv in the next step if you do repartition.
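If you want something less arbitrary than 90, a rough rule of thumb (an assumption on my part, not from the original answer) is to target partitions of about 128 MB each:

// Rough heuristic: one partition per ~128 MB of input.
// For 90 GB this works out to roughly 720 partitions.
val inputBytes = 90L * 1024 * 1024 * 1024
val targetPartitionBytes = 128L * 1024 * 1024
val numPartitions = (inputBytes / targetPartitionBytes).toInt

val repartitionedDF = DFCsv.repartition(numPartitions)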
DFCsv.write.format("orc")
.partitionBy('yourpartitioncolumns')
.saveAsTable('yourtable')
OR
DFCsv.write.format("orc")
.partitionBy('yourpartitioncolumns')
.insertInto('yourtable')
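One more caveat (my gloss, not part of the original answer): insertInto resolves columns strictly by position, not by name, so the DataFrame's column order must match the table definition exactly:

// insertInto matches by position, not by name: select the columns
// in exactly the order the table declares them. Column names here
// are hypothetical placeholders.
DFCsv.select("col1", "col2", "col3")
  .write.format("orc")
  .insertInto("yourtable")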
Note:
1) For large data, repartitioning to distribute the data uniformly increases parallelism and hence performance.
2) If you don't have partition columns and the target is a non-partitioned table, there is no need for partitionBy in the samples above.
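Putting the pieces together, a minimal end-to-end sketch of the single-step approach (paths, table name, and partition count are placeholders):

import org.apache.spark.sql.SparkSession

object CsvToOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvToOrc")
      .enableHiveSupport()
      .getOrCreate()

    // Read CSV, repartition for parallelism, write straight to ORC.
    spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("yourcsv")
      .repartition(90)
      .write.format("orc")
      .saveAsTable("yourtable")
  }
}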
Source: https://stackoverflow.com/questions/60984121/spark-sql-optimization-techniques-loading-csv-to-orc-format-of-hive