I am trying to create a Hive table from a CSV file that is stored in HDFS. The problem is that the CSV contains line breaks inside quoted fields. Example of a record in the CSV:
There is currently no way to handle multiline CSV in Hive directly. However, there are some workarounds:
Produce a CSV with \n or \r\n replaced by your own newline marker, such as <\br>. You will then be able to load it in Hive. Afterwards, transform the resulting text by replacing the marker with the original newline.
Use Spark, which has a multiline CSV reader. This works out of the box, although the CSV file is then not read in a distributed way.
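// Read the CSV with embedded line breaks and save it as an ORC-backed table.
// The chain ends in saveAsTable, which returns Unit, so nothing is assigned.
spark.read
  .option("wholeFile", true)   // older name for the multiLine option below
  .option("multiLine", true)   // allow quoted fields to span several lines
  .option("header", true)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")
  .write.format("orc")
  .saveAsTable("myschma.myTable")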
Use another format, such as Parquet, Avro, ORC or sequence files, instead of CSV. For example, you could use Sqoop to produce them from a JDBC database, or you could write your own program in Java or Python.
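The first workaround above (replacing embedded line breaks with a marker before loading into Hive) could be sketched roughly as follows in Python. This is only a minimal sketch: the function name `flatten_newlines` and the sample record are made up for illustration, and it assumes the input is standard quoted CSV that Python's `csv` module can parse. If the table is later read with a SerDe that treats the backslash as an escape character, a marker without a backslash is probably a safer choice.

```python
import csv
import io

# Newline marker taken from the text above; any string that never occurs in the data works.
MARKER = "<\\br>"


def flatten_newlines(src, dst, marker=MARKER):
    """Copy CSV rows from src to dst, replacing line breaks inside fields with marker."""
    # The csv module understands line breaks inside quoted fields.
    # Real files should be opened with newline="" so that it sees them unchanged.
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [
                field.replace("\r\n", marker).replace("\n", marker).replace("\r", marker)
                for field in row
            ]
        )


if __name__ == "__main__":
    # Hypothetical sample: the third column contains a line break inside quotes.
    sample = io.StringIO(
        'ID,PR_ID,SUMMARY\r\n1,10,"first line\nsecond line"\r\n', newline=""
    )
    out = io.StringIO()
    flatten_newlines(sample, out)
    print(out.getvalue())
```

After this step every record sits on a single physical line, so Hive can load the file line by line. The marker can be turned back into a newline at query time.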
I found a solution. You can define your own custom InputFormat. The DDL for the Hive table then looks like this (you first need to add your custom JAR file):
ADD JAR /path/to/your/jar/CSVCustomInputFormat.jar;
DROP TABLE hive_database.hive_table;
CREATE EXTERNAL TABLE hive_database.hive_table
(
ID STRING,
PR_ID STRING,
SUMMARY STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
STORED AS
INPUTFORMAT 'com.hql.custom.formatter.CSVCustomInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/path/to/hdfs/dir/csv'
tblproperties('skip.header.line.count'='1');
For an example of how to create a custom input format, see: https://analyticsanvil.wordpress.com/2016/03/06/creating-a-custom-hive-input-format-and-record-reader-to-read-fixed-format-flat-files/
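A real input format for Hive has to be written in Java against the Hadoop API, which is what the linked article walks through. The core logic that its record reader needs for this problem is small, though, and can be sketched in Python: keep appending physical lines to the current record until all quotes are closed. The function name `logical_records` is made up for illustration, and the quote and escape characters are assumed to match the SERDEPROPERTIES in the DDL above.

```python
def logical_records(lines, quote='"', escape="\\"):
    """Yield complete CSV records, joining physical lines until quotes are balanced."""
    buffer = []
    in_quotes = False
    for line in lines:
        skip_next = False
        for ch in line:
            if skip_next:
                skip_next = False          # character after the escape char is literal
            elif ch == escape:
                skip_next = True
            elif ch == quote:
                in_quotes = not in_quotes  # doubled quotes toggle twice, so they cancel out
        buffer.append(line)
        if not in_quotes:                  # a record may only end outside a quoted field
            yield "".join(buffer)
            buffer = []
    if buffer:                             # unterminated quote at end of input
        yield "".join(buffer)


if __name__ == "__main__":
    # Hypothetical input: three physical lines that form two logical records.
    physical = ['1,10,"first line\n', 'second line"\n', '2,20,"plain"\n']
    for record in logical_records(physical):
        print(repr(record))
```

A Java record reader would apply the same check while reading lines from its input split. One thing this sketch does not cover is a split that starts in the middle of a quoted field. A simple way around that is probably to make the input format non-splittable, at the cost of reading each file in a single task.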