Hive table from CSV. The line termination in quotes

老子叫甜甜 submitted on 2019-12-12 13:24:02

Question


I am trying to create a Hive table from a CSV file stored in HDFS. The problem is that the CSV contains line breaks inside quoted fields. Example of a record in the CSV:

ID,PR_ID,SUMMARY
2063,1184,"This is problem field because consists line break

This is not new record but it is part of text of third column
"

I created the Hive table:

CREATE TEMPORARY EXTERNAL TABLE  hive_database.hive_table
(   
    ID STRING,
    PR_ID STRING,
    SUMMARY STRING 
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "\"",
    "escapeChar"  = "\""
)     
stored as textfile
LOCATION '/path/to/hdfs/dir/csv'
tblproperties('skip.header.line.count'='1');

Then I count the rows (the correct result should be 1):

Select count(*) from hive_database.hive_table;

But the result is 4, which is incorrect. Do you have any idea how to solve it? Thanks all.


Answer 1:


There is currently no way to handle multiline CSV in Hive directly. However, there are some workarounds:

  1. Produce a CSV with \n or \r\n replaced with your own newline marker, such as <\br>. You will then be able to load it into Hive. Afterwards, transform the resulting text by replacing the marker with the original line break.

  2. Use Spark, which has a multiline CSV reader. This works out of the box, although the CSV is not read in a distributed way.

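    // Read a CSV whose quoted fields may span several lines and save it as an ORC table.
    // saveAsTable returns Unit, so the result is not assigned to a val.
    spark.read
    .option("wholeFile", true)   // early name of the multiline option, kept for old builds
    .option("multiline", true)
    .option("header", true)
    .option("inferSchema", "true")
    .option("dateFormat", "yyyy-MM-dd")
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
    .csv("test.csv")
    .write.format("orc")
    .saveAsTable("myschma.myTable")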
    
  3. Use another format such as Parquet, Avro, ORC, or sequence files instead of CSV. For example, you could use Sqoop to produce them from a JDBC database, or you could write your own program in Java or Python.
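
Workaround 1 could be sketched in Python roughly as follows. This is only an illustration, not part of the original answer: the function name `flatten_newlines` and the marker `<br>` are assumptions, and the marker must be a token that never occurs in the real data. The sample is the record from the question.

```python
import csv
import io

MARKER = "<br>"  # hypothetical placeholder; choose a token that never occurs in the data


def flatten_newlines(src, dst, marker=MARKER):
    """Copy CSV rows from src to dst, replacing line breaks inside fields with marker."""
    reader = csv.reader(src)                       # understands quoted multi-line fields
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [
                field.replace("\r\n", marker).replace("\n", marker).replace("\r", marker)
                for field in row
            ]
        )


# The record from the question: one header line plus one logical record on four lines.
sample = (
    "ID,PR_ID,SUMMARY\n"
    '2063,1184,"This is problem field because consists line break\n'
    "\n"
    "This is not new record but it is part of text of third column\n"
    '"\n'
)

out = io.StringIO()
flatten_newlines(io.StringIO(sample, newline=""), out)
flat = out.getvalue()  # header line + exactly one data line
```

After loading the flattened file, the marker can be turned back into a line break on the Hive side, for example with a string-replacement function such as `regexp_replace`.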




Answer 2:


I found the solution. You can define your own InputFormat. The DDL for the Hive table then looks like this (first you need to add your custom JAR file):

ADD JAR /path/to/your/jar/CSVCustomInputFormat.jar;
DROP TABLE hive_database.hive_table;
CREATE EXTERNAL TABLE  hive_database.hive_table
(   
    ID STRING,
    PR_ID STRING,
    SUMMARY STRING 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\"",
   "escapeChar"    = "\\"
) 
STORED AS 
INPUTFORMAT 'com.hql.custom.formatter.CSVCustomInputFormatt' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 
LOCATION '/path/to/hdfs/dir/csv'
tblproperties('skip.header.line.count'='1');

For how to create the custom input format, see for example: https://analyticsanvil.wordpress.com/2016/03/06/creating-a-custom-hive-input-format-and-record-reader-to-read-fixed-format-flat-files/
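
The core job of such a record reader is to keep joining physical lines until the quoted field is closed. The Python sketch below illustrates only that logic. It is not the Hadoop `InputFormat`/`RecordReader` API and not the code from the linked article; the function name `logical_records` is hypothetical, and it assumes quotes inside a field are escaped by doubling them (a backslash-escaped quote would need extra handling).

```python
def logical_records(lines, quote='"'):
    """Yield complete CSV records, joining physical lines while a quoted field is open."""
    buffer = []
    open_quote = False
    for line in lines:
        # An odd number of quote characters on a line toggles the "inside quotes" state.
        # Doubled quotes ("") add two characters, so they leave the state unchanged.
        if line.count(quote) % 2 == 1:
            open_quote = not open_quote
        buffer.append(line)
        if not open_quote:
            yield "\n".join(buffer)
            buffer = []
    if buffer:  # unterminated quote at end of input: emit what is left
        yield "\n".join(buffer)


# Physical lines of the file from the question (line terminators already stripped).
physical_lines = [
    "ID,PR_ID,SUMMARY",
    '2063,1184,"This is problem field because consists line break',
    "",
    "This is not new record but it is part of text of third column",
    '"',
]

records = list(logical_records(physical_lines))  # header + one data record
```

With the header skipped, this leaves a single data record, which is the count the question expects.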



Source: https://stackoverflow.com/questions/48763661/hive-table-from-csv-the-line-termination-in-quotes
