How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive

Asked 2020-12-28 17:14

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed in double quotes and separated by commas. While creating the external table in Hive, how should I handle these quoted fields?

7 Answers
  • 2020-12-28 17:58

    The following code solved the same type of problem:

    CREATE TABLE TableRowCSV2 (
        CODE STRING,
        PRODUCTCODE STRING,
        PRICE STRING
    )
    COMMENT 'row data csv'
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
        "separatorChar" = ",",
        "quoteChar"     = "\""
    )
    STORED AS TEXTFILE
    TBLPROPERTIES ("skip.header.line.count" = "1");
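The quoting convention OpenCSVSerde applies is the standard CSV one: a field wrapped in the quoteChar may contain the separator without splitting. As a quick sanity check, Python's csv module follows the same convention (the sample row here is hypothetical, not from the question's data):

```python
import csv
import io

# Hypothetical row: the second field contains an embedded comma,
# so it is enclosed in double quotes, as in the question's CSV.
sample = 'A001,"Widget, large",19.99\n'

# delimiter "," and quotechar '"' mirror the separatorChar/quoteChar
# SERDEPROPERTIES above: the quoted comma stays inside one field.
row = next(csv.reader(io.StringIO(sample), delimiter=',', quotechar='"'))
print(row)  # ['A001', 'Widget, large', '19.99']
```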
    
  • 2020-12-28 17:59

    Use the csv-serde-0.9.1.jar file in your Hive query; see http://illyayalovyy.github.io/csv-serde/

    add jar /path/to/jar_file;

    CREATE EXTERNAL TABLE emrS3_import_1 (col1 string, col2 string, col3 string, col4 string)
    ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
    WITH SERDEPROPERTIES (
        "separatorChar" = ";",
        "quoteChar"     = "\""
    )
    STORED AS TEXTFILE
    LOCATION 's3://emrTest/folder'
    TBLPROPERTIES ("skip.header.line.count" = "1");  -- skip the header row, if the file has one
    
  • 2020-12-28 18:00

    I was also stuck with the same issue, as my fields are enclosed in double quotes and separated by semicolons (;). My table name is employee1.

    After some searching I found the perfect solution for this.

    We have to use a SerDe. Please download the SerDe jar from this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar

    Then follow the steps below at the Hive prompt:

    add jar path/to/csv-serde.jar;

    create table employee1(id string, name string, addr string)
    row format serde 'com.bizo.hive.serde.csv.CSVSerde'
    with serdeproperties(
        "separatorChar" = ";",
        "quoteChar"     = "\"")
    stored as textfile;
    

    Then load the data from your given path using the query below:

    load data local inpath 'path/xyz.csv' into table employee1;
    

    and then run:

    select * from employee1;
    

    Now you will see the magic. Thanks.

  • 2020-12-28 18:09

    If you're stuck with the CSV file format, you'll have to use a custom SerDe; here's some work based on the opencsv library.

    But, if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite to escape any embedded commas with a single escape character, e.g. '\', which can be specified within the ROW FORMAT with ESCAPED BY:

    CREATE EXTERNAL TABLE emrS3_import_1 (col1 string, col2 string, col3 string, col4 string)
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        ESCAPED BY '\\'
    LOCATION 's3://emrTest/folder';
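The rewrite step this answer suggests can be sketched as a small preprocessing script: read the quoted CSV and re-emit it with embedded commas backslash-escaped, so the ESCAPED BY '\\' table can parse it. This is a minimal sketch with hypothetical sample data; real data would also need any pre-existing backslashes escaped first.

```python
import csv
import io

# Hypothetical quoted input row with an embedded comma.
quoted = 'A001,"Widget, large",19.99\n'

# Parse with standard CSV quoting, then re-join with the embedded
# commas escaped by a backslash instead of protected by quotes.
out_lines = []
for row in csv.reader(io.StringIO(quoted)):
    out_lines.append(','.join(field.replace(',', r'\,') for field in row))

print(out_lines[0])  # A001,Widget\, large,19.99
```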
    
  • 2020-12-28 18:13

    Hive doesn't support quoted strings right out of the box. There are two approaches to solving this:

    1. Use a different field separator (e.g. a pipe).
    2. Write a custom InputFormat based on OpenCSV.

    The faster (and arguably more sane) approach is to modify your initial export process to use a different delimiter so you can avoid quoted strings. This way you can tell Hive to use an external table with a tab or pipe delimiter:

    CREATE TABLE foo (
      col1 INT,
      col2 STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
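The re-export step described above can be sketched as a small conversion script: parse the quoted CSV and write it out pipe-delimited, so no quoting is needed. The sample data is hypothetical, and the sketch assumes no field actually contains a pipe character.

```python
import csv
import io

# Hypothetical quoted CSV input; the first row has an embedded comma.
quoted = '1,"hello, world"\n2,"plain"\n'

# Re-emit each parsed row with "|" as the delimiter, matching the
# FIELDS TERMINATED BY '|' table definition above.
lines = ['|'.join(row) for row in csv.reader(io.StringIO(quoted))]
print(lines)  # ['1|hello, world', '2|plain']
```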
    
  • 2020-12-28 18:15

    Hive now includes an OpenCSVSerde which will properly parse those quoted fields without additional jars or error-prone, slow regexes.

    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
