Removing duplicate rows in Hive based on columns

南方客 2021-02-09 18:19

I have a Hive table with 10 columns. The first 9 columns contain duplicate rows, while the 10th does not, because it is CREATE_DATE, which holds the date the row was created.

3 answers
  •  迷失自我
    2021-02-09 19:11

    Hive does not provide row-level update/delete on ordinary (non-transactional) tables, so rather than deleting duplicates afterwards, we can filter them out while loading the data into the base tables, as shown below.

    -- Staging table that receives the raw, possibly duplicated rows
    CREATE TABLE RAW_TABLE
    (
        COL1 STRING,
        COL2 STRING,
        CREATEDATE STRING,
        DAYID STRING,
        MARKETID STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
    
    LOAD DATA INPATH '/FOLDER/TO/EXAMPLE.txt' INTO TABLE RAW_TABLE;
    
    -- Keep one row per (col1, col2, dayid, marketid), with the latest createdate
    CREATE TABLE JLT_CLEAN AS
    SELECT col1,
      col2,
      dayid,
      marketid,
      MAX(createdate) AS createdate
    FROM RAW_TABLE
    GROUP BY col1,
      col2,
      dayid,
      marketid;
    

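    This is the approach we can use: grouping by every column except createdate and taking MAX(createdate) leaves exactly one row per group of duplicates.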
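
    If you would rather not list every column in a GROUP BY, a window function is a possible alternative. The following is only a sketch: it reuses the RAW_TABLE layout from above, JLT_CLEAN_RN is a made-up table name, and it assumes createdate is stored in a format that sorts correctly as a string (for example yyyy-MM-dd), since the column is a STRING.

    ```sql
    -- Hypothetical target table name; RAW_TABLE is the staging table defined above.
    CREATE TABLE JLT_CLEAN_RN AS
    SELECT col1,
      col2,
      dayid,
      marketid,
      createdate
    FROM (
      SELECT t.*,
        -- Number the rows inside each duplicate group, newest createdate first
        ROW_NUMBER() OVER (
          PARTITION BY col1, col2, dayid, marketid
          ORDER BY createdate DESC
        ) AS rn
      FROM RAW_TABLE t
    ) ranked
    WHERE rn = 1;  -- keep only the newest row of each group
    ```

    For the 10-column table in the question, the same idea should apply with the first 9 columns in PARTITION BY and CREATE_DATE in ORDER BY.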
