Removing DUPLICATE rows in hive based on columns

前端未结

关注

 3  1274

南方客 2021-02-09 18:19

I have a HIVE table with 10 columns where first 9 columns will have duplicate rows while the 10th column will not as it CREATE_DATE which will have the date it was created.

3条回答

迷失自我 (楼主)

2021-02-09 19:11

Well, hive does not provide row level update/delete, therefore we can avoid the duplicate data while loading the data in base tables.As shown below

CREATE TABLE RAW_TABLE  
(
    COL1 STRING,
    COL2 STRING,
    CREATEDATE STRING,
    DAYID STRING,
    MARKETID STRING
)
ROW FORMAT DELIMITED 
FIELDS TERMINATE BY'\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/FOLDER/TO/EXAMPLE.txt  INTO RAW_TABLE;

CREATE TABLE JLT_CLEAN AS
SELECT col1,
  col2,
  dayid,
  marketid,
  MAX(createdate) AS createdate
FROM JLT_STAHING
GROUP BY col1,
  col2,
  dayid,
  marketid;

This what we can use.

0 讨论(0)

查看其它3个回答