I have a Hive table with 10 columns. The first 9 columns can contain duplicate rows, while the 10th column, CREATE_DATE, differs because it holds the date the row was created.
Hive does not provide row-level update/delete on ordinary (non-transactional) tables, so we can instead remove the duplicate data while loading it into the base tables, as shown below:
CREATE TABLE RAW_TABLE
(
COL1 STRING,
COL2 STRING,
CREATEDATE STRING,
DAYID STRING,
MARKETID STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/FOLDER/TO/EXAMPLE.txt' INTO TABLE RAW_TABLE;
CREATE TABLE JLT_CLEAN AS
SELECT col1,
col2,
dayid,
marketid,
MAX(createdate) AS createdate
FROM RAW_TABLE
GROUP BY col1,
col2,
dayid,
marketid;
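This is what we can use: grouping by every column except the create date and taking MAX(createdate) keeps one row per combination, with the latest create date.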
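To confirm that the de-duplication worked, a check along these lines can be run against the cleaned table. It is only a sketch, assuming the JLT_CLEAN table and column names from the example above; the alias cnt is just illustrative.

```sql
-- Count rows per key combination in the cleaned table.
-- If de-duplication worked, no group has more than one row,
-- so this query should return nothing.
SELECT col1,
       col2,
       dayid,
       marketid,
       COUNT(*) AS cnt
FROM JLT_CLEAN
GROUP BY col1,
         col2,
         dayid,
         marketid
HAVING COUNT(*) > 1;
```

Running the same query against the raw table instead shows which combinations were duplicated before cleaning.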