Removing duplicate rows in Hive based on columns

南方客 2021-02-09 18:19

I have a Hive table with 10 columns. The first 9 columns contain duplicate rows, while the 10th column, CREATE_DATE, does not, because it holds the date each row was created.

3 Answers
  • 2021-02-09 19:03

    You can do the following:

    select col1,col2,dayid,marketid,max(createdate) as createdate
    from tablename
    group by col1,col2,dayid,marketid
    

    This groups the data by every column except the date, so rows with the same values in those columns fall into the same group. You then pick the createdate you want to keep with an aggregate function such as max or min.
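
    If the goal is to replace the table's contents rather than only query them, one option is to write the grouped result back with INSERT OVERWRITE. This is a minimal sketch, assuming tablename is a plain (non-transactional) table whose columns are declared in exactly the order col1, col2, dayid, marketid, createdate. Hive matches inserted columns by position, so the select list has to be adjusted if the table is declared differently.

    ```sql
    -- Sketch only: assumes the column order of tablename is
    -- col1, col2, dayid, marketid, createdate (an assumption, not from the question).
    -- Keeps one row per (col1, col2, dayid, marketid) with the latest createdate.
    INSERT OVERWRITE TABLE tablename
    SELECT col1, col2, dayid, marketid, MAX(createdate) AS createdate
    FROM tablename
    GROUP BY col1, col2, dayid, marketid;
    ```

    Because this overwrites the data in place, it is worth running the SELECT on its own first, or writing the result to a new table and comparing row counts.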

  • 2021-02-09 19:11

    Hive does not provide row-level update/delete on ordinary (non-transactional) tables, so instead we can remove the duplicates while loading the data into the base tables, as shown below:

    CREATE TABLE RAW_TABLE  
    (
        COL1 STRING,
        COL2 STRING,
        CREATEDATE STRING,
        DAYID STRING,
        MARKETID STRING
    )
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
    
    -- Load the raw file, duplicates included, into the staging table
    LOAD DATA INPATH '/FOLDER/TO/EXAMPLE.txt' INTO TABLE RAW_TABLE;
    
    CREATE TABLE JLT_CLEAN AS
    SELECT col1,
      col2,
      dayid,
      marketid,
      MAX(createdate) AS createdate
    FROM RAW_TABLE
    GROUP BY col1,
      col2,
      dayid,
      marketid;
    

    This is what we can use.
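
    To confirm that the clean table really has one row per key, a quick check along these lines may help. It is a sketch that assumes the JLT_CLEAN table created above and treats (col1, col2, dayid, marketid) as the duplicate key.

    ```sql
    -- Lists any key that still appears more than once in JLT_CLEAN.
    -- An empty result means no duplicates remain.
    SELECT col1, col2, dayid, marketid, COUNT(*) AS cnt
    FROM JLT_CLEAN
    GROUP BY col1, col2, dayid, marketid
    HAVING COUNT(*) > 1;
    ```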

  • 2021-02-09 19:13

    With this approach we don't need to list every column name in the SQL:

    select * from (
      select *, row_number() over (partition by col1, col2 order by col1) tmp_row_number
      from table_name
    ) t
    where t.tmp_row_number = 1
    

    The only side effect is that the result contains an extra column, tmp_row_number.
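
    If the extra column is unwanted, or a specific row in each group should survive, the outer query can name the columns explicitly and the window can be ordered by the date. This is a sketch under assumptions: the column names col1, col2, dayid, marketid and createdate are borrowed from the other answers, and rn is a hypothetical alias.

    ```sql
    -- Keeps the most recent row per key and leaves the helper column out of the result.
    SELECT col1, col2, dayid, marketid, createdate
    FROM (
      SELECT s.*,
             row_number() OVER (
               PARTITION BY col1, col2, dayid, marketid
               ORDER BY createdate DESC
             ) AS rn
      FROM table_name s
    ) t
    WHERE t.rn = 1;
    ```

    Ordering by createdate makes the choice of surviving row meaningful. Ordering by a column that is already part of the partition leaves the pick within each group arbitrary.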
