HIVE how to update the existing data if it exists based on some condition and insert new data if not exists

问题

I want to update the existing data if it exists based on some condition(data with higher priority should be updated) and insert new data if not exists.

I have already written a query for this but somehow it is duplicating the number of rows. Here is the full explanation of what I have and what I want to achieve:

What I have: Table 1 - columns - id,info,priority

hive> select * from sample1;
OK
1   123     1.01
2   234     1.02
3   213     1.03
5   213423  1.32
Time taken: 1.217 seconds, Fetched: 4 row(s)

Table 2: columns - id,info,priority

hive> select * from sample2;
OK
1   1234    1.05
2   23412   1.01
3   21      1.05
4   1232    1.1
2   3432423 1.6
3   34324   1.4

What I want is the final table should have only 1 row per id with the data according to the greatest priority:

1   1234    1.05
2   3432423 1.6
3   34324   1.4
4   1232    1.1
5   213423  1.32

The query that I have written is this:

insert overwrite table sample1
select a.id,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from sample1 a
join 
sample2 b
on a.id=b.id where b.id in (select distinct(id) from sample1)
union all
select * from sample2 where id not in (select distinct(id) from sample1)
union all
select * from sample1 where id not in (select distinct(id) from sample2);

After running this query, I am getting this result:

hive> select * from sample1;
OK
1   1234    1.05
2   234     1.02
3   21      1.05
2   3432423 1.6
3   34324   1.4
5   213423  1.32
4   1232    1.1

How do I modify the present query to achieve the correct result. Is there any other method/process that I can follow to achieve the end result. I am using hadoop 2.5.2 along with HIVE 1.2.1 . I am working on a 6 node cluster with 5 slaves and 1 NN.

回答1:

Use FULL JOIN, it will return all joined rows plus all not joined rows from the left and all not joined rows from the right tables. sample2 table contains duplicated rows per id, this is why join duplicates rows, use row_number() analytic function to select only rows with highest priority from sample2 table:

insert overwrite table sample1
select 
      nvl(a.id, b.id) as id,
      case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
      case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from ( select a.*, row_number() over (partition by id order by prio desc) rn 
         from sample1 a
     ) a
     full join 
          ( select b.*, row_number() over (partition by id order by prio desc) rn
              from sample2 b
          ) b on a.id=b.id and b.rn=1 --join only with highest priority rows
where a.rn=1;

If sample1 table also contains multiple rows per id (it is not in your example), apply the same technique using row_number to the table sample1.

See also the answer about merge using full join: https://stackoverflow.com/a/37744071/2700344

Also as of Hive 2.2 you can use ACID Merge, see examples

回答2:

Since I had multiple id rows for each id, so I firstly consolidated the IDs using a spark script. The solution could be found here : SPARK 2.2.2 - Joining multiple RDDs giving out of memory excepton. Resulting RDD has 124 columns. What should be the optimal joining method? Then I used the query mentioned in the question to get the desired result.

回答3:

adding to previously good answers! try this also:

insert overwrite table UDB.SAMPLE1
select 
 COALESCE(id2,id )
,COALESCE(info2,info)
,COALESCE(priority2, priority)
from 
UDB.SAMPLE1 TAB1
full outer JOIN
(
select id2, info2, priority2
from
(
select 
 id       as id2
,info     as info2
,priority as priority2
,row_number() over (partition by id order by priority desc) rn
from UDB.SAMPLE2
)TAB2_wt
where TAB2_wt.rn =1
)TAB2
on TAB2.id2 = TAB1.id
;

select * from SAMPLE1;

+-----+----------+-----------+--+
| id  |   info   | priority  |
+-----+----------+-----------+--+
| 1   | 1234     | 1.05      |
| 2   | 3432423  | 1.6       |
| 3   | 34324    | 1.4       |
| 4   | 1232     | 1.1       |
| 5   | 213423   | 1.32      |
+-----+----------+-----------+--+

来源：https://stackoverflow.com/questions/53001909/hive-how-to-update-the-existing-data-if-it-exists-based-on-some-condition-and-in

标签

Hadoop

Hive

hiveql

hadoop2