Question
I want to update existing data if it exists, based on a condition (the row with the higher priority should win), and insert new data if it does not exist.
I have already written a query for this, but it somehow duplicates rows. Here is the full explanation of what I have and what I want to achieve:
What I have: Table 1 - columns - id,info,priority
hive> select * from sample1;
OK
1 123 1.01
2 234 1.02
3 213 1.03
5 213423 1.32
Time taken: 1.217 seconds, Fetched: 4 row(s)
Table 2: columns - id,info,priority
hive> select * from sample2;
OK
1 1234 1.05
2 23412 1.01
3 21 1.05
4 1232 1.1
2 3432423 1.6
3 34324 1.4
What I want is a final table with only one row per id, holding the data with the greatest priority:
1 1234 1.05
2 3432423 1.6
3 34324 1.4
4 1232 1.1
5 213423 1.32
The query that I have written is this:
insert overwrite table sample1
select a.id,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from sample1 a
join
sample2 b
on a.id=b.id where b.id in (select distinct(id) from sample1)
union all
select * from sample2 where id not in (select distinct(id) from sample1)
union all
select * from sample1 where id not in (select distinct(id) from sample2);
After running this query, I am getting this result:
hive> select * from sample1;
OK
1 1234 1.05
2 234 1.02
3 21 1.05
2 3432423 1.6
3 34324 1.4
5 213423 1.32
4 1232 1.1
How do I modify the present query to achieve the correct result? Is there any other method/process that I can follow to achieve the end result? I am using Hadoop 2.5.2 with Hive 1.2.1, on a 6-node cluster with 5 slaves and 1 NameNode.
Answer 1:
Use a FULL JOIN: it returns all joined rows plus all non-joined rows from the left table and all non-joined rows from the right table.
The sample2 table contains duplicate rows per id, which is why the join multiplies rows. Use the row_number() analytic function to keep only the highest-priority row per id from sample2:
insert overwrite table sample1
select nvl(a.id, b.id) as id,
       case when b.prio is null or cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
       case when b.prio is null or cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from ( -- keep only the highest-priority row per id from sample1
       select * from ( select a.*, row_number() over (partition by id order by prio desc) rn
                       from sample1 a
                     ) a1
       where rn = 1
     ) a
full join
     ( -- keep only the highest-priority row per id from sample2
       select * from ( select b.*, row_number() over (partition by id order by prio desc) rn
                       from sample2 b
                     ) b1
       where rn = 1
     ) b
on a.id = b.id; -- the "b.prio is null" checks keep rows that exist only in sample1; nvl(a.id, b.id) keeps rows that exist only in sample2
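To see why a plain join on id multiplies rows, you can first check which ids occur more than once in sample2 (a quick check using the tables from the question):
-- ids with more than one row in sample2; each of them inflates the join result
select id, count(*) as cnt
from sample2
group by id
having count(*) > 1;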
If the sample1 table also contains multiple rows per id (it does not in your example), the same row_number() technique on the sample1 side handles that as well.
See also this answer about implementing a merge using a full join: https://stackoverflow.com/a/37744071/2700344
Also, as of Hive 2.2 you can use ACID MERGE; see the examples in the Hive documentation.
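A minimal sketch of what that MERGE could look like here, assuming sample1 has been recreated as a transactional (ACID) table and using the column name prio from the queries above:
merge into sample1 t
using ( -- deduplicate sample2 first: MERGE allows at most one source row to match each target row
        select id, info, prio
        from ( select s.*, row_number() over (partition by id order by prio desc) rn
               from sample2 s
             ) s1
        where rn = 1
      ) s
on t.id = s.id
when matched and cast(s.prio as double) > cast(t.prio as double)
     then update set info = s.info, prio = s.prio
when not matched
     then insert values (s.id, s.info, s.prio);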
Answer 2:
Since I had multiple rows for each id, I first consolidated the ids using a Spark script. That solution can be found here: "SPARK 2.2.2 - Joining multiple RDDs giving out of memory excepton. Resulting RDD has 124 columns. What should be the optimal joining method?". Then I used the query mentioned in the question to get the desired result.
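For reference, that consolidation step can also be written directly in HiveQL; this is a sketch only, since the answer above used Spark for its very wide data, and the table name sample2_dedup is made up here:
-- materialize one row per id from sample2, keeping the highest-priority row,
-- so the join in the original query no longer multiplies rows
create table sample2_dedup as
select id, info, prio
from ( select s.*, row_number() over (partition by id order by prio desc) rn
       from sample2 s
     ) s1
where rn = 1;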
Answer 3:
Adding to the previous good answers, try this as well (note that it simply prefers the sample2 row whenever a matching id exists; that matches the sample data, where sample2 always has the higher priority):
insert overwrite table UDB.SAMPLE1
select
    COALESCE(id2, id)
   ,COALESCE(info2, info)
   ,COALESCE(priority2, priority)
from UDB.SAMPLE1 TAB1
full outer join
(
    select id2, info2, priority2
    from
    (
        select
            id       as id2
           ,info     as info2
           ,priority as priority2
           ,row_number() over (partition by id order by priority desc) rn
        from UDB.SAMPLE2
    ) TAB2_wt
    where TAB2_wt.rn = 1
) TAB2
on TAB2.id2 = TAB1.id;
select * from SAMPLE1;
+-----+----------+-----------+--+
| id | info | priority |
+-----+----------+-----------+--+
| 1 | 1234 | 1.05 |
| 2 | 3432423 | 1.6 |
| 3 | 34324 | 1.4 |
| 4 | 1232 | 1.1 |
| 5 | 213423 | 1.32 |
+-----+----------+-----------+--+
Source: https://stackoverflow.com/questions/53001909/hive-how-to-update-the-existing-data-if-it-exists-based-on-some-condition-and-in