问题
I am doing full outer join on 4 tables on the same column. I want to generate only 1 row for each different value in the Join column.
Inputs are:
employee1
+---------------------+-----------------+--+
| employee1.personid | employee1.name |
+---------------------+-----------------+--+
| 111 | aaa |
| 222 | bbb |
| 333 | ccc |
+---------------------+-----------------+--+
employee2
+---------------------+----------------+--+
| employee2.personid | employee2.sal |
+---------------------+----------------+--+
| 111 | 2 |
| 200 | 3 |
+---------------------+----------------+--+
employee3
+---------------------+------------------+--+
| employee3.personid | employee3.place |
+---------------------+------------------+--+
| 111 | bbsr |
| 300 | atl |
| 200 | ny |
+---------------------+------------------+--+
employee4
+---------------------+---------------+--+
| employee4.personid | employee4.dt |
+---------------------+---------------+--+
| 111 | 2019-02-21 |
| 300 | 2019-03-18 |
| 400 | 2019-03-18 |
+---------------------+---------------+--+
Expected Result one record for each personid, so total there should be 6 records(111,222,333,200,300,400) Like:
+-----------+---------+--------+----------+-------------+--+
| personid | f.name | u.sal | v.place | v_in.dt |
+-----------+---------+--------+----------+-------------+--+
| 111 | aaa | 2 | bbsr | 2019-02-21 |
| 200 | NULL | 3 | ny | NULL |
| 222 | bbb | NULL | NULL | NULL |
| 300 | NULL | NULL | atl | 2019-03-18 |
| 333 | ccc | NULL | NULL | NULL |
| 400 | NULL | NULL | NULL | 2019-03-18 |
+-----------+---------+--------+----------+-------------+--+
Result i am getting is:
+-----------+---------+--------+----------+-------------+--+
| personid | f.name | u.sal | v.place | v_in.dt |
+-----------+---------+--------+----------+-------------+--+
| 111 | aaa | 2 | bbsr | 2019-02-21 |
| 200 | NULL | 3 | NULL | NULL |
| 200 | NULL | NULL | ny | NULL |
| 222 | bbb | NULL | NULL | NULL |
| 300 | NULL | NULL | atl | NULL |
| 300 | NULL | NULL | NULL | 2019-03-18 |
| 333 | ccc | NULL | NULL | NULL |
| 400 | NULL | NULL | NULL | 2019-03-18 |
+-----------+---------+--------+----------+-------------+--+
Query used:
select coalesce(f.personid, u.personid, v.personid, v_in.personid) as personid,f.name,u.sal,v.place,v_in.dt
from employee1 f FULL OUTER JOIN employee2 u on f.personid=u.personid
FULL OUTER JOIN employee3 v on f.personid=v.personid
FULL OUTER JOIN employee4 v_in on f.personid=v_in.personid;
Please suggest how to generate the expected result.
回答1:
full outer join
is tricky, because you have to take previous NULL
s into account. But you can do:
select coalesce(f.personid, u.personid, v.personid, v_in.personid) as personid,f.name,u.sal,v.place,v_in.dt
from employee1 f FULL OUTER JOIN
employee2 u
on f.personid = u.personid FULL OUTER JOIN
employee3 v
on v.personid in (f.person_id, u.person_id) FULL OUTER JOIN
employee4 v_in
on v_in.personid in (f.person_id, u.person_id, v.person_id);
In databases that support using
for join
s (instead of on
) this is simpler. I don't think that Hive supports using
, though.
回答2:
FULL JOIN returns all rows joined + all not joined from the left table + all not joined from the right table. And since you are joining employee2
, employee3
, employee4
to the same employee1
table which does not contain personid=200
, all not joined rows returned from all four tables.
I'd suggest to UNION ALL all four tables providing NULLs for missing fields + aggregate group by personid
:
select personid, max(name) name, max(sal) sal, max(place) place, max(dt) dt
from
(
select personid, name, NULL sal, NULL place, NULL dt from employee1 e1
UNION ALL
select personid, NULL name, sal, NULL place, NULL dt from employee2 e2
UNION ALL
select personid, NULL name, NULL sal, place, NULL dt from employee3 e3
UNION ALL
select personid, NULL name, NULL sal, NULL place, dt from employee4 e4
)s
group by personid;
This will perform better than joins.
来源:https://stackoverflow.com/questions/55225131/hive-full-outer-join-returning-multiple-rows-for-same-join-key