I am having trouble while trying to understand the concept of semi-join and how it is different from conventional join. I have tried some article already but not satisfied w
As I understand, a semi join is a left join or right join:
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
So the difference between a left (semi) join and a "conventional" join is that you only retrieve the data of the left table (where you have a match on your join condition). Whereas with a full (outer) join (I think thats what you mean by conventional join), you retrieve the data of both tables where your condition matches.
Simple example. Let's select students with grades using left outer join:
SELECT DISTINCT s.id
FROM students s
LEFT JOIN grades g ON g.student_id = s.id
WHERE g.student_id IS NOT NULL
Now the same with left semi-join:
SELECT s.id
FROM students s
WHERE EXISTS (SELECT 1 FROM grades g
WHERE g.student_id = s.id)
The latter is much more efficient.
As far as I know SQL dialects that support SEMIJOIN/ANTISEMI
are U-SQL/Cloudera Impala.
SEMIJOIN:
Semijoins are U-SQL’s way filter a rowset based on the inclusion of its rows in another rowset. Other SQL dialects express this with the SELECT * FROM A WHERE A.key IN (SELECT B.key FROM B) pattern.
More info Semi Join and Anti Join Should Have Their Own Syntax in SQL:
“Semi” means that we don’t really join the right hand side, we only check if a join would yield results for any given tuple.
-- IN
SELECT *
FROM Employee
WHERE DeptName IN (
SELECT DeptName
FROM Dept
)
-- EXISTS
SELECT *
FROM Employee
WHERE EXISTS (
SELECT 1
FROM Dept
WHERE Employee.DeptName = Dept.DeptName
)
EDIT:
Another dialect that supports SEMI/ANTISEMI join is KQL:
kind=leftsemi (or kind=rightsemi)
Returns all the records from the left side that have matches from the right. The result table contains columns from the left side only.
let t1 = datatable(key:long, value:string)
[1, "a",
2, "b",
3, "c"];
let t2 = datatable(key:long)
[1,3];
t1 | join kind=leftsemi (t2) on key
demo
Output:
key value
1 a
3 c