SQL JOIN and different types of JOINs

[愿得一人] 2020-11-21 05:39

What is a SQL JOIN and what are different types?

6 answers
  •  夕颜 (original poster)
    2020-11-21 05:56

    Interestingly most other answers suffer from these two problems:

    • They focus on basic forms of join only
    • They (ab)use Venn diagrams, which are an inaccurate tool for visualising joins (they're much better for unions).

    I've recently written an article on the topic: A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL, which I'll summarise here.

    First and foremost: JOINs are cartesian products

    This is why Venn diagrams explain them so inaccurately: a JOIN creates a cartesian product between the two joined tables, pairing every row of one table with every row of the other.

    [Figure: Wikipedia's illustration of a cartesian product]

    The SQL syntax for cartesian products is CROSS JOIN. For example:

    SELECT *
    
    -- This just generates all the days in January 2017
    FROM generate_series(
      '2017-01-01'::TIMESTAMP,
      '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
      INTERVAL '1 day'
    ) AS days(day)
    
    -- Here, we're combining all days with all departments
    CROSS JOIN departments
    

    Which combines all rows from one table with all rows from the other table:

    Source:

    +--------+   +------------+
    | day    |   | department |
    +--------+   +------------+
    | Jan 01 |   | Dept 1     |
    | Jan 02 |   | Dept 2     |
    | ...    |   | Dept 3     |
    | Jan 30 |   +------------+
    | Jan 31 |
    +--------+
    

    Result:

    +--------+------------+
    | day    | department |
    +--------+------------+
    | Jan 01 | Dept 1     |
    | Jan 01 | Dept 2     |
    | Jan 01 | Dept 3     |
    | Jan 02 | Dept 1     |
    | Jan 02 | Dept 2     |
    | Jan 02 | Dept 3     |
    | ...    | ...        |
    | Jan 31 | Dept 1     |
    | Jan 31 | Dept 2     |
    | Jan 31 | Dept 3     |
    +--------+------------+
    

    If we just write a comma separated list of tables, we'll get the same:

    -- CROSS JOINing two tables:
    SELECT * FROM table1, table2
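
If you want to check this yourself, here is a minimal sketch using Python's built-in sqlite3 module. The tables `days` and `departments` and their contents are made up for illustration, and SQLite stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical toy tables: 3 days and 2 departments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE days (day TEXT);
    CREATE TABLE departments (department TEXT);
    INSERT INTO days VALUES ('Jan 01'), ('Jan 02'), ('Jan 03');
    INSERT INTO departments VALUES ('Dept 1'), ('Dept 2');
""")

# Explicit CROSS JOIN syntax.
explicit = conn.execute(
    "SELECT day, department FROM days CROSS JOIN departments"
).fetchall()

# Comma-separated table list.
implicit = conn.execute(
    "SELECT day, department FROM days, departments"
).fetchall()

# Every day is paired with every department: 3 * 2 = 6 rows.
print(len(explicit))
# Both spellings return the same set of rows.
print(sorted(explicit) == sorted(implicit))
```

The row count is always the product of the two table sizes, which is exactly what a Venn diagram cannot show.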
    

    INNER JOIN (Theta-JOIN)

    An INNER JOIN is just a filtered CROSS JOIN where the filter predicate is called Theta in relational algebra.

    For instance:

    SELECT *
    
    -- Same as before
    FROM generate_series(
      '2017-01-01'::TIMESTAMP,
      '2017-01-01'::TIMESTAMP + INTERVAL '1 month -1 day',
      INTERVAL '1 day'
    ) AS days(day)
    
    -- Now, exclude all days/departments combinations for
    -- days before the department was created
    JOIN departments AS d ON day >= d.created_at
    

    Note that the keyword INNER is optional (except in MS Access).

    (look at the article for result examples)
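
The "filtered CROSS JOIN" claim can be checked with a small sketch in Python's sqlite3. The table contents are hypothetical, and dates are stored as ISO strings so that `>=` compares them correctly in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE days (day TEXT);
    CREATE TABLE departments (department TEXT, created_at TEXT);
    INSERT INTO days VALUES ('2017-01-01'), ('2017-01-02'), ('2017-01-03');
    INSERT INTO departments VALUES
        ('Dept 1', '2017-01-01'),
        ('Dept 2', '2017-01-03');
""")

# INNER JOIN with a non-equality (theta) predicate.
inner = conn.execute("""
    SELECT day, department
    FROM days
    JOIN departments AS d ON day >= d.created_at
    ORDER BY day, department
""").fetchall()

# The same predicate applied as a filter on the cartesian product.
filtered_cross = conn.execute("""
    SELECT day, department
    FROM days
    CROSS JOIN departments AS d
    WHERE day >= d.created_at
    ORDER BY day, department
""").fetchall()

print(inner == filtered_cross)
for row in inner:
    print(row)
```

Dept 1 exists on all three days and Dept 2 only on the last one, so four of the six possible combinations survive the filter.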

    EQUI JOIN

    A special kind of Theta-JOIN is equi JOIN, which we use most. The predicate joins the primary key of one table with the foreign key of another table. If we use the Sakila database for illustration, we can write:

    SELECT *
    FROM actor AS a
    JOIN film_actor AS fa ON a.actor_id = fa.actor_id
    JOIN film AS f ON f.film_id = fa.film_id
    

    This combines all actors with their films.

    Or also, on some databases:

    SELECT *
    FROM actor
    JOIN film_actor USING (actor_id)
    JOIN film USING (film_id)
    

    The USING() syntax names a column that must be present in both tables of the JOIN operation and creates an equality predicate on those two columns.
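
One visible difference between ON and USING() is what `SELECT *` returns. The sketch below uses Python's sqlite3 with a cut-down, made-up version of the Sakila tables; other databases may label columns differently, so treat the column counts as SQLite behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO actor VALUES (1, 'PENELOPE'), (2, 'NICK');
    INSERT INTO film_actor VALUES (1, 10), (1, 20), (2, 10);
""")

on_cursor = conn.execute(
    "SELECT * FROM actor AS a JOIN film_actor AS fa ON a.actor_id = fa.actor_id"
)
on_columns = [col[0] for col in on_cursor.description]
on_rows = on_cursor.fetchall()

using_cursor = conn.execute(
    "SELECT * FROM actor JOIN film_actor USING (actor_id)"
)
using_columns = [col[0] for col in using_cursor.description]
using_rows = using_cursor.fetchall()

# ON keeps actor_id from both tables, USING collapses it into one column.
print(len(on_columns), len(using_columns))
# The matched rows are the same either way.
print(len(on_rows), len(using_rows))
```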

    NATURAL JOIN

    Other answers have listed this "JOIN type" separately, but that doesn't make sense. It's just syntactic sugar for equi JOIN, which is a special case of Theta-JOIN or INNER JOIN. NATURAL JOIN simply collects all columns that are common to both tables being joined and joins USING() those columns. This is hardly ever useful, because of accidental matches (like LAST_UPDATE columns in the Sakila database).

    Here's the syntax:

    SELECT *
    FROM actor
    NATURAL JOIN film_actor
    NATURAL JOIN film
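
To see how an accidental match bites, here is a small sketch with Python's sqlite3. The two tables mimic Sakila's shared `last_update` column, but the rows and timestamps are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (
        actor_id INTEGER PRIMARY KEY, first_name TEXT, last_update TEXT);
    CREATE TABLE film_actor (
        actor_id INTEGER, film_id INTEGER, last_update TEXT);
    INSERT INTO actor VALUES
        (1, 'PENELOPE', '2006-02-15'),
        (2, 'NICK',     '2006-02-15');
    INSERT INTO film_actor VALUES
        (1, 10, '2006-02-15'),
        (2, 10, '2007-01-01');  -- this row was updated later
""")

# Intended join: match on actor_id only.
using_rows = conn.execute("""
    SELECT first_name, film_id
    FROM actor JOIN film_actor USING (actor_id)
    ORDER BY first_name
""").fetchall()

# NATURAL JOIN silently joins on actor_id AND last_update.
natural_rows = conn.execute("""
    SELECT first_name, film_id
    FROM actor NATURAL JOIN film_actor
    ORDER BY first_name
""").fetchall()

print(using_rows)
print(natural_rows)
```

NICK drops out of the NATURAL JOIN result only because his two `last_update` values differ, which has nothing to do with the relationship being queried.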
    

    OUTER JOIN

    Now, OUTER JOIN is a bit different from INNER JOIN as it creates a UNION of several cartesian products. We can write:

    -- Convenient syntax:
    SELECT *
    FROM a LEFT JOIN b ON <predicate>
    
    -- Cumbersome, equivalent syntax:
    SELECT a.*, b.*
    FROM a JOIN b ON <predicate>
    UNION ALL
    SELECT a.*, NULL, NULL, ..., NULL
    FROM a
    WHERE NOT EXISTS (
      SELECT * FROM b WHERE <predicate>
    )
    

    No one wants to write the latter, so we write OUTER JOIN (which is usually better optimised by databases).

    Like INNER, the keyword OUTER is optional, here.
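
The equivalence of the two spellings can be checked on toy data. This is a sketch in Python's sqlite3; the tables `a` and `b`, their columns, and the predicate `a.id = b.a_id` are assumptions chosen for illustration.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, a_val TEXT);
    CREATE TABLE b (a_id INTEGER, b_val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (1, 'b1'), (1, 'b2'), (3, 'b3');
""")

left_join = conn.execute("""
    SELECT a.id, a.a_val, b.a_id, b.b_val
    FROM a LEFT JOIN b ON a.id = b.a_id
""").fetchall()

emulated = conn.execute("""
    SELECT a.id, a.a_val, b.a_id, b.b_val
    FROM a JOIN b ON a.id = b.a_id
    UNION ALL
    SELECT a.id, a.a_val, NULL, NULL
    FROM a
    WHERE NOT EXISTS (SELECT * FROM b WHERE a.id = b.a_id)
""").fetchall()

# Compare as multisets, since neither query guarantees a row order.
print(Counter(left_join) == Counter(emulated))
# The unmatched row of a appears once, padded with NULLs.
print((2, 'a2', None, None) in left_join)
```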

    OUTER JOIN comes in three flavours:

    • LEFT [ OUTER ] JOIN: The left table of the JOIN expression is added to the union as shown above.
    • RIGHT [ OUTER ] JOIN: The right table of the JOIN expression is added to the union as shown above.
    • FULL [ OUTER ] JOIN: Both tables of the JOIN expression are added to the union as shown above.

    All of these can be combined with the keyword USING() or with NATURAL (I've actually had a real world use-case for a NATURAL FULL JOIN recently).

    Alternative syntaxes

    There are some historic, deprecated syntaxes in Oracle and SQL Server, which supported OUTER JOIN already before the SQL standard had a syntax for this:

    -- Oracle
    SELECT *
    FROM actor a, film_actor fa, film f
    WHERE a.actor_id = fa.actor_id(+)
    AND fa.film_id = f.film_id(+)
    
    -- SQL Server
    SELECT *
    FROM actor a, film_actor fa, film f
    WHERE a.actor_id *= fa.actor_id
    AND fa.film_id *= f.film_id
    

    Having said that, don't use this syntax. I just list it here so you can recognise it in old blog posts and legacy code.

    Partitioned OUTER JOIN

    Few people know this, but the SQL standard specifies partitioned OUTER JOIN (and Oracle implements it). You can write things like this:

    WITH
    
      -- Using CONNECT BY to generate all dates in January
      days(day) AS (
        SELECT DATE '2017-01-01' + LEVEL - 1
        FROM dual
        CONNECT BY LEVEL <= 31
      ),
    
      -- Our departments
      departments(department, created_at) AS (
        SELECT 'Dept 1', DATE '2017-01-10' FROM dual UNION ALL
        SELECT 'Dept 2', DATE '2017-01-11' FROM dual UNION ALL
        SELECT 'Dept 3', DATE '2017-01-12' FROM dual UNION ALL
        SELECT 'Dept 4', DATE '2017-04-01' FROM dual UNION ALL
        SELECT 'Dept 5', DATE '2017-04-02' FROM dual
      )
    SELECT *
    FROM days 
    LEFT JOIN departments 
      PARTITION BY (department) -- This is where the magic happens
      ON day >= created_at
    

    Parts of the result:

    +--------+------------+------------+
    | day    | department | created_at |
    +--------+------------+------------+
    | Jan 01 | Dept 1     |            | -- Didn't match, but still get row
    | Jan 02 | Dept 1     |            | -- Didn't match, but still get row
    | ...    | Dept 1     |            | -- Didn't match, but still get row
    | Jan 09 | Dept 1     |            | -- Didn't match, but still get row
    | Jan 10 | Dept 1     | Jan 10     | -- Matches, so get join result
    | Jan 11 | Dept 1     | Jan 10     | -- Matches, so get join result
    | Jan 12 | Dept 1     | Jan 10     | -- Matches, so get join result
    | ...    | Dept 1     | Jan 10     | -- Matches, so get join result
    | Jan 31 | Dept 1     | Jan 10     | -- Matches, so get join result
    

    The point here is that all rows from the partitioned side of the join will wind up in the result regardless of whether the JOIN matched anything on the "other side of the JOIN". Long story short: this fills in the gaps in sparse data for reports. Very useful!
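
On databases without PARTITION BY in joins, one possible emulation is to cross join the days with the distinct partition keys first and then outer join the real rows. The sketch below does that in Python's sqlite3 on invented data; it is my assumption of a workable emulation, not the syntax the standard defines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE days (day TEXT);
    CREATE TABLE departments (department TEXT, created_at TEXT);
    INSERT INTO days VALUES ('2017-01-09'), ('2017-01-10'), ('2017-01-11');
    INSERT INTO departments VALUES
        ('Dept 1', '2017-01-10'),
        ('Dept 2', '2017-01-11');
""")

rows = conn.execute("""
    SELECT d.day, p.department, dep.created_at
    FROM days AS d
    -- One copy of every day per partition key (department)
    CROSS JOIN (SELECT DISTINCT department FROM departments) AS p
    -- Then outer join the actual rows of that partition
    LEFT JOIN departments AS dep
      ON dep.department = p.department
     AND d.day >= dep.created_at
    ORDER BY p.department, d.day
""").fetchall()

for row in rows:
    print(row)
```

Each department gets one row per day, with `created_at` left NULL on the days before the department existed.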

    SEMI JOIN

    Seriously? No other answer got this? Of course not, because it doesn't have a native syntax in SQL, unfortunately (just like ANTI JOIN below). But we can use IN() and EXISTS(), e.g. to find all actors who have played in films:

    SELECT *
    FROM actor a
    WHERE EXISTS (
      SELECT * FROM film_actor fa
      WHERE a.actor_id = fa.actor_id
    )
    

    The WHERE a.actor_id = fa.actor_id predicate acts as the semi join predicate. If you don't believe it, check out execution plans, e.g. in Oracle. You'll see that the database executes a SEMI JOIN operation, not the EXISTS() predicate.
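
What sets a SEMI JOIN apart from an ordinary JOIN is that it never duplicates rows from the left side. A small sketch in Python's sqlite3, using a made-up slice of the Sakila tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO actor VALUES (1, 'PENELOPE'), (2, 'NICK'), (3, 'ED');
    INSERT INTO film_actor VALUES (1, 10), (1, 20), (2, 10);
""")

# SEMI JOIN spelled with EXISTS()
semi_exists = conn.execute("""
    SELECT a.first_name
    FROM actor AS a
    WHERE EXISTS (
      SELECT * FROM film_actor AS fa WHERE a.actor_id = fa.actor_id
    )
    ORDER BY a.actor_id
""").fetchall()

# SEMI JOIN spelled with IN()
semi_in = conn.execute("""
    SELECT a.first_name
    FROM actor AS a
    WHERE a.actor_id IN (SELECT actor_id FROM film_actor)
    ORDER BY a.actor_id
""").fetchall()

# An INNER JOIN repeats PENELOPE once per film
inner = conn.execute("""
    SELECT a.first_name
    FROM actor AS a
    JOIN film_actor AS fa ON a.actor_id = fa.actor_id
    ORDER BY a.actor_id
""").fetchall()

print(semi_exists)
print(semi_in)
print(inner)
```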

    ANTI JOIN

    This is just the opposite of SEMI JOIN (be careful not to use NOT IN though: if the subquery returns even a single NULL, NOT IN yields no rows at all)

    Here are all the actors without films:

    SELECT *
    FROM actor a
    WHERE NOT EXISTS (
      SELECT * FROM film_actor fa
      WHERE a.actor_id = fa.actor_id
    )
    

    Some folks (especially MySQL people) also write ANTI JOIN like this:

    SELECT *
    FROM actor a
    LEFT JOIN film_actor fa
    USING (actor_id)
    WHERE film_id IS NULL
    

    I think the historic reason is performance.
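
The three ANTI JOIN spellings can be compared on toy data. This sketch uses Python's sqlite3 with invented rows, including one `film_actor` row whose `actor_id` is NULL so that the NOT IN behaviour becomes visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO actor VALUES (1, 'PENELOPE'), (2, 'NICK'), (3, 'ED');
    -- The last row has no actor: this NULL is what trips up NOT IN
    INSERT INTO film_actor VALUES (1, 10), (2, 10), (NULL, 30);
""")

not_exists = conn.execute("""
    SELECT a.first_name
    FROM actor AS a
    WHERE NOT EXISTS (
      SELECT * FROM film_actor AS fa WHERE a.actor_id = fa.actor_id
    )
""").fetchall()

left_join_is_null = conn.execute("""
    SELECT a.first_name
    FROM actor AS a
    LEFT JOIN film_actor AS fa USING (actor_id)
    WHERE fa.film_id IS NULL
""").fetchall()

not_in = conn.execute("""
    SELECT a.first_name
    FROM actor AS a
    WHERE a.actor_id NOT IN (SELECT actor_id FROM film_actor)
""").fetchall()

print(not_exists)         # ED is the only actor without films
print(left_join_is_null)  # same answer
print(not_in)             # empty, because of the NULL in the subquery
```

`3 NOT IN (1, 2, NULL)` evaluates to unknown rather than true, so the NOT IN version filters out every actor.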

    LATERAL JOIN

    OMG, this one is too cool. I'm the only one to mention it? Here's a cool query:

    SELECT a.first_name, a.last_name, f.*
    FROM actor AS a
    LEFT OUTER JOIN LATERAL (
      SELECT f.title, SUM(amount) AS revenue
      FROM film AS f
      JOIN film_actor AS fa USING (film_id)
      JOIN inventory AS i USING (film_id)
      JOIN rental AS r USING (inventory_id)
      JOIN payment AS p USING (rental_id)
      WHERE fa.actor_id = a.actor_id -- JOIN predicate with the outer query!
      GROUP BY f.film_id
      ORDER BY revenue DESC
      LIMIT 5
    ) AS f
    ON true
    

    It will find the TOP 5 revenue producing films per actor. Every time you need a TOP-N-per-something query, LATERAL JOIN will be your friend. If you're a SQL Server person, then you know this JOIN type under the name APPLY:

    SELECT a.first_name, a.last_name, f.*
    FROM actor AS a
    OUTER APPLY (
      -- SQL Server has no LIMIT clause; TOP 5 with ORDER BY does the same job
      SELECT TOP 5 f.title, SUM(p.amount) AS revenue
      FROM film AS f
      JOIN film_actor AS fa ON f.film_id = fa.film_id
      JOIN inventory AS i ON f.film_id = i.film_id
      JOIN rental AS r ON i.inventory_id = r.inventory_id
      JOIN payment AS p ON r.rental_id = p.rental_id
      WHERE fa.actor_id = a.actor_id -- JOIN predicate with the outer query!
      GROUP BY f.film_id, f.title -- SQL Server wants every selected non-aggregate column here
      ORDER BY revenue DESC
    ) AS f
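
If your database has neither LATERAL nor APPLY, a window function is a common alternative for TOP-N-per-something. The sketch below swaps in ROW_NUMBER() and runs it through Python's sqlite3 (this assumes a bundled SQLite of version 3.25 or later). The single `film_revenue` table with precomputed revenues is a hypothetical simplification of the Sakila joins above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE film_revenue (actor_id INTEGER, title TEXT, revenue INTEGER);
    INSERT INTO actor VALUES (1, 'PENELOPE'), (2, 'NICK'), (3, 'ED');
    INSERT INTO film_revenue VALUES
        (1, 'A', 30), (1, 'B', 50), (1, 'C', 10),
        (2, 'D', 20);
""")

# Top 2 films per actor. Actors without films are kept, as with
# LEFT JOIN LATERAL or OUTER APPLY.
rows = conn.execute("""
    SELECT first_name, title, revenue
    FROM (
      SELECT a.actor_id, a.first_name, fr.title, fr.revenue,
             ROW_NUMBER() OVER (
               PARTITION BY a.actor_id
               ORDER BY fr.revenue DESC
             ) AS rn
      FROM actor AS a
      LEFT JOIN film_revenue AS fr ON fr.actor_id = a.actor_id
    ) AS ranked
    WHERE rn <= 2
    ORDER BY actor_id, revenue DESC
""").fetchall()

for row in rows:
    print(row)
```

This ranks all rows first and filters afterwards, whereas LATERAL evaluates the subquery once per outer row; which one performs better depends on the database and the data.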
    

    OK, perhaps that's cheating, because a LATERAL JOIN or APPLY expression is really a "correlated subquery" that produces several rows. But if we allow for "correlated subqueries", we can also talk about...

    MULTISET

    This is only really implemented by Oracle and Informix (to my knowledge), but it can be emulated in PostgreSQL using arrays and/or XML and in SQL Server using XML.

    MULTISET produces a correlated subquery and nests the resulting set of rows in the outer query. The below query selects all actors and for each actor collects their films in a nested collection:

    SELECT a.*, MULTISET (
      SELECT f.*
      FROM film AS f
      JOIN film_actor AS fa USING (film_id)
      WHERE a.actor_id = fa.actor_id
    ) AS films
    FROM actor AS a
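
SQLite has no MULTISET either, but the nesting idea can be sketched with a correlated subquery that aggregates into a string, which the client then splits into a list. This is an emulation of my own choosing in Python's sqlite3, with invented rows and a `|` separator that is assumed not to occur in titles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO actor VALUES (1, 'PENELOPE'), (2, 'NICK'), (3, 'ED');
    INSERT INTO film VALUES (10, 'ACADEMY DINOSAUR'), (20, 'ZORRO ARK');
    INSERT INTO film_actor VALUES (1, 10), (1, 20), (2, 10);
""")

rows = conn.execute("""
    SELECT a.first_name,
           (SELECT group_concat(f.title, '|')
            FROM film AS f
            JOIN film_actor AS fa USING (film_id)
            WHERE a.actor_id = fa.actor_id) AS films
    FROM actor AS a
    ORDER BY a.actor_id
""").fetchall()

# Unpack the aggregated string into a nested list per actor.
# group_concat() gives no ordering guarantee, so sort on the client.
films_by_actor = {
    first_name: sorted(films.split("|")) if films else []
    for first_name, films in rows
}

print(films_by_actor)
```

An actor without films gets an empty collection instead of disappearing, which is the behaviour a nested collection is meant to give.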
    

    As you have seen, there are more types of JOIN than just the "boring" INNER, OUTER, and CROSS JOIN that are usually mentioned. More details in my article. And please, stop using Venn diagrams to illustrate them.
