Behavior of INNER JOIN inside EXISTS in SQL

Question

Let's say there are 2 tables (in Oracle SQL) like this:

user(user_id, company_id):

123    | company_id_1 |
123    | company_id_2 |

company(id, version_id):

company_id_1 | (null)      |
company_id_2 | version_id1 |

The following query returns 2 rows:

company_id_1
company_id_2

SELECT distinct(company_id)
FROM user
WHERE user.user_id = 123 
AND user.company_id IS NOT NULL
AND EXISTS
  (SELECT 1
  FROM company
  INNER JOIN user ON company.id = user.company_id AND company.version_id IS NOT NULL);

I would expect only 1 result, company_id_2, but it returns 2 results (company_id_1 and company_id_2).

A couple of other notes:

The following query does return 1 result for me:

SELECT distinct(company_id)
FROM user
WHERE user.user_id = 123 
AND user.company_id IS NOT NULL
AND EXISTS
  (SELECT 1
  FROM company
  WHERE company.id = user.company_id AND company.version_id IS NOT NULL);

What's odd to me is that the following statement (running the inner join on its own) does return 1 result:

SELECT *
FROM company
INNER JOIN user ON company.id = user.company_id AND company.version_id IS NOT NULL
WHERE company.id IN ('company_id_1', 'company_id_2');

So why does the query with the inner join inside EXISTS return 2 results? Running the inner join on its own returns only 1 result, and the EXISTS condition should evaluate to true only for company_id_2, which has the non-null version_id.
Can you elaborate on the difference between the inner join inside the EXISTS and the regular WHERE clause inside the EXISTS? They both look the same to me.

Answer 1:

The difference is that when you use EXISTS, the query inside usually depends on the main query (meaning it uses columns from it and so can't be executed separately), so for each row of the main query it checks whether the subquery retrieves any data.

The problem with your first query is that the subquery inside the EXISTS block doesn't depend on the main query's columns in any way. It's a separate query that always returns a row with 1, so for any row of the main query the result of EXISTS will always be true. So, your first query is just equivalent to

SELECT distinct(company_id)
FROM user
WHERE user.user_id = 123 
AND user.company_id IS NOT NULL
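
To check this, the subquery can be run on its own against the sample data. The sketch below inlines the question's rows as CTEs so it needs no existing tables; the name users is an assumption (USER is a reserved word in Oracle, so the question's table name was presumably simplified), and the VARCHAR2 length is arbitrary.

```sql
-- Sample rows from the question, inlined as CTEs (table name "users" is assumed).
WITH users AS (
  SELECT 123 AS user_id, 'company_id_1' AS company_id FROM dual UNION ALL
  SELECT 123, 'company_id_2' FROM dual
),
company AS (
  SELECT 'company_id_1' AS id, CAST(NULL AS VARCHAR2(20)) AS version_id FROM dual UNION ALL
  SELECT 'company_id_2', 'version_id1' FROM dual
)
-- The EXISTS subquery of the first query, executed stand-alone.
-- It references no outer column, so it yields the same rows for every outer row.
SELECT 1
FROM company
INNER JOIN users ON company.id = users.company_id AND company.version_id IS NOT NULL;
```

With this data the statement returns a single row (the match for company_id_2), and one row is enough for EXISTS to be true for every row of the main query.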

Answer 2:

Without the join, your filter using user.company_id is correlated from outside the subquery. This means that for each row in your outer query, the subquery could return different results.

Joins in EXISTS subqueries are nothing special, but by joining to user inside the subquery you uncorrelated it: the user referenced there is the subquery's own copy of the table, not the outer one. The subquery can now run completely independently of the outer query. The EXISTS filter works exactly the same way, but because there is no correlation with the outer query, it will either always be true or always be false.
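
If the join has to stay inside the subquery, correlation can be restored by giving the two references to the table different aliases and tying them together explicitly. This is only a sketch under assumptions: the aliases u, u2 and c and the table name users (USER is reserved in Oracle) are illustrative, and the sample rows are inlined as CTEs so the statement runs without existing tables.

```sql
-- Sample rows from the question, inlined as CTEs (table name "users" is assumed).
WITH users AS (
  SELECT 123 AS user_id, 'company_id_1' AS company_id FROM dual UNION ALL
  SELECT 123, 'company_id_2' FROM dual
),
company AS (
  SELECT 'company_id_1' AS id, CAST(NULL AS VARCHAR2(20)) AS version_id FROM dual UNION ALL
  SELECT 'company_id_2', 'version_id1' FROM dual
)
SELECT DISTINCT u.company_id
FROM users u
WHERE u.user_id = 123
  AND u.company_id IS NOT NULL
  AND EXISTS
    (SELECT 1
     FROM company c
     INNER JOIN users u2 ON c.id = u2.company_id AND c.version_id IS NOT NULL
     -- this predicate references the outer alias u and makes the subquery correlated
     WHERE u2.company_id = u.company_id);
```

With the sample data this returns only company_id_2. The inner join to users adds nothing here that the plain correlated WHERE version does not already give, so the simpler form is usually preferable.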

Answer 3:

Query 1 gives you two records because the query inside the EXISTS is not a correlated sub-query, so it is true for both records with user.user_id = 123. Note that the user table inside the EXISTS and the user table outside (in the main query) are evaluated separately.

Your EXISTS condition makes no sense here: it is true for every record, because on its own the subquery returns one record.

SELECT distinct(company_id)
FROM user
WHERE user.user_id = 123 
AND user.company_id IS NOT NULL
-- the following behaves as an independent query
-- and has no relation to the main query's user table
AND EXISTS
  (SELECT 1
  FROM company
  INNER JOIN user ON company.id = user.company_id AND company.version_id IS NOT NULL);

Now, coming to your second query: it is a correlated sub-query, and EXISTS is false for user.company_id = 'company_id_1' (its version_id is NULL), so only one record, company_id_2, is returned.

SELECT distinct(company_id)
FROM user
WHERE user.user_id = 123 
AND user.company_id IS NOT NULL
-- the EXISTS condition references user, which here is the main query's user table
-- this is called a correlated sub-query
AND EXISTS
  (SELECT 1
  FROM company
  WHERE company.id = user.company_id AND company.version_id IS NOT NULL);
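
The two behaviors can be put side by side by evaluating both EXISTS conditions for each outer row. This is a sketch, not part of the original answers: the table name users (USER is reserved in Oracle), the aliases, and the output column names are assumptions, and the sample rows are inlined as CTEs.

```sql
-- Sample rows from the question, inlined as CTEs (table name "users" is assumed).
WITH users AS (
  SELECT 123 AS user_id, 'company_id_1' AS company_id FROM dual UNION ALL
  SELECT 123, 'company_id_2' FROM dual
),
company AS (
  SELECT 'company_id_1' AS id, CAST(NULL AS VARCHAR2(20)) AS version_id FROM dual UNION ALL
  SELECT 'company_id_2', 'version_id1' FROM dual
)
SELECT u.company_id,
       -- subquery from query 1: no reference to the outer row u
       CASE WHEN EXISTS (SELECT 1
                         FROM company c
                         INNER JOIN users u2 ON c.id = u2.company_id
                                            AND c.version_id IS NOT NULL)
            THEN 'true' ELSE 'false' END AS uncorrelated_exists,
       -- subquery from query 2: filtered by the outer row's company_id
       CASE WHEN EXISTS (SELECT 1
                         FROM company c
                         WHERE c.id = u.company_id
                           AND c.version_id IS NOT NULL)
            THEN 'true' ELSE 'false' END AS correlated_exists
FROM users u
WHERE u.user_id = 123
ORDER BY u.company_id;
-- company_id_1 | true | false
-- company_id_2 | true | true
```

The uncorrelated column has the same value on every row, which is why query 1 keeps both companies, while the correlated column changes with the outer row.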

Answer 4:

The WHERE clause looks at one row at a time:

FROM user
WHERE ... EXISTS ( ... )

looks at a user row and checks whether there exists some data.

In your second query you check whether there exists a company version for the user:

SELECT 1
FROM company
WHERE company.id = user.company_id -- the company for the user we are looking at in the main query
AND company.version_id IS NOT NULL

This is how to use EXISTS; the subquery looks for data related to the main query's row.

In your first query, however, your exists subquery is this:

SELECT 1
FROM company
INNER JOIN user ON company.id = user.company_id 
AND company.version_id IS NOT NULL

You can run this stand-alone; hence it does not relate to the main query. What you ask here is: does a user with a company version exist? The answer: yes, such a user exists. So the condition is true, regardless of which row you are dealing with in your main query. This is not how to use EXISTS. It is extremely rare to use a stand-alone EXISTS clause that is not related to the main query.

The third query merely joins the two tables and finds all rows matching all criteria. Joining on the ids alone gives you this:

+--------------+-----------------+--------------+--------------------+
| user.user_id | user.company_id | company.id   | company.version_id |
+--------------+-----------------+--------------+--------------------+
|          123 | company_id_1    | company_id_1 | (null)             |
|          123 | company_id_2    | company_id_2 | version_id1        |
+--------------+-----------------+--------------+--------------------+

of which only the second row also satisfies company.version_id IS NOT NULL (with an inner join it makes no difference whether that condition sits in the ON clause or in the WHERE clause). So, only the second row gets returned.

Another thing I'd like to mention: whenever you are forced to use DISTINCT, ask yourself what makes it necessary. How come there are duplicates you must remove? A normalized database usually doesn't contain duplicate data, so it's probably a weakness in the query, which builds an overly large intermediate result that you must then reduce.

If looking for companies, select from companies:

select *
from company
where version_id is not null
and id in (select company_id from user where user_id = 123);

or with EXISTS instead of IN:

select *
from company
where version_id is not null
and exists 
( 
  select null 
  from user
  where user.company_id = company.id
  and user.user_id = 123
);

Source: https://stackoverflow.com/questions/65033995/behavior-of-inner-join-inside-exists-sql

Tags

sql

Oracle

inner-join

exists