Join vs. sub-query

广开言路 2020-11-21 05:05

I am an old-school MySQL user and have always preferred JOINs over sub-queries. Nowadays everyone seems to use sub-queries, and I dislike them, though I can't say why.

19 answers
  •  逝去的感伤
    2020-11-21 05:14

    I think the cited answers under-emphasize the issue of duplicates, and the misleading results they can produce in specific use cases (although Marcelo Cantos does mention it).

    I will cite the example from Stanford's Lagunita courses on SQL.

    Student Table

    +------+--------+------+--------+
    | sID  | sName  | GPA  | sizeHS |
    +------+--------+------+--------+
    |  123 | Amy    |  3.9 |   1000 |
    |  234 | Bob    |  3.6 |   1500 |
    |  345 | Craig  |  3.5 |    500 |
    |  456 | Doris  |  3.9 |   1000 |
    |  567 | Edward |  2.9 |   2000 |
    |  678 | Fay    |  3.8 |    200 |
    |  789 | Gary   |  3.4 |    800 |
    |  987 | Helen  |  3.7 |    800 |
    |  876 | Irene  |  3.9 |    400 |
    |  765 | Jay    |  2.9 |   1500 |
    |  654 | Amy    |  3.9 |   1000 |
    |  543 | Craig  |  3.4 |   2000 |
    +------+--------+------+--------+
    

    Apply Table

    (applications made to specific universities and majors)

    +------+----------+----------------+----------+
    | sID  | cName    | major          | decision |
    +------+----------+----------------+----------+
    |  123 | Stanford | CS             | Y        |
    |  123 | Stanford | EE             | N        |
    |  123 | Berkeley | CS             | Y        |
    |  123 | Cornell  | EE             | Y        |
    |  234 | Berkeley | biology        | N        |
    |  345 | MIT      | bioengineering | Y        |
    |  345 | Cornell  | bioengineering | N        |
    |  345 | Cornell  | CS             | Y        |
    |  345 | Cornell  | EE             | N        |
    |  678 | Stanford | history        | Y        |
    |  987 | Stanford | CS             | Y        |
    |  987 | Berkeley | CS             | Y        |
    |  876 | Stanford | CS             | N        |
    |  876 | MIT      | biology        | Y        |
    |  876 | MIT      | marine biology | N        |
    |  765 | Stanford | history        | Y        |
    |  765 | Cornell  | history        | N        |
    |  765 | Cornell  | psychology     | Y        |
    |  543 | MIT      | CS             | N        |
    +------+----------+----------------+----------+
    

    Let's try to find the GPAs of students who have applied to a CS major (regardless of the university).

    Using a subquery:

    select GPA from Student where sID in (select sID from Apply where major = 'CS');
    
    +------+
    | GPA  |
    +------+
    |  3.9 |
    |  3.5 |
    |  3.7 |
    |  3.9 |
    |  3.4 |
    +------+
    

    The average value for this result set is:

    select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS');
    
    +--------------------+
    | avg(GPA)           |
    +--------------------+
    | 3.6800000000000006 |
    +--------------------+
    

    Using a join:

    select GPA from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
    
    +------+
    | GPA  |
    +------+
    |  3.9 |
    |  3.9 |
    |  3.5 |
    |  3.7 |
    |  3.7 |
    |  3.9 |
    |  3.4 |
    +------+
    

    The average value for this result set is:

    select avg(GPA) from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
    
    +-------------------+
    | avg(GPA)          |
    +-------------------+
    | 3.714285714285714 |
    +-------------------+
    

    The second attempt clearly yields misleading results for our use case, because the join returns one row per matching application, so students who applied to CS at more than one university are counted more than once in the average. Adding distinct to the join-based statement does not fix this either: it would keep only one of the three occurrences of the 3.9 score. The correct result accounts for exactly two occurrences of 3.9, because two students with that GPA meet the query criteria.
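    To make the comparison reproducible, here is a minimal sketch that loads the two tables above into an in-memory database and runs the averages side by side. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL. The helper names build_db and averages, and the extra "join_dedup" query that joins against the distinct applicant IDs, are my own illustrations and not part of the course material.

```python
import sqlite3

# Sample data copied from the Student and Apply tables above.
STUDENTS = [
    (123, "Amy", 3.9, 1000), (234, "Bob", 3.6, 1500), (345, "Craig", 3.5, 500),
    (456, "Doris", 3.9, 1000), (567, "Edward", 2.9, 2000), (678, "Fay", 3.8, 200),
    (789, "Gary", 3.4, 800), (987, "Helen", 3.7, 800), (876, "Irene", 3.9, 400),
    (765, "Jay", 2.9, 1500), (654, "Amy", 3.9, 1000), (543, "Craig", 3.4, 2000),
]
APPLICATIONS = [
    (123, "Stanford", "CS", "Y"), (123, "Stanford", "EE", "N"),
    (123, "Berkeley", "CS", "Y"), (123, "Cornell", "EE", "Y"),
    (234, "Berkeley", "biology", "N"), (345, "MIT", "bioengineering", "Y"),
    (345, "Cornell", "bioengineering", "N"), (345, "Cornell", "CS", "Y"),
    (345, "Cornell", "EE", "N"), (678, "Stanford", "history", "Y"),
    (987, "Stanford", "CS", "Y"), (987, "Berkeley", "CS", "Y"),
    (876, "Stanford", "CS", "N"), (876, "MIT", "biology", "Y"),
    (876, "MIT", "marine biology", "N"), (765, "Stanford", "history", "Y"),
    (765, "Cornell", "history", "N"), (765, "Cornell", "psychology", "Y"),
    (543, "MIT", "CS", "N"),
]

# Each query computes "average GPA of CS applicants" in a different way.
QUERIES = {
    # One row per student, however many CS applications they made.
    "subquery": "SELECT avg(GPA) FROM Student "
                "WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')",
    # One row per CS application, so multi-application students repeat.
    "join": "SELECT avg(GPA) FROM Student JOIN Apply ON Student.sID = Apply.sID "
            "WHERE Apply.major = 'CS'",
    # DISTINCT on the GPA value collapses different students with equal GPAs.
    "join_distinct": "SELECT avg(DISTINCT GPA) FROM Student "
                     "JOIN Apply ON Student.sID = Apply.sID "
                     "WHERE Apply.major = 'CS'",
    # Illustrative alternative: deduplicate applicant IDs before joining.
    "join_dedup": "SELECT avg(GPA) FROM Student JOIN "
                  "(SELECT DISTINCT sID FROM Apply WHERE major = 'CS') AS cs "
                  "ON Student.sID = cs.sID",
}


def build_db():
    """Create an in-memory SQLite database holding both sample tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Student (sID INTEGER, sName TEXT, GPA REAL, sizeHS INTEGER)")
    conn.execute("CREATE TABLE Apply (sID INTEGER, cName TEXT, major TEXT, decision TEXT)")
    conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", STUDENTS)
    conn.executemany("INSERT INTO Apply VALUES (?, ?, ?, ?)", APPLICATIONS)
    return conn


def averages():
    """Run every query and return a dict mapping query name to its average."""
    conn = build_db()
    try:
        return {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}
    finally:
        conn.close()


if __name__ == "__main__":
    for name, value in averages().items():
        print(f"{name}: {value:.4f}")
```

    Under these assumptions the sub-query average comes out at about 3.68 and the plain join at about 3.714, matching the figures above, while the distinct variant drops to 3.625 because it keeps a single 3.9. The deduplicated join agrees with the sub-query, but only because it relies on a derived table to remove the repeated applicant IDs first.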

    It seems that in some cases a sub-query is the safest way to go, performance considerations aside.
