Performance implications of allowing alias to be used in HAVING clause

Question

I made a bit of a fool of myself earlier today on this question. The question concerned SQL Server, and the correct answer involved adding a HAVING clause. My initial mistake was to think that an alias defined in the SELECT list could be used in the HAVING clause, which SQL Server does not allow. I made this error because I assumed that SQL Server follows the same rules as MySQL, which does allow an alias in the HAVING clause.

This got me curious, and I poked around on Stack Overflow and elsewhere, finding a lot of material explaining why each of the two RDBMSs enforces its rule. But nowhere did I find an explanation of the performance implications of allowing or disallowing an alias in the HAVING clause.

To give a concrete example, here is the query from the above-mentioned question:

SELECT students.camID, campus.camName, COUNT(students.stuID) as studentCount
FROM students
JOIN campus
    ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING COUNT(students.stuID) > 3
ORDER BY studentCount

What would be the performance implications of using an alias in the HAVING clause instead of respecifying the COUNT? This question can be answered directly for MySQL, and hopefully someone can give insight into what would happen in SQL Server if it were to support the alias in the HAVING clause.

This is a rare instance where it might be OK to tag a SQL question with both MySQL and SQL Server, so enjoy this moment in the sun.

Answer 1:

This answer focuses narrowly on that particular query, using the sample data loaded below. It also covers a few other queries, such as the count(distinct ...) mentioned by others.

The alias in the HAVING appears to either slightly outperform or quite a bit outperform its alternative (depending on the query).

This uses a pre-existing table with about 5 million rows, created quickly via this answer of mine; the load takes 3 to 5 minutes.

Resulting structure:

CREATE TABLE `ratings` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `thing` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;

The only difference is that this table uses InnoDB instead. That creates the expected InnoDB gap anomaly due to the range-reservation inserts, which is worth noting but makes no difference here. The table holds 4.7 million rows.

Modify the table to get near Tim's assumed schema.

rename table ratings to students; -- not exactly instantaneous (a COPY)
alter table students add column camId int; -- get it near Tim's schema
-- don't add the `camId` index yet

The following will take a while. Run it again and again in chunks, or else your connection may time out. A timeout would come from updating 5 million rows in a single statement without a LIMIT clause, which is why the statement below has one.

So we do it in half-million-row iterations. Each pass sets the column to a random integer between 1 and 20.

update students set camId=floor(rand()*20+1) where camId is null limit 500000; -- well that took a while (no surprise)

Keep running the above until no camId is null.
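
One way to tell when to stop, as a small sketch that uses only the students table and camId column from above (the remaining alias is just illustrative): count the rows still waiting for a value, and rerun the update until the count reaches zero.

```sql
-- Rows that still need a random camId.
-- Rerun the UPDATE above until this returns 0.
SELECT COUNT(*) AS remaining
FROM students
WHERE camId IS NULL;
```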

I ran it about 10 times (the whole thing takes 7 to 10 minutes).

select camId,count(*) from students
group by camId order by 1 ;

1   235641
2   236060
3   236249
4   235736
5   236333
6   235540
7   235870
8   236815
9   235950
10  235594
11  236504
12  236483
13  235656
14  236264
15  236050
16  236176
17  236097
18  235239
19  235556
20  234779

select count(*) from students;
-- 4.7 Million rows

Create a useful index (after the inserts of course).

create index `ix_stu_cam` on students(camId); -- takes 45 seconds

ANALYZE TABLE students; -- update the stats: http://dev.mysql.com/doc/refman/5.7/en/analyze-table.html
-- the above is fine, takes 1 second

Create the campus table.

create table campus
(   camID int auto_increment primary key,
    camName varchar(100) not null
);
insert campus(camName) values
('one'),('2'),('3'),('4'),('5'),
('6'),('7'),('8'),('9'),('ten'),
('etc'),('etc'),('etc'),('etc'),('etc'),
('etc'),('etc'),('etc'),('etc'),('twenty');
-- ok 20 of them

Run the two queries:

SELECT students.camID, campus.camName, COUNT(students.id) as studentCount 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING COUNT(students.id) > 3 
ORDER BY studentCount; 
-- run it many many times, back to back, 5.50 seconds, 20 rows of output

and

SELECT students.camID, campus.camName, COUNT(students.id) as studentCount 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING studentCount > 3 
ORDER BY studentCount; 
-- run it many many times, back to back, 5.50 seconds, 20 rows of output

So the times are identical. I ran each a dozen times.
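
For anyone repeating the timings, one possible way to collect them within a single session is MySQL's session profiling. This is only a sketch: it assumes a MySQL version where SHOW PROFILES is still available (the feature is deprecated), and it is not necessarily how the numbers above were measured.

```sql
SET profiling_history_size = 50;  -- keep more than the default 15 statements
SET profiling = 1;                -- start recording statement durations for this session

-- ... run each of the two queries above several times here ...

SHOW PROFILES;                    -- lists Query_ID, Duration (in seconds) and the statement text
SET profiling = 0;                -- stop recording
```

Comparing the Duration column across repeated runs gives a rough average for each variant.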

The EXPLAIN output is the same for both:

+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
| id | select_type | table    | type | possible_keys | key        | key_len | ref                  | rows   | Extra                           |
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
|  1 | SIMPLE      | campus   | ALL  | PRIMARY       | NULL       | NULL    | NULL                 |     20 | Using temporary; Using filesort |
|  1 | SIMPLE      | students | ref  | ix_stu_cam    | ix_stu_cam | 5       | bigtest.campus.camID | 123766 | Using index                     |
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+

Using the AVG() function, I am getting about a 12% increase in performance with the alias in the having (with identical EXPLAIN output) from the following two queries.

SELECT students.camID, campus.camName, avg(students.id) as studentAvg 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING avg(students.id) > 2200000 
ORDER BY students.camID; 
-- avg time 7.5

and

SELECT students.camID, campus.camName, avg(students.id) as studentAvg 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID, campus.camName 
HAVING studentAvg > 2200000
ORDER BY students.camID;
-- avg time 6.5

And lastly, the DISTINCT:

SELECT students.camID, count(distinct students.id) as studentDistinct 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID 
HAVING count(distinct students.id) > 1000000 
ORDER BY students.camID; -- 10.6   10.84   12.1   11.49   10.1   9.97   10.27   11.53   9.84 9.98
-- 9.9

SELECT students.camID, count(distinct students.id) as studentDistinct 
FROM students 
JOIN campus 
    ON campus.camID = students.camID 
GROUP BY students.camID 
HAVING studentDistinct > 1000000 
ORDER BY students.camID; -- 6.81    6.55   6.75   6.31   7.11 6.36   6.55
-- 6.45

The alias in the HAVING consistently runs about 35% faster, with the same EXPLAIN output (shown below). So twice now, identical EXPLAIN output has not meant identical performance; treat EXPLAIN as a general clue only.

+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
| id | select_type | table    | type  | possible_keys | key        | key_len | ref                  | rows   | Extra                                        |
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | campus   | index | PRIMARY       | PRIMARY    | 4       | NULL                 |     20 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | students | ref   | ix_stu_cam    | ix_stu_cam | 5       | bigtest.campus.camID | 123766 | Using index                                  |
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+

The optimizer appears to favor the alias in the HAVING at the moment, especially for the DISTINCT.

Answer 2:

This is too long for a comment.

I don't think there really are any performance implications, unless the expression in the having clause contains complicated processing (say, count(distinct) or a complex function, such as string processing on a long string).

I'm almost certain that MySQL will perform the aggregation function twice if it is mentioned twice in the query. I am not sure whether SQL Server will optimize away the second reference, but I would guess not (SQL Server has a good optimizer, but it isn't so good at common subexpression elimination).

The question then is the complexity of the expression. Simple expressions such as count() and sum() really don't incur much additional overhead -- once the aggregation is already being done. Complex expressions might start to get expensive.

If you have a complex expression in SQL Server, you should be able to guarantee that it is evaluated only once by using a subquery.
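
A minimal sketch of that subquery approach, applied to the question's query (the derived-table alias t is made up for illustration). Whether a given optimizer really computes the count only once is the claim made above, not something this snippet demonstrates.

```sql
-- The aggregate is written once, inside the derived table.
-- The outer query filters and sorts on its alias.
SELECT t.camID, t.camName, t.studentCount
FROM (
    SELECT students.camID, campus.camName,
           COUNT(students.stuID) AS studentCount
    FROM students
    JOIN campus
        ON campus.camID = students.camID
    GROUP BY students.camID, campus.camName
) AS t
WHERE t.studentCount > 3      -- plays the role of the HAVING filter
ORDER BY t.studentCount;
```

This form is accepted by both SQL Server and MySQL, since the alias is only referenced outside the query block that defines it.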

Answer 3:

I was expecting the SQL to be processed in the order FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.
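
To make that order concrete, here is the question's query annotated with the logical step each clause belongs to. This is a sketch of the logical order only; the physical execution plan an engine chooses may differ.

```sql
SELECT students.camID, campus.camName,
       COUNT(students.stuID) AS studentCount   -- 5. SELECT: the alias is defined here
FROM students                                  -- 1. FROM (including the JOIN)
JOIN campus
    ON campus.camID = students.camID
                                               -- 2. WHERE (none in this query)
GROUP BY students.camID, campus.camName        -- 3. GROUP BY
HAVING COUNT(students.stuID) > 3               -- 4. HAVING: evaluated before the alias exists
ORDER BY studentCount;                         -- 6. ORDER BY: the alias is visible by now
```

Under this order, writing HAVING studentCount > 3 refers to a name that has not been defined yet, which is why SQL Server rejects it with an invalid column name error while still accepting the alias in ORDER BY.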

I am not a MySQL expert, but I found this reasoning in the MySQL documentation on why it is legal:

MySQL extends the standard SQL use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Result set sorting occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses.

A similar MySQL extension applies to the HAVING clause. In standard SQL, a query cannot refer to nonaggregated columns in the HAVING clause that are not named in the GROUP BY clause. To simplify calculations, a MySQL extension permits references to such columns. This extension assumes that the nongrouped columns have the same group-wise values. Otherwise, the result is indeterminate.
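
A small illustration of the extension described in the quoted passage, as a sketch against the students table from Answer 1. It assumes the ONLY_FULL_GROUP_BY SQL mode is not enabled; with that mode on (the default in recent MySQL versions), MySQL rejects both statements.

```sql
-- id is neither aggregated nor named in GROUP BY, so MySQL may return the id
-- of any row in each camId group; the value is indeterminate.
SELECT camId, id
FROM students
GROUP BY camId;

-- The same extension in HAVING: the filter compares whichever id the server
-- happened to pick for each group, so the result is indeterminate as well.
SELECT camId, id, COUNT(*) AS studentCount
FROM students
GROUP BY camId
HAVING id > 1000;
```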

On the performance impact, I assume that the aliased HAVING will be slower than the unaliased HAVING, since the filter has to be applied after all the execution. I will wait for the experts to comment.

Source: https://stackoverflow.com/questions/38187913/performance-implications-of-allowing-alias-to-be-used-in-having-clause

Tags

mysql

sql

sql-server

having