My lecturer stated:
All column names in SELECT list must appear in GROUP BY clause unless name is used only in an aggregate function
I\'m ju
That's historically true. Omitting unaggregated columns leads to indeterminate behavior. SQL aims at fully determinate behavior.
But SQL standards have recently changed to let you omit from the GROUP BY clause columns that are functionally dependent on columns that are in the GROUP BY. PostgreSQL follows the more recent SQL standards. (It's not the only one.) Behavior is still fully determinate.
create table a (
a_id integer primary key,
xfr_date date not null
);
create table b (
a_id integer not null references a (a_id),
recd_date date not null,
units_recd integer not null
check (units_recd >= 0),
primary key (a_id, recd_date)
);
select a.a_id, a.xfr_date, sum(b.units_recd)
from a
inner join b on a.a_id = b.a_id
group by a.a_id; -- The column a.xfr_date is functionally dependent
-- on a.a_id; it doesn't have to appear in the
-- GROUP BY clause.
The notable deviation from from SQL standards is MySQL. It lets you omit just about everything from the GROUP BY. But that design choice makes its behavior indeterminate when you omit columns that are in the SELECT list.
This is an answer for Michael Will's example/question.
SELECT
e3.user_id,
MAX(e3.purchased) AS purchased,
e3.spent,
e.event
FROM
events e
INNER JOIN
(SELECT
e1.user_id AS user_id,
MIN(e1.created_at) as spent,
e2.created_at as purchased
FROM
events e1
INNER JOIN
(SELECT e.user_id, e.created_at from events e WHERE e.event = 'a') e2
ON e1.user_id = e2.user_id
AND e1.created_at >= e2.created_at
AND e1.event != 'a'
GROUP BY e1.User_ID, e2.created_at
) e3
ON e.user_id = e3.user_id AND e.created_at = e3.spent
GROUP BY e3.user_id, e3.spent, e.event;
There are exceptions as noted by Sam Saffron but generally what your lecturer said is true.
If I select 3 columns and group by 2 what should the RDBMS do with the 3rd column?
The developers of the RDBMS may make a decision of how to handle the extra colum (as it appears MySQL's developers have) but is it the decision I would have made or the one I want when writing the select? Will the decision always be valid? I certainly prefer the Oracle-like approach of forcing me to explicitly state what should happen.
If I select 3 columns and group by 2 should the RDBS group by all 3, pick a random value from the 3rd, the biggest or littlest, the most common?
Imagine the following:
A B C
Cat 10 False
Dog 25 True
Dog 20 False
Cat 5 False
If you select A, B and Group By Only A - what would your output be? You'd only have two rows (or tuples) because you have two values for A - but how does it display B?
If you group by A, B, you'd get four rows, no problems there. If you group by A and perform a function on B - like SUM(B) then you get two rows again:
Cat 15
Dog 45
But if you select A, B and only group by A - it doesn't know what to do. Truthfully, I believe there are some databases out there that will select a random value for B in that case and I believe there are some that will give you an error message.
So the simple answer is: It depends. Mysql allows it, vertica doesn't.
There is actually a valid use case for omitting and that is when you are already selecting say with MIN().
Here is an actual example for event tracking. Imaging you have credit and purchase events.
For simplicity we say a=credit, b,c,d are some kind of purchase event, and time is tracked with a running number. Now you want to find the date of the first purchase after each credit. We also happen to have only one customer 0:
create table events (user_id int ,created_at int, event varchar(255));
insert into events values (0,0, 'a');
insert into events values (0,1, 'b');
insert into events values (0,2, 'c');
insert into events values (0,3, 'a');
insert into events values (0,4, 'c');
insert into events values (0,5, 'b');
insert into events values (0,6, 'a');
insert into events values (0,7, 'a');
insert into events values (0,8, 'd');
mysql> SELECT user_id, MAX(purchased) AS purchased, spent, event FROM (SELECT e1.User_ID AS user_id, e1.created_at AS purchased, MIN(e2.created_at) AS spent, e2.event AS event FROM events e1, events e2 WHERE e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND e1.Event = 'a' AND e2.Event != 'a' GROUP BY e1.user_id, e1.created_at) e3 GROUP BY user_id, spent;
+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
| 0 | 0 | 1 | b |
| 0 | 3 | 4 | c |
| 0 | 7 | 8 | d |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)
looks good in mysql, does not work in vertica:
ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function
if I omit the event column, it works in both, but I do actually want to know what specific value event had for the row that min selected.
So my answer ends with a request for comment :) Any ideas?
If you are grouping on something you cannot see the individual values of non-grouped columns because there may be more than one value within each group. All you can do is report on aggregate functions (sum, count, min & etc) -- these are able to combine the multiple values into a single cell in the result.