including missing (zero-count) rows when using GROUP BY

Question

I have an application that receives SMS messages. I want to build a statistic in MySQL that counts the messages received in each hour. For example, at 7 am I received 10 SMS messages, at 8 am I received 20, etc. My table has these columns: ID, smsText, smsDate ... (the others are not important). When I run this query:

SELECT HOUR(smsDate), COUNT(ID) FROM SMS_MESSAGES GROUP BY HOUR(smsDate)

it shows how many messages I received in each hour. The problem is that when I don't receive any messages in some hour, for example at 5 pm, the statement doesn't return a row for hour 17 with count 0, and I get a result like this:

Hour Count
...
15 10
16 5
18 2
...

and what I want to get is this:

Hour Count
...
15 10
16 5
17 0
18 2
...

I searched the web for a solution and found something involving UNION, but I don't understand how to apply it to my query. I hope someone can help me.

Answer 1:

You could create a table with all hours and join the tables:

CREATE TABLE IF NOT EXISTS `hours` (
  `hour` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `hours` (`hour`) VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13), (14), (15), (16), (17), (18), (19), (20), (21), (22), (23);

SELECT hours.hour, count( SMS_MESSAGES.ID ) 
FROM hours
LEFT JOIN SMS_MESSAGES ON ( hours.hour = HOUR( SMS_MESSAGES.smsDate ) ) 
GROUP BY hours.hour
ORDER BY hours.hour;
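
The effect of the LEFT JOIN can be checked with a small self-contained script. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory database rather than MySQL, so strftime('%H', smsDate) stands in for MySQL's HOUR(smsDate), and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SMS_MESSAGES (ID INTEGER PRIMARY KEY, smsText TEXT, smsDate TEXT)"
)
conn.execute("CREATE TABLE hours (hour INTEGER NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO hours (hour) VALUES (?)", [(h,) for h in range(24)])

# Made-up sample data: two messages at 16:xx, none at 17:xx, one at 18:xx.
conn.executemany(
    "INSERT INTO SMS_MESSAGES (smsText, smsDate) VALUES (?, ?)",
    [
        ("a", "2014-06-19 16:05:00"),
        ("b", "2014-06-19 16:40:00"),
        ("c", "2014-06-19 18:10:00"),
    ],
)

# SQLite has no HOUR(); strftime('%H', ...) returns the hour as text, so cast it.
rows = conn.execute(
    """
    SELECT hours.hour, COUNT(SMS_MESSAGES.ID)
      FROM hours
      LEFT JOIN SMS_MESSAGES
        ON hours.hour = CAST(strftime('%H', SMS_MESSAGES.smsDate) AS INTEGER)
     GROUP BY hours.hour
     ORDER BY hours.hour
    """
).fetchall()

counts = dict(rows)
print(counts[16], counts[17], counts[18])
```

Because the query starts from the hours table, all 24 hours appear in the result. COUNT(SMS_MESSAGES.ID) counts only non-NULL IDs, so an hour with no matching message gets 0 rather than being dropped.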

Answer 2:

As hellocode answered, creating a new table that contains the hour values is a good approach. Here is another way to achieve this, using UNION to build the list of hours inline:

select t.`hour`,count(s.ID) from (
select 0 as `hour`
union
select 1 as `hour`
union
select 2 as `hour`
union
.
.
.
select 23 as `hour`
) t
left join SMS_MESSAGES s on(t.`hour` = hour(s.smsDate))
group by t.`hour`

Answer 3:

Observation: HOUR() simply extracts the hour from a timestamp. You may want date and hour in your query. This answer provides date and hour.

You need a way to get a virtual table containing all the hourly timestamps in the appropriate range. You then need to join that table to your aggregate query.

First things first: Here’s a query that will get the timestamps in the range.

SELECT mintime + INTERVAL seq.seq HOUR AS msghour
  FROM (
        SELECT MIN(DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR) AS mintime,
               MAX(DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR) AS maxtime
          FROM SMS_MESSAGES
       ) AS minmax
  JOIN seq_0_to_999999 AS seq ON seq.seq <= TIMESTAMPDIFF(HOUR,mintime,maxtime)

What’s going on here? Three things.

First: DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR converts any arbitrary timestamp into a timestamp at the top of the hour. This lets us fetch the first and last hourly timestamp in your table.

Second, we have a subquery which determines the first and last hour (min and max smsDate) we care about reporting.

Third, we have a table called seq_0_to_999999. It contains a sequence of cardinal numbers: the integers starting at zero. More about this in a moment.

Joining these two tables together, then using the expression

mintime + INTERVAL seq.seq HOUR AS msghour

we can fetch a table that has a continuous run of hourly timestamps.

Then we join that to your query. Here's where it starts to look more complex than it is. The sequence of timestamps has to drive the query, with SMS_MESSAGES attached by a LEFT JOIN; a plain JOIN would drop the hours that have no messages. We're doing this, in outline:

 SELECT sq.msghour, COUNT(ID)
   FROM ( /*the query above with the sequence of timestamps*/) AS sq
   LEFT JOIN SMS_MESSAGES
     ON DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR = sq.msghour
  GROUP BY sq.msghour
  ORDER BY sq.msghour

Putting it all together, it looks like this:

 SELECT sq.msghour, COUNT(ID)
   FROM (
        SELECT mintime + INTERVAL seq.seq HOUR AS msghour
          FROM (
                SELECT MIN(DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR) AS mintime,
                       MAX(DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR) AS maxtime
                  FROM SMS_MESSAGES
               ) AS minmax
          JOIN seq_0_to_999999 AS seq ON seq.seq <= TIMESTAMPDIFF(HOUR,mintime,maxtime)
       ) AS sq
   LEFT JOIN SMS_MESSAGES
     ON DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR = sq.msghour
  GROUP BY sq.msghour
  ORDER BY sq.msghour

That will give you a result set with timestamp and count for every hour in the range.
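
To see the idea in action across a day boundary, here is a sketch that generates every hourly timestamp between the first and last message and outer-joins the messages to it. It is not the MySQL query itself: it runs on Python's built-in sqlite3 module, so strftime('%Y-%m-%d %H:00:00', ...) stands in for DATE(smsDate) + INTERVAL HOUR(smsDate) HOUR, and datetime(..., '+N hours') stands in for + INTERVAL N HOUR. The table seq_0_to_99 and the sample rows are hypothetical, chosen only for this illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SMS_MESSAGES (ID INTEGER PRIMARY KEY, smsDate TEXT)")
conn.executemany(
    "INSERT INTO SMS_MESSAGES (smsDate) VALUES (?)",
    [("2014-06-19 22:15:00",), ("2014-06-19 22:50:00",), ("2014-06-20 01:05:00",)],
)

# Hypothetical small sequence table holding the integers 0..99.
conn.execute("CREATE TABLE seq_0_to_99 (seq INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO seq_0_to_99 (seq) VALUES (?)", [(n,) for n in range(100)])

rows = conn.execute(
    """
    SELECT sq.msghour, COUNT(m.ID)
      FROM (
            -- one row per hour from the first message's hour to the last one's
            SELECT datetime(minmax.mintime, '+' || seq.seq || ' hours') AS msghour
              FROM (
                    SELECT MIN(strftime('%Y-%m-%d %H:00:00', smsDate)) AS mintime,
                           MAX(strftime('%Y-%m-%d %H:00:00', smsDate)) AS maxtime
                      FROM SMS_MESSAGES
                   ) AS minmax
              JOIN seq_0_to_99 AS seq
                ON datetime(minmax.mintime, '+' || seq.seq || ' hours') <= minmax.maxtime
           ) AS sq
      LEFT JOIN SMS_MESSAGES AS m
        ON strftime('%Y-%m-%d %H:00:00', m.smsDate) = sq.msghour
     GROUP BY sq.msghour
     ORDER BY sq.msghour
    """
).fetchall()

for msghour, n in rows:
    print(msghour, n)
```

The hours 23:00 and 00:00 have no messages, yet they show up with a count of 0 because the generated timestamps are on the left side of the outer join.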

Finally, what about this seq_0_to_999999 sequence table? Where do we get those integers starting with zero? The answer is this: we have to arrange to do that; those numbers aren’t built in to MySQL (MariaDB v10+ does have them).

The simple way is to create a table with a whole lot of integers in it. That will take up storage, though, so we'll skip that.

Another way is to create a short table with the integers from 0-9 in it, like so:

DROP TABLE IF EXISTS seq_0_to_9;
CREATE TABLE seq_0_to_9 AS
   SELECT 0 AS seq UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
    UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9;

Then we can create a view that joins that table with itself twice to generate 1000 combinations, like this:

DROP VIEW IF EXISTS seq_0_to_999;
CREATE VIEW seq_0_to_999 AS (
SELECT (a.seq + 10 * (b.seq + 10 * c.seq)) AS seq
  FROM seq_0_to_9 a
  JOIN seq_0_to_9 b
  JOIN seq_0_to_9 c
);

Finally, we can join that table of 1000 numbers with itself to create a view that will generate a million combinations like this:

DROP VIEW IF EXISTS seq_0_to_999999;
CREATE VIEW seq_0_to_999999 AS (
SELECT (a.seq + (1000 * b.seq)) AS seq
  FROM seq_0_to_999 a
  JOIN seq_0_to_999 b
);
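
The digit arithmetic in these views can be sanity-checked with a short script. This is a sketch that uses Python's built-in sqlite3 module instead of MySQL (an assumption made so it runs without a server); the view definition mirrors seq_0_to_999 above, with CROSS JOIN written out explicitly.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq_0_to_9 (seq INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO seq_0_to_9 (seq) VALUES (?)", [(d,) for d in range(10)])

# a supplies the ones digit, b the tens digit, c the hundreds digit.
conn.execute(
    """
    CREATE VIEW seq_0_to_999 AS
    SELECT (a.seq + 10 * (b.seq + 10 * c.seq)) AS seq
      FROM seq_0_to_9 a
      CROSS JOIN seq_0_to_9 b
      CROSS JOIN seq_0_to_9 c
    """
)

values = sorted(row[0] for row in conn.execute("SELECT seq FROM seq_0_to_999"))
print(len(values), values[0], values[-1])
```

Each of the 10 x 10 x 10 digit combinations maps to a distinct three-digit number, so the view yields every integer from 0 to 999 exactly once. The million-row view works the same way, with the two 0-999 views acting as base-1000 digits.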

Here's a writeup providing more information about all this. http://www.plumislandmedia.net/mysql/filling-missing-data-sequences-cardinal-integers/
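
As an aside that is not part of the answer above: MySQL 8.0 and later support recursive common table expressions, which can generate a run of integers without any helper table or view. The sketch below shows the WITH RECURSIVE form by running it on Python's built-in sqlite3 module (an assumption made so it runs without a server). The CTE name seq and column n are made up for this example. MySQL also caps recursion depth through its cte_max_recursion_depth system variable, so long sequences may need that raised.

```python
import sqlite3

# In-memory SQLite database; SQLite also accepts WITH RECURSIVE.
conn = sqlite3.connect(":memory:")

rows = conn.execute(
    """
    WITH RECURSIVE seq(n) AS (
        SELECT 0                                  -- anchor row
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 23        -- stop after 23
    )
    SELECT n FROM seq ORDER BY n
    """
).fetchall()

hours = [r[0] for r in rows]
print(hours[0], hours[-1], len(hours))
```

The resulting list of hours can take the place of the hours table or the UNION subquery on the left side of the LEFT JOIN.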

Source: https://stackoverflow.com/questions/24305085/including-missing-zero-count-rows-when-using-group-by

Tags: mysql, count, group-by, aggregate-functions