Fill in missing rows when aggregating over multiple fields in Postgres

戏子无情 提交于 2019-12-04 03:57:04

问题


I am aggregating sales for a set of products per day using Postgres and need to know not just when sales do happen, but also when they do not for further processing.

SELECT 
sd.date, 
COUNT(sd.sale_id) AS sales, 
sd.product

FROM sales_data sd
-- sales per product, per day
GROUP BY sd.product, sd.date
ORDER BY sd.product, sd.date

This produces the following:

    date    | sales |       product                           
------------+-------+-------------------
 2017-08-17 |  10   | soap
 2017-08-19 |   2   | soap
 2017-08-20 |   5   | soap
 2017-08-17 |   2   | shower gel
 2017-08-21 |   1   | shower gel

As you can see - the date ranges per product are not continuous as sales_data just didn't contain any info for these products on some days.

What I'm aiming to do is to add a sales = 0 row for each product that is not sold on any day in a range - for example here, between 2017-08-17 and 2017-08-21 to give something like the the following:

    date    | sales |      product                           
------------+-------+-------------------
 2017-08-17 |  10   | soap
 2017-08-18 |   0   | soap
 2017-08-19 |   2   | soap
 2017-08-20 |   5   | soap
 2017-08-21 |   0   | soap
 2017-08-17 |   2   | shower gel
 2017-08-18 |   0   | shower gel
 2017-08-19 |   0   | shower gel
 2017-08-20 |   0   | shower gel
 2017-08-21 |   1   | shower gel

In a simpler case where there was only a single product, it seems like the solution would be to use generate_series() i.e.:

  • create a full range of dates using generate_series
  • LEFT JOIN the already aggregated sales data onto the date series
  • COALESCE any NULL counts to 0 in the missing rows

The problem I have is that this approach does not seem to work dates repeat in the aggregated data as I'm grouping over not just multiple dates, but multiple products also.

It feels like I should be able to do something cunning with window functions here to solve this e.g. joining onto the full date range over partitions defined by the product name - but I can't see a way of actually getting this to work.


回答1:


You could use:

WITH cte AS (
   SELECT date, s.product
   FROM  ... -- some way to generate date series
   CROSS JOIN (SELECT DISTINCT product FROM sales_data) s
)
SELECT 
    c.date,
    c.product,
    COUNT(sd.sale_id) AS sales
FROM cte c
LEFT JOIN sales_data sd
  ON c.date = sd.date AND c.product= sd.product
GROUP BY c.date, c.product
ORDER BY c.date, c.product;

First create Cartesian product of dates and products, then LEFT JOIN to actual data and do calculations.


Oracle has great feature for this scenarios called Partitioned Outer Joins:

SELECT times.time_id, product, quantity 
FROM inventory  PARTITION BY  (product) 
RIGHT OUTER JOIN times ON (times.time_id = inventory.time_id) 
WHERE times.time_id BETWEEN TO_DATE('01/04/01', 'DD/MM/YY') 
      AND TO_DATE('06/04/01', 'DD/MM/YY') 
ORDER BY  2,1; 



回答2:


select 
    date, 
    count(sale_id) as sales, 
    product
from
    sales_data
    right join (
        (
            select d::date as date
            from generate_series (
                (select min(date) from sales_data),
                (select max(date) from sales_data),
                '1 day'
            ) gs (d)
        ) gs
        cross join
        (select distinct product from sales_data) p
    ) cj using (product, date)
group by product, date
order by product, date


来源:https://stackoverflow.com/questions/45738055/fill-in-missing-rows-when-aggregating-over-multiple-fields-in-postgres

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!