Average stock history table

时光说笑 2021-01-01 06:27

I have a table that tracks changes in stocks through time for some stores and products. The value is the absolute stock, but we only insert a new row when a change in stock

3 Answers
  • 2021-01-01 06:29

    This answer assumes you want the average over days, so that each day counts as one row. Other SQL engines can handle this in row form, but here it was easier to take the average apart (Sum(value)/Count(value)) and weight each value by the number of days it was in effect. Using your table format and this goal, I came up with this solution (SQLFiddle):

    select store_id, product_id, CASE WHEN sum(nextdate-date) > 0 THEN sum(Value*(nextdate-date)) / sum(nextdate-date) END as Avg_Value
    from (
      select *
          , (
            select value
            from stocks b
            where a.store_id = b.store_id
              and a.product_id = b.product_id
              and a.date >= b.date
            order by b.date desc  -- latest row at or before a.date, i.e. the value in effect then
            limit 1
          )*1.0 "value"
          , coalesce((
            select date
            from stocks b
            where a.store_id = b.store_id
              and a.product_id = b.product_id
              and a.date < b.date
            order by b.date
            limit 1
          ),case when current_date > '2013-01-12' then '2013-01-12' else current_date end) nextdate
      from (
        select store_id, product_id, min(case when date < '2013-01-07' then '2013-01-07' else date end) date
        from stocks z
        where date < '2013-01-12'
        group by store_id, product_id
        ) a
      union all
      select store_id, product_id, date, value*1.0 "value"
        , coalesce((
          select date
          from stocks b
          where a.store_id = b.store_id
            and a.product_id = b.product_id
            and a.date < b.date
          order by b.date
          limit 1
        ),case when current_date > '2013-01-12' then '2013-01-12' else current_date end) nextdate
      from stocks a
      where a.date between '2013-01-07' and '2013-01-12'
    ) t
    group by store_id, product_id
    ;
    

    The first part of the union looks up the entry for each store/product at or before the start parameter ('2013-01-07') and swaps in the parameter as the date if it is later than the table's recorded date. It selects the value for that early entry and the date of the first change in the table after the start parameter, falling back to the end parameter ('2013-01-12') or the current date, whichever is earlier, when there is no later change. The second part of the union grabs all changes between the two parameters, each with the date of its next change, using the same fallback. Finally, the calculation runs on the combined results: each value is multiplied by its date difference, the products are summed, and the total is divided by the sum of days between the dates. Since all dates are constrained in the query, the average is taken over the exact window that is passed in as parameters.
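
    The weighting described above can be sketched outside SQL. This is only an illustration of the arithmetic, not a port of the query: it assumes the window is half-open (the end date itself is not counted), that the last value holds until the end of the window, and it uses made-up data for a single store/product pair. The function name time_weighted_average is hypothetical.

```python
from datetime import date


def time_weighted_average(changes, start, end):
    """Average value over the half-open window [start, end).

    `changes` is a list of (day, value) pairs sorted by day. Each value
    is assumed to hold from its day until the next change.
    """
    weighted_sum = 0.0
    total_days = 0
    for i, (day, value) in enumerate(changes):
        # A value lasts until the next change; the last one lasts to `end`.
        next_day = changes[i + 1][0] if i + 1 < len(changes) else end
        # Clamp the interval to the requested window.
        lo = max(day, start)
        hi = min(next_day, end)
        days = (hi - lo).days
        if days > 0:
            weighted_sum += value * days
            total_days += days
    # No overlap with the window at all: there is nothing to average.
    return weighted_sum / total_days if total_days else None


# Hypothetical change log: stock 10 from Jan 5, 20 from Jan 9, 0 from Jan 11.
changes = [(date(2013, 1, 5), 10), (date(2013, 1, 9), 20), (date(2013, 1, 11), 0)]
avg = time_weighted_average(changes, date(2013, 1, 7), date(2013, 1, 12))
print(avg)  # (10*2 + 20*2 + 0*1) / 5 days
```

    The row from Jan 5 is clamped to the window start, so it contributes only the two days that fall inside the window.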

    I'm not all that up on PostgreSQL, but if you plan to implement this in a function, copying this query and replacing all instances of '2013-01-07' with your start parameter name and all instances of '2013-01-12' with your end parameter name should give you the results you're looking for, for any given date window.

    Edit: If you want an average over a different time unit, simply replace the two instances of nextdate-date with whatever date interval calculation you're looking for. nextdate-date returns the number of days between the two.

  • 2021-01-01 06:48

    The special difficulty of this task: you cannot just pick data points inside your time range, but also have to consider the latest data point before the time range and the earliest data point after it. This varies for every row, and each data point may or may not exist. That requires a sophisticated query and makes it hard to use indexes.

    You could use range types and operators (Postgres 9.2+) to simplify calculations:

    WITH input(a,b) AS (SELECT '2013-01-01'::date  -- your time frame here
                             , '2013-01-15'::date) -- inclusive borders
    SELECT store_id, product_id
         , sum(upper(days) - lower(days))                    AS days_in_range
         , round(sum(value * (upper(days) - lower(days)))::numeric
                        / (SELECT b-a+1 FROM input), 2)      AS your_result
         , round(sum(value * (upper(days) - lower(days)))::numeric
                        / sum(upper(days) - lower(days)), 2) AS my_result
    FROM (
       SELECT store_id, product_id, value, s.day_range * x.day_range AS days
       FROM  (
          SELECT store_id, product_id, value
               , daterange (day, lead(day, 1, now()::date)
                 OVER (PARTITION BY store_id, product_id ORDER BY day)) AS day_range 
          FROM   stock
          ) s
       JOIN  (
          SELECT daterange(a, b+1) AS day_range
          FROM   input
          ) x ON s.day_range && x.day_range
       ) sub
    GROUP  BY 1,2
    ORDER  BY 1,2;
    

    Note, I use the column name day instead of date. I never use basic type names as column names.

    In the subquery s I fetch the day from the next row for each item with the window function lead(), using the built-in option to provide "today" as default where there is no next row, and form a daterange from the two.
    In the subquery sub I match that range against the input with the overlap operator && and compute the resulting date range with the intersection operator *.

    All ranges here are with exclusive upper border. That's why I add one day to the input range. This way we can simply subtract lower(range) from upper(range) to get the number of days.
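
    The effect of the exclusive upper border can be shown with plain date arithmetic. This is a minimal sketch, assuming ranges are represented as (lower, upper) tuples; the helpers intersect and day_count are hypothetical stand-ins for the range operator * and for upper(range) - lower(range), and the stock range is made-up data.

```python
from datetime import date, timedelta


def intersect(a, b):
    """Intersect two half-open date ranges given as (lower, upper) tuples."""
    lower = max(a[0], b[0])
    upper = min(a[1], b[1])
    # An empty intersection means the ranges do not overlap.
    return (lower, upper) if lower < upper else None


def day_count(r):
    """Number of days in a half-open range: upper minus lower."""
    return (r[1] - r[0]).days


# Inclusive input borders become a half-open range by adding one day.
frame_start, frame_end = date(2013, 1, 1), date(2013, 1, 15)
frame = (frame_start, frame_end + timedelta(days=1))

# Hypothetical stock entry valid from Jan 10 until the next change on Jan 20.
stock_range = (date(2013, 1, 10), date(2013, 1, 20))

overlap = intersect(stock_range, frame)
print(day_count(frame), overlap, day_count(overlap))
```

    Without the added day, the frame would count only 14 days and the last day of the input would be lost.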

    I assume that "yesterday" is the latest day with reliable data. "Today" can still change in a real life application. Consequently, I use "today" (now()::date) as exclusive upper border for open ranges.

    I provide two results:

    • your_result agrees with your displayed results.
      You divide by the number of days in your date range unconditionally. For instance, if an item is only listed for the last day, you get a very low (misleading!) "average".

    • my_result computes the same or higher numbers.
      I divide by the actual number of days an item is listed. For instance, if an item is only listed for the last day, I return the listed value as average.

    To make sense of the difference I added the number of days the item was listed: days_in_range
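
    The difference between the two denominators is easy to see with small numbers. The sketch below is only an illustration with made-up data; the function both_averages and its arguments are hypothetical, and each segment stands for one value together with the number of days it overlaps the frame.

```python
def both_averages(segments, frame_days):
    """Return (your_result, my_result, days_in_range).

    `segments` is a non-empty list of (value, days) pairs, where `days`
    is the positive number of days that value was listed inside the frame.
    """
    weighted = sum(value * days for value, days in segments)
    days_in_range = sum(days for _, days in segments)
    your_result = round(weighted / frame_days, 2)    # divide by the whole frame
    my_result = round(weighted / days_in_range, 2)   # divide by days listed
    return your_result, my_result, days_in_range


# An item listed with value 30 only on the last day of a 15-day frame.
late_item = both_averages([(30, 1)], frame_days=15)

# An item listed for the whole frame: both results agree.
full_item = both_averages([(10, 5), (20, 10)], frame_days=15)

print(late_item, full_item)
```

    For the late item the first figure drops to 2.0 although the stock was never below 30 while it was listed, which is the misleading case described above.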

    SQL Fiddle.

    Index and performance

    For this kind of data, old rows typically don't change. This would make an excellent case for a materialized view:

    CREATE MATERIALIZED VIEW mv_stock AS
    SELECT store_id, product_id, value
         , daterange (day, lead(day, 1, now()::date) OVER (PARTITION BY store_id, product_id
                                                           ORDER BY day)) AS day_range
    FROM   stock;
    

    Then you can add a GiST index which supports the relevant operator &&:

    CREATE INDEX mv_stock_range_idx ON mv_stock USING gist (day_range);
    

    Big test case

    I ran a more realistic test with 200k rows. The query using the MV was about 6 times as fast as the query without it, which in turn was ~10x as fast as @Joop's query. Performance heavily depends on data distribution. An MV helps most with big tables and a high frequency of entries. Also, if the table has columns that are not relevant to this query, an MV can be smaller. It is a question of cost vs. gain.

    I've put all solutions posted so far (and adapted) in a big fiddle to play with:

    SQL Fiddle with big test case.
    SQL Fiddle with only 40k rows - to avoid timeout on sqlfiddle.com

  • 2021-01-01 06:48

    This is rather quick and dirty: instead of doing the nasty interval arithmetic, just join to a calendar table and sum over all the days.

    WITH calendar(zdate) AS ( SELECT generate_series('2013-01-01'::date, '2013-01-15'::date, '1 day'::interval)::date )
    SELECT st.store_id,st.product_id
            , SUM(st.zvalue) AS sval
            , COUNT(*) AS nval
            , (SUM(st.zvalue)::decimal(8,2) / COUNT(*) )::decimal(8,2) AS wval
    FROM calendar
    JOIN stocks st ON calendar.zdate >= st.zdate
            AND NOT EXISTS ( -- this calendar entry belongs to the next stocks entry 
                    SELECT * FROM stocks nx
                    WHERE nx.store_id = st.store_id AND nx.product_id = st.product_id
                    AND nx.zdate > st.zdate AND nx.zdate <= calendar.zdate
            )
    GROUP BY st.store_id,st.product_id
    ORDER BY st.store_id,st.product_id
            ;
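
    The same idea can be sketched in a few lines of Python: walk the calendar day by day, look up the value in effect on each day, then sum and count. This is an illustration under assumptions, not a translation of the query: it handles a single store/product pair, uses made-up data, and the function name calendar_average is hypothetical.

```python
from datetime import date, timedelta


def calendar_average(changes, first_day, last_day):
    """Per-day average over the inclusive calendar [first_day, last_day].

    `changes` is a list of (day, value) pairs. Days before the first
    change have no value in effect and are skipped, not counted as zero.
    """
    total = 0
    count = 0
    day = first_day
    while day <= last_day:
        # Candidates are all changes at or before this calendar day.
        candidates = [(d, v) for d, v in changes if d <= day]
        if candidates:
            # The latest of them is the value in effect on this day.
            _, value = max(candidates, key=lambda c: c[0])
            total += value
            count += 1
        day += timedelta(days=1)
    if count == 0:
        return None
    return total, count, round(total / count, 2)


# Hypothetical change log: stock 10 from Jan 5, 20 from Jan 9, 0 from Jan 11.
changes = [(date(2013, 1, 5), 10), (date(2013, 1, 9), 20), (date(2013, 1, 11), 0)]
result = calendar_average(changes, date(2013, 1, 7), date(2013, 1, 11))
print(result)  # (sum of daily values, number of days, average)
```

    The cost is one lookup per calendar day, which is why this approach is simple but does not scale as well as interval arithmetic.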
    