Efficient time series querying in Postgres

逝去的感伤 2020-12-30 13:05

I have a table in my PG db that looks somewhat like this:

id | widget_id | for_date | score |

Each referenced widget has a lot of these ite

4 answers
  • 2020-12-30 13:36

    SQL Fiddle

    select
        widget_id,
        for_date,
        -- keep a real score; otherwise reuse the score that opened the current group c
        case
            when score is not null then score
            else first_value(score) over (partition by widget_id, c order by for_date)
            end score
    from (
        select
            a.widget_id,
            a.for_date,
            s.score,
            -- running count of non-null scores: gap days keep the c of the last real score
            count(score) over(partition by a.widget_id order by a.for_date) c
        from (
            select widget_id, g.d::date for_date
            from (
                select distinct widget_id
                from score
                ) s
                cross join
                generate_series(
                    (select min(for_date) from score),
                    (select max(for_date) from score),
                    '1 day'
                ) g(d)
            ) a
            left join
            score s on a.widget_id = s.widget_id and a.for_date = s.for_date
    ) s
    order by widget_id, for_date
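
The query above relies on a grouping trick that is easy to miss: count(score) counts only non-null values, so the running count c stays flat across a gap, and first_value(score) within each (widget_id, c) group is the score that opened the group. A minimal Python sketch of that logic follows. It is not the query itself; the fill_by_group helper and the in-memory scores dict are hypothetical, with sample values borrowed from the other answers on this page.

```python
from datetime import date, timedelta

# Hypothetical sample rows: (widget_id, for_date) -> score.
scores = {
    (1337, date(2012, 5, 7)): 12,
    (1337, date(2012, 5, 8)): 41,
    (1337, date(2012, 5, 11)): 500,
}

def fill_by_group(widget_id, start, end):
    """Mimic count(score) OVER (...) plus first_value(score) per (widget_id, c) group."""
    rows = []
    c = 0                  # running count of non-null scores (the "c" column)
    first_in_group = None  # first_value(score) inside the current group
    day = start
    while day <= end:
        score = scores.get((widget_id, day))  # None plays the role of NULL
        if score is not None:
            c += 1                  # a real score bumps the count ...
            first_in_group = score  # ... and therefore opens a new group
        rows.append((day, c, score if score is not None else first_in_group))
        day += timedelta(days=1)
    return rows

filled = fill_by_group(1337, date(2012, 5, 7), date(2012, 5, 11))
for day, c, score in filled:
    print(day, c, score)
```

Days before a widget's first score land in group c = 0, where first_value(score) is NULL, so they stay unfilled.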
    
  • 2020-12-30 13:43

    First of all, the generate_series() table expression can be much simpler. This one is equivalent to yours (except for the descending order, which contradicts the rest of your question anyway):

    SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
    

    The type date is coerced to timestamptz automatically on input. The return type is timestamptz either way. I use a subquery below, so I can cast the output to date right away.

    Next, max() used as a window function returns exactly what you need: the highest value since the frame start, ignoring NULL values. Building on that, you get a radically simple query.
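
To make the running-max idea concrete, here is a small Python sketch of what max(for_date) OVER (ORDER BY day) computes once calendar days without a score row carry NULL. The effective_dates helper and the scored_days set are assumptions for illustration only, not part of the answer's SQL.

```python
from datetime import date, timedelta

# Hypothetical: the days on which one widget has a row in score.
scored_days = {date(2012, 5, 7), date(2012, 5, 8), date(2012, 5, 11)}

def effective_dates(start, end):
    """Running max(for_date) over ascending days, ignoring NULLs."""
    result = {}
    latest = None  # max() over a frame holding only NULLs is NULL
    day = start
    while day <= end:
        # A LEFT JOIN miss leaves for_date NULL for this calendar day.
        for_date = day if day in scored_days else None
        if for_date is not None and (latest is None or for_date > latest):
            latest = for_date
        result[day] = latest
        day += timedelta(days=1)
    return result

eff = effective_dates(date(2012, 5, 6), date(2012, 5, 11))
for day, effective in eff.items():
    print(day, effective)
```

Each calendar day maps to the most recent date that actually has a score, which is the key the second join uses to pull in the rest of the row.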

    For a given widget_id

    Most likely faster than involving CROSS JOIN or WITH RECURSIVE:

    SELECT a.day, s.*
    FROM  (
       SELECT d.day
             ,max(s.for_date) OVER (ORDER BY d.day) AS effective_date
       FROM  (
          SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
          ) d(day)
       LEFT   JOIN score s ON s.for_date = d.day
                          AND s.widget_id = 1337 -- "for a given widget_id"
       ) a
    LEFT   JOIN score s ON s.for_date = a.effective_date
                       AND s.widget_id = 1337
    ORDER  BY a.day;
    

    With this query you can put any column from score you like into the final SELECT list. I put s.* for simplicity. Pick your columns.

    If you want to start your output with the first day that actually has a score, simply replace the last LEFT JOIN with JOIN.

    Generic form for all widget_id's

    Here I use a CROSS JOIN to produce a row for every widget on every date:

    SELECT a.day, a.widget_id, s.score
    FROM  (
       SELECT d.day, w.widget_id
             ,max(s.for_date) OVER (PARTITION BY w.widget_id
                                    ORDER BY d.day) AS effective_date
       FROM  (SELECT generate_series('2012-05-05'::date
                                    ,'2012-05-15'::date, '1d')::date AS day) d
       CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
       LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
       ) a
    JOIN  score s ON s.for_date = a.effective_date
                 AND s.widget_id = a.widget_id  -- instead of LEFT JOIN
    ORDER BY a.day, a.widget_id;
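
The generic form combines three steps: a cross join of days and widgets, a per-widget running max of for_date, and an inner join back to score. The Python sketch below mirrors those steps under stated assumptions: the score dict and the fill_all helper are hypothetical stand-ins, and the sample rows are borrowed from another answer on this page.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical stand-in for the score table: (widget_id, for_date) -> score.
score = {
    (1312, date(2012, 5, 7)): 20,
    (1337, date(2012, 5, 7)): 12,
    (1337, date(2012, 5, 8)): 41,
    (1337, date(2012, 5, 11)): 500,
}

def fill_all(start, end):
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    widgets = sorted({w for w, _ in score})     # SELECT DISTINCT widget_id
    latest = {}                                 # running max(for_date), one per partition
    rows = []
    for day, widget in product(days, widgets):  # CROSS JOIN, days ascending
        if (widget, day) in score:
            latest[widget] = day
        effective = latest.get(widget)
        if effective is not None:               # inner JOIN drops days before the first score
            rows.append((day, widget, score[(widget, effective)]))
    return rows

rows = fill_all(date(2012, 5, 5), date(2012, 5, 12))
for row in rows:
    print(row)
```

Because the final join is an inner join, the two days before any widget has a score produce no output rows in this sketch.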
    

  • 2020-12-30 13:45

    Using your table structure, I created the following Recursive CTE which starts with your MIN(For_Date) and increments until it reaches the MAX(For_Date). Not sure if there is a more efficient way, but this appears to work well:

    WITH RECURSIVE nodes_cte(widgetid, for_date, score) AS (
    -- First Widget Using Min Date
     SELECT 
        w.widgetId, 
        w.for_date, 
        w.score
     FROM widgets w 
      INNER JOIN ( 
          SELECT widgetId, Min(for_date) min_for_date
          FROM widgets
          GROUP BY widgetId
       ) minW ON w.widgetId = minW.widgetid 
            AND w.for_date = minW.min_for_date
    UNION ALL
     SELECT 
        n.widgetId,
        n.for_date + 1 for_date,
        coalesce(w.score,n.score) score
     FROM nodes_cte n
      INNER JOIN (
          SELECT widgetId, Max(for_date) max_for_date
          FROM widgets 
          GROUP BY widgetId
       ) maxW ON n.widgetId = maxW.widgetId
      LEFT JOIN widgets w ON n.widgetid = w.widgetid 
        AND n.for_date + 1 = w.for_date
      WHERE n.for_date + 1 <= maxW.max_for_date
    )
    SELECT * 
    FROM nodes_cte 
    ORDER BY for_date
    

    Here is the SQL Fiddle.

    And the returned results (format the date however you'd like):

    WIDGETID   FOR_DATE                     SCORE
    1337       May, 07 2012 00:00:00+0000   12
    1337       May, 08 2012 00:00:00+0000   41
    1337       May, 09 2012 00:00:00+0000   41
    1337       May, 10 2012 00:00:00+0000   41
    1337       May, 11 2012 00:00:00+0000   500
    

    Please note that this assumes your for_date column is a date. If it includes a time, you may need to add interval '1 day' instead of the integer 1 in the query above.
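
For readers less familiar with recursive CTEs, the iteration can be sketched in plain Python: start from each widget's earliest row, step one day at a time, and carry the previous score forward whenever the next day has no row. The walk helper and the widgets dict are hypothetical; the row for widget 1312 is borrowed from another answer to show the stop condition.

```python
from datetime import date, timedelta

# Hypothetical stand-in for the widgets table: (widgetid, for_date) -> score.
widgets = {
    (1312, date(2012, 5, 7)): 20,
    (1337, date(2012, 5, 7)): 12,
    (1337, date(2012, 5, 8)): 41,
    (1337, date(2012, 5, 11)): 500,
}

def walk(widget_id):
    """Step day by day from the widget's first date to its last."""
    days = sorted(d for w, d in widgets if w == widget_id)
    current, last = days[0], days[-1]
    score = widgets[(widget_id, current)]
    rows = [(widget_id, current, score)]           # anchor member: row at MIN(for_date)
    while current + timedelta(days=1) <= last:     # WHERE n.for_date + 1 <= max_for_date
        current += timedelta(days=1)
        found = widgets.get((widget_id, current))  # LEFT JOIN on the next day
        if found is not None:                      # coalesce(w.score, n.score)
            score = found
        rows.append((widget_id, current, score))
    return rows

for row in walk(1337):
    print(row)
```

Since the stop condition uses each widget's own maximum date, a widget with a single row (1312 here) yields just that one row rather than being filled up to the overall last date.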

    Hope this helps.

  • 2020-12-30 13:45

    The data:

    DROP SCHEMA IF EXISTS tmp CASCADE;
    CREATE SCHEMA tmp ;
    SET search_path=tmp;
    
    CREATE TABLE widget
            ( widget_id INTEGER NOT NULL
            , for_date DATE NOT NULL
            , score INTEGER
             , PRIMARY KEY (widget_id,for_date)
            );
    INSERT INTO widget(widget_id , for_date , score) VALUES
     (1312, '2012-05-07', 20)
    , (1337, '2012-05-07', 12)
    , (1337, '2012-05-08', 41)
    , (1337, '2012-05-11', 500)
            ;
    

    The query:

    SELECT w.widget_id AS widget_id
            , cal::date AS for_date
            -- , w.for_date AS org_date
            , w.score AS score
    FROM generate_series( '2012-05-07'::timestamp , '2012-05-11'::timestamp
                     , '1day'::interval) AS cal
            -- "half cartesian" Join;
            -- will be restricted by the NOT EXISTS() below
    LEFT JOIN widget w ON w.for_date <= cal
    WHERE NOT EXISTS (
            SELECT * FROM widget nx
            WHERE nx.widget_id = w.widget_id
            AND nx.for_date <= cal
            AND nx.for_date > w.for_date
            )
    ORDER BY cal, w.widget_id
            ;
    

    The result:

     widget_id |  for_date  | score 
    -----------+------------+-------
          1312 | 2012-05-07 |    20
          1337 | 2012-05-07 |    12
          1312 | 2012-05-08 |    20
          1337 | 2012-05-08 |    41
          1312 | 2012-05-09 |    20
          1337 | 2012-05-09 |    41
          1312 | 2012-05-10 |    20
          1337 | 2012-05-10 |    41
          1312 | 2012-05-11 |    20
          1337 | 2012-05-11 |   500
    (10 rows)
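
The NOT EXISTS clause is what turns the "half cartesian" join into "the latest row on or before each calendar day". A rough Python equivalent, assuming the same four sample rows, is sketched below; the rows_for helper is a made-up name, and the NULL row that the LEFT JOIN would return for a day preceding all data is left out.

```python
from datetime import date, timedelta

# The sample data from above: (widget_id, for_date) -> score.
widget = {
    (1312, date(2012, 5, 7)): 20,
    (1337, date(2012, 5, 7)): 12,
    (1337, date(2012, 5, 8)): 41,
    (1337, date(2012, 5, 11)): 500,
}

def rows_for(cal):
    """Rows that survive the join and the NOT EXISTS filter for one calendar day."""
    kept = []
    for (wid, for_date), score in widget.items():
        if for_date > cal:          # join condition: w.for_date <= cal
            continue
        has_newer = any(            # the NOT EXISTS subquery
            nx_wid == wid and for_date < nx_date <= cal
            for (nx_wid, nx_date) in widget
        )
        if not has_newer:
            kept.append((wid, cal, score))
    return sorted(kept)

start = date(2012, 5, 7)
result = [row for i in range(5) for row in rows_for(start + timedelta(days=i))]
for row in result:
    print(row)
```

The join keeps every earlier row per day, and the anti-join then discards all but the most recent one for each widget.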
    