PostgreSQL window function: partition by comparison

无人共我 2020-12-09 14:20

I'm trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query.

Imagine I have the short

1 answer
  • 2020-12-09 14:54

    Using several different window functions and two subqueries, this should work decently fast:

    WITH events(id, event, ts) AS (
      VALUES
       (1, 12, '2014-03-19 08:00:00'::timestamp)
      ,(2, 12, '2014-03-19 08:30:00')
      ,(3, 13, '2014-03-19 09:00:00')
      ,(4, 13, '2014-03-19 09:30:00')
      ,(5, 12, '2014-03-19 10:00:00')
       )
    SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
         , id, ts
         , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
    FROM  (
       SELECT *, count(step) OVER w AS grp
       FROM  (
          SELECT id, ts
               , NULLIF(lag(event) OVER w, event) AS step
               , lag(id)  OVER w AS pre_id
               , lead(id) OVER w AS post_id
          FROM   events
          WINDOW w AS (ORDER BY ts)
          ) sub1
       WINDOW w AS (ORDER BY ts)
       ) sub2
    ORDER  BY ts;
    

    I use ts as the name of the timestamp column.
    I assume ts is unique and indexed (a unique constraint creates the index automatically).
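
    As a minimal sketch of that assumption, a table along these lines would satisfy it. The column types and the primary key are guesses for illustration; the original post does not show the table definition.

    ```sql
    -- Hypothetical definition of the events table; the types are assumptions.
    CREATE TABLE events (
      id    integer   PRIMARY KEY,
      event integer   NOT NULL,
      ts    timestamp NOT NULL UNIQUE  -- UNIQUE makes PostgreSQL create a unique B-tree index on ts
    );
    ```

    With that index in place, the planner can read rows in ts order from the index instead of sorting the whole table.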

    In a test on a real-life table with 50k rows, it needed only a single index scan, so it should be decently fast even with big tables. In comparison, your query with join / distinct did not finish within a minute (as expected).
    Even an optimized version, dealing with one cross join at a time (a left join with hardly any limiting condition is effectively a limited cross join), did not finish within a minute.

    For best performance with a big table, tune your memory settings, in particular work_mem (used for big sort operations). Consider setting it (much) higher for your session temporarily if you can spare the RAM.
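
    A session-level change could look like the sketch below. The value 256MB is an arbitrary illustration, not a recommendation; pick something that fits the RAM you can spare.

    ```sql
    SHOW work_mem;            -- check the current value (the PostgreSQL default is 4MB)
    SET work_mem = '256MB';   -- illustrative value only; applies to the current session
    -- ... run the window-function query here ...
    RESET work_mem;           -- return to the configured default
    ```

    SET affects only the current session, so other connections keep the configured value.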

    How?

    1. In subquery sub1, look at the event from the previous row and keep it only if it differs from the current event (NULLIF returns NULL when both are equal). A non-null step thus marks the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).

    2. In subquery sub2, count() only counts non-null values, so the running count increases by one at every change of event. The resulting grp marks peers in blocks of consecutive same events.

    3. In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
      Actually, this should be even faster in the outer SELECT:

       last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                                 RANGE BETWEEN UNBOUNDED PRECEDING
                                       AND     UNBOUNDED FOLLOWING) AS post_id
      

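      ... since the sort order of this window agrees with the window for pre_id, only a single sort is needed. A quick test seems to confirm it. The explicit frame is required because the default frame ends at the current row, which would make last_value() return the current row's own post_id.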
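
    The three steps can be traced on the sample data with a sketch like the following. It is not the query above verbatim: it passes event, step and grp through to the output so the intermediate values are visible, and it computes both pre_id and post_id over one shared named window (w_grp, a name chosen here) with the full-partition frame. The commented result was worked out by hand from the five sample rows.

    ```sql
    WITH events(id, event, ts) AS (
      VALUES
       (1, 12, '2014-03-19 08:00:00'::timestamp)
      ,(2, 12, '2014-03-19 08:30:00')
      ,(3, 13, '2014-03-19 09:00:00')
      ,(4, 13, '2014-03-19 09:30:00')
      ,(5, 12, '2014-03-19 10:00:00')
       )
    SELECT id, event
         , step                                         -- step 1: previous event, only where it changed
         , grp                                          -- step 2: running count of changes
         , first_value(pre_id) OVER w_grp AS pre_id     -- step 3: first pre_id per group
         , last_value(post_id) OVER w_grp AS post_id    -- step 3: last post_id per group
    FROM  (
       SELECT *, count(step) OVER w AS grp
       FROM  (
          SELECT id, event, ts
               , NULLIF(lag(event) OVER w, event) AS step
               , lag(id)  OVER w AS pre_id
               , lead(id) OVER w AS post_id
          FROM   events
          WINDOW w AS (ORDER BY ts)
          ) sub1
       WINDOW w AS (ORDER BY ts)
       ) sub2
    WINDOW w_grp AS (PARTITION BY grp ORDER BY ts
                     RANGE BETWEEN UNBOUNDED PRECEDING
                           AND     UNBOUNDED FOLLOWING)
    ORDER  BY ts;

    --  id | event | step | grp | pre_id | post_id
    -- ----+-------+------+-----+--------+---------
    --   1 |    12 | NULL |   0 |   NULL |       3
    --   2 |    12 | NULL |   0 |   NULL |       3
    --   3 |    13 |   12 |   1 |      2 |       5
    --   4 |    13 | NULL |   1 |      2 |       5
    --   5 |    12 |   13 |   2 |      4 |    NULL
    ```

    Rows 1 and 2 share grp 0, rows 3 and 4 share grp 1, and row 5 starts grp 2, so each row ends up with the id just before and just after its block of consecutive equal events.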

