PostgreSQL window function: partition by comparison

Submitted by 故事扮演 on 2019-11-28 08:49:26
Erwin Brandstetter

Using several different window functions and two subqueries, this should work decently fast:

WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER  BY ts;

This uses ts as the name of the timestamp column.
It assumes ts is unique and indexed (a unique constraint creates the index automatically).
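
As an illustration of that assumption, a table definition along the following lines would give ts a unique index. This is only a sketch: the column types and the constraint name events_ts_uni are assumptions, not taken from the question.

```sql
-- Hypothetical table layout; only the column names id, event, ts come from the query above.
CREATE TABLE events (
   id    serial PRIMARY KEY
 , event integer   NOT NULL
 , ts    timestamp NOT NULL
 , CONSTRAINT events_ts_uni UNIQUE (ts)  -- the unique constraint creates a unique index on (ts)
);
```

To run the query against such a table, drop the WITH events ... AS (VALUES ...) part, because a CTE named events would hide the table of the same name.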

In a test on a real-life table with 50k rows, the query needed only a single index scan, so it should be decently fast even with big tables. In comparison, your query with join / distinct did not finish after a minute (as expected).
Even an optimized version, dealing with one cross join at a time (a left join with hardly any limiting condition is effectively a limited cross join), did not finish after a minute.
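
To check this kind of claim on your own data, one option is to look at the query plan with EXPLAIN. The sketch below assumes a real table named events with columns id, event and ts and an index on ts (a hypothetical setup, not the table from the test above). It covers only the two inner query levels, whose windows are both ordered by ts.

```sql
-- Assumes a table events(id, event, ts) with an index on ts (hypothetical setup).
-- Note: EXPLAIN with ANALYZE actually executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *, count(step) OVER w AS grp
FROM  (
   SELECT id, ts
        , NULLIF(lag(event) OVER w, event) AS step
        , lag(id)  OVER w AS pre_id
        , lead(id) OVER w AS post_id
   FROM   events
   WINDOW w AS (ORDER BY ts)
   ) sub1
WINDOW w AS (ORDER BY ts);
```

If the planner can read the rows in ts order from the index, the plan would typically show an index scan feeding the WindowAgg nodes without a separate Sort node for these levels. The actual plan depends on table statistics and settings.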

For best performance with a big table, tune your memory settings, in particular work_mem (used for big sort operations). Consider setting it (much) higher temporarily for your session if you can spare the RAM.
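
A minimal sketch of raising work_mem temporarily might look like this. The value 256MB is an arbitrary assumption; pick one that fits your available RAM.

```sql
-- Option 1: for the rest of the current session
SHOW work_mem;               -- check the current value first
SET work_mem = '256MB';      -- assumed value, adjust to your RAM
-- ... run the big query here ...
RESET work_mem;              -- back to the configured default

-- Option 2: for a single transaction only
BEGIN;
SET LOCAL work_mem = '256MB';
-- ... run the big query here ...
COMMIT;                      -- the setting reverts at the end of the transaction
```

SET LOCAL is the safer choice when connections are shared through a pool, since the setting cannot leak into later work on the same connection.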

How?

  1. In subquery sub1 look at the event from the previous row and only keep that if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).

  2. In subquery sub2, count() only counts non-null values, so the running count increases exactly at each row where the event changed. The resulting grp marks peers in blocks of consecutive same events.

  3. In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
    Actually, this should be even faster in the outer SELECT:

     last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                               RANGE BETWEEN UNBOUNDED PRECEDING
                                     AND     UNBOUNDED FOLLOWING) AS post_id
    

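    ... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. The explicit frame is required because the default frame ends at the current row, so last_value() would otherwise just return the value of the current row.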
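
The steps above can be followed on the sample data with a variant of the query that also outputs the intermediate columns step and grp. This is a sketch, not a replacement for the query above: the window name w_grp is introduced here for illustration, and it uses the last_value() variant with the full frame for both result columns.

```sql
WITH events(id, event, ts) AS (
   VALUES
    (1, 12, '2014-03-19 08:00:00'::timestamp)
   ,(2, 12, '2014-03-19 08:30:00')
   ,(3, 13, '2014-03-19 09:00:00')
   ,(4, 13, '2014-03-19 09:30:00')
   ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT id, event
     , step                                   -- previous event, only where it differs
     , grp                                    -- running count of changes = block number
     , first_value(pre_id) OVER w_grp AS pre_id
     , last_value(post_id) OVER w_grp AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp        -- step 2
   FROM  (
      SELECT id, event, ts                    -- step 1
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
WINDOW w_grp AS (PARTITION BY grp ORDER BY ts  -- step 3: one window, one sort
                 RANGE BETWEEN UNBOUNDED PRECEDING
                       AND     UNBOUNDED FOLLOWING)
ORDER  BY ts;

-- Result for the sample data:
--  id | event | step | grp | pre_id | post_id
--   1 |    12 | NULL |   0 |   NULL |       3
--   2 |    12 | NULL |   0 |   NULL |       3
--   3 |    13 |   12 |   1 |      2 |       5
--   4 |    13 | NULL |   1 |      2 |       5
--   5 |    12 |   13 |   2 |      4 |    NULL
```

Rows 3 and 5 are the only ones with a non-null step, which is where grp increases. Within each grp, pre_id comes from the first row of the block and post_id from the last.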

