Calendar tables in PostgreSQL 9

前端 未结 3 1844
囚心锁ツ
囚心锁ツ 2021-01-04 06:07

I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills).

I have come acr

3条回答
  •  悲哀的现实
    2021-01-04 06:47

    Calendar tables implement a space/time tradeoff. By using more space, some kinds of queries run in less time, because they can take advantage of indexes. They're safe as long as you're careful with the CHECK() constraints, and as long as you have administrative processes to take care of any constraints that your dbms doesn't support.

    If your granularity is one minute, you'll need to generate about a half million rows for each year. A minimal calendar table would look like this.

    2011-01-01 00:00:00
    2011-01-01 00:01:00
    2011-01-01 00:02:00
    2011-01-01 00:03:00
    2011-01-01 00:04:00
    

    If you're doing "bucket" analysis, you might be better off with something like this.

    bucket_start         bucket_end
    --
    2011-01-01 00:00:00  2011-01-01 00:01:00
    2011-01-01 00:01:00  2011-01-01 00:02:00
    2011-01-01 00:02:00  2011-01-01 00:03:00
    2011-01-01 00:03:00  2011-01-01 00:04:00
    2011-01-01 00:04:00  2011-01-01 00:05:00
    

    Since SQL's BETWEEN operator includes the endpoints, you usually need to avoid using it. That's because it includes the endpoints, and it's hard to express bucket_end as "bucket_start plus one minute, minus the smallest bit of time this server can recognize". (The danger is a value that's a microsecond greater than bucket_end, but still less than the next value for bucket_start.)

    If I were going to build that table, I'd probably do it like this. (Although I'd think harder about whether I should call it "calendar".)

    create table calendar (
      bucket_start timestamp primary key,
      bucket_end timestamp unique,
      CHECK (bucket_end = bucket_start + interval '1' minute)
      -- You also want a "no gaps" constraint, but I don't think you 
      -- can do that in a CHECK constraint in PostgreSQL. You might
      -- be able to use a trigger that counts the rows, and compares
      -- that count to the number of minutes between min(bucket_start)
      -- and max(bucket_start). Worst case, you can always run a report
      -- that counts the rows and sends you an email.
    );
    

    The UNIQUE constraint creates an implicit index in PostgreSQL.

    This query will insert one day's worth of rows (24 hours * 60 minutes) at a time.

    insert into calendar
    select coalesce(
                    (select max(bucket_start) from calendar), 
                     cast('2011-01-01 00:00:00' as timestamp)
                   ) 
                 + cast((n || 'minute') as interval) as bucket_start, 
           coalesce(
                    (select max(bucket_start) from calendar), 
                     cast('2011-01-01 00:00:00' as timestamp)
                   ) 
                 + cast((n + 1 || ' minute') as interval) as bucket_end
    from generate_series(1, (24*60) ) n;
    

    You can wrap that in a function to generate a year at a time. I'd probably try to commit fewer than a half million rows at a time.

    It shouldn't take too long to generate 20 million rows for testing, and another 20 million rows of "calendar" minutes. Long lunch. Maybe an afternoon in the sun.

提交回复
热议问题