Calendar tables in PostgreSQL 9

前端 未结 3 1843
囚心锁ツ
囚心锁ツ 2021-01-04 06:07

I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills).

I have come acr

相关标签:
3条回答
  • 2021-01-04 06:42

    In PostgreSQL, you can generate calendar tables of arbitrary length and granularity on the fly:

    SELECT  CAST('2011-01-01' AS DATE) + (n || ' hour')::INTERVAL
    FROM    generate_series(0, 23) n
    

    This does not require recursion (as with the other systems) and is a preferred method to generate the volatile resultsets.

    0 讨论(0)
  • 2021-01-04 06:47

    Calendar tables implement a space/time tradeoff. By using more space, some kinds of queries run in less time, because they can take advantage of indexes. They're safe as long as you're careful with the CHECK() constraints, and as long as you have administrative processes to take care of any constraints that your dbms doesn't support.

    If your granularity is one minute, you'll need to generate about a half million rows for each year. A minimal calendar table would look like this.

    2011-01-01 00:00:00
    2011-01-01 00:01:00
    2011-01-01 00:02:00
    2011-01-01 00:03:00
    2011-01-01 00:04:00
    

    If you're doing "bucket" analysis, you might be better off with something like this.

    bucket_start         bucket_end
    --
    2011-01-01 00:00:00  2011-01-01 00:01:00
    2011-01-01 00:01:00  2011-01-01 00:02:00
    2011-01-01 00:02:00  2011-01-01 00:03:00
    2011-01-01 00:03:00  2011-01-01 00:04:00
    2011-01-01 00:04:00  2011-01-01 00:05:00
    

    Since SQL's BETWEEN operator includes the endpoints, you usually need to avoid using it. That's because it includes the endpoints, and it's hard to express bucket_end as "bucket_start plus one minute, minus the smallest bit of time this server can recognize". (The danger is a value that's a microsecond greater than bucket_end, but still less than the next value for bucket_start.)

    If I were going to build that table, I'd probably do it like this. (Although I'd think harder about whether I should call it "calendar".)

    create table calendar (
      bucket_start timestamp primary key,
      bucket_end timestamp unique,
      CHECK (bucket_end = bucket_start + interval '1' minute)
      -- You also want a "no gaps" constraint, but I don't think you 
      -- can do that in a CHECK constraint in PostgreSQL. You might
      -- be able to use a trigger that counts the rows, and compares
      -- that count to the number of minutes between min(bucket_start)
      -- and max(bucket_start). Worst case, you can always run a report
      -- that counts the rows and sends you an email.
    );
    

    The UNIQUE constraint creates an implicit index in PostgreSQL.

    This query will insert one day's worth of rows (24 hours * 60 minutes) at a time.

    insert into calendar
    select coalesce(
                    (select max(bucket_start) from calendar), 
                     cast('2011-01-01 00:00:00' as timestamp)
                   ) 
                 + cast((n || 'minute') as interval) as bucket_start, 
           coalesce(
                    (select max(bucket_start) from calendar), 
                     cast('2011-01-01 00:00:00' as timestamp)
                   ) 
                 + cast((n + 1 || ' minute') as interval) as bucket_end
    from generate_series(1, (24*60) ) n;
    

    You can wrap that in a function to generate a year at a time. I'd probably try to commit fewer than a half million rows at a time.

    It shouldn't take too long to generate 20 million rows for testing, and another 20 million rows of "calendar" minutes. Long lunch. Maybe an afternoon in the sun.

    0 讨论(0)
  • 2021-01-04 06:52

    In the data warehouses I built I was using separate CALENDAR and TIME_OF_DAY dimensions. The first dimension has a 1 day day granularity and the second one - 1 minute granularity.

    In two other cases I knew beforehand that no reporting would be required at the granularity smaller than 15 minutes. In that case for simplicity I was using a single CALENDAR dimension with 96 records per day.

    I was using this approach in the Oracle warehouses so far, but I might be involved in a PostgreSQL warehouse project this summer.

    0 讨论(0)
提交回复
热议问题