What is best practice for representing time intervals in a data warehouse?

甜味超标 2021-02-03 12:50

In particular I am dealing with a Type 2 Slowly Changing Dimension and need to represent the time interval a particular record was active for, i.e. for each record I have a start date and an end date. Should the end date be stored inclusive (the last day the record was active) or exclusive (the first day it was no longer active)?

3 Answers
  •  借酒劲吻你
    2021-02-03 13:24

    Well, standard SQL's where my_field between date1 and date2 is inclusive, so I prefer the inclusive form -- not that the exclusive one is wrong.
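
    For illustration, here is how the two conventions affect a point-in-time lookup (a sketch only; dimProduct and the rowValidFrom/rowValidTo columns are the ones used throughout this answer):

    -- Inclusive end date: rowValidTo is the last day the row was active,
    -- so standard BETWEEN works directly.
    select ProductKey
    from dimProduct
    where date '2010-06-15' between rowValidFrom and rowValidTo;

    -- Exclusive end date: rowValidTo is the first day the row was no
    -- longer active, so a half-open comparison is needed instead.
    select ProductKey
    from dimProduct
    where rowValidFrom <= date '2010-06-15'
      and rowValidTo   >  date '2010-06-15';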

    The thing is that for usual DW queries, these (rowValidFrom, rowValidTo) fields are mostly not used at all because the foreign key in a fact table already points to the appropriate row in the dimension table.

    These are mostly needed during loading (we are talking type 2 SCD here), to look up the most current primary key for the matching business key. At that point you have something like:

    select ProductKey
    from dimProduct
    where ProductName = 'unique_name_of_some_product'
      and rowValidTo > current_date ;
    

    Or, if you prefer to create a key pipeline before loading:

    insert into keys_dimProduct (ProductName, ProductKey)  -- here ProductName is PK
    select ProductName, ProductKey 
    from dimProduct
    where rowValidTo > current_date ;
    

    This helps the load, because it is easy to cache the key table in memory beforehand. For example, if ProductName is varchar(40) and ProductKey an integer, the key table is less than 0.5 GB per 10 million rows -- easy to cache for lookups.
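
    To make the lookup concrete, a fact load against the key table might look like this (a sketch; staging_sales and its SaleDate/Quantity columns are assumed for illustration):

    -- Resolve each incoming business key to its current surrogate key:
    select s.SaleDate, s.Quantity, k.ProductKey
    from staging_sales s
    join keys_dimProduct k
      on k.ProductName = s.ProductName;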

    Other frequently seen variations include where rowIsCurrent = 'yes' and where rowValidTo is null.
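
    Whichever variation is used, a type 2 load maintains it by expiring the old row and inserting a new version, roughly like this (a sketch only, assuming an inclusive rowValidTo with a far-future end date for current rows; UnitPrice is a made-up attribute, and ProductKey is assumed to be generated by the database):

    -- Expire the current row for a changed product:
    update dimProduct
    set rowValidTo = current_date - interval '1' day,
        rowIsCurrent = 'no'
    where ProductName = 'unique_name_of_some_product'
      and rowValidTo > current_date;

    -- Insert the new version of the row:
    insert into dimProduct
      (ProductName, UnitPrice, rowValidFrom, rowValidTo, rowIsCurrent, rowVersion)
    values
      ('unique_name_of_some_product', 9.99,
       current_date, date '9999-12-31', 'yes', 2);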

    In general, one or more of the following fields are used :

    • rowValidFrom
    • rowValidTo
    • rowIsCurrent
    • rowVersion

    depending on the DW designer and sometimes on the ETL tool used, because most tools have SCD type 2 loading blocks.
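
    Put together, a dimension table carrying all four fields might be declared like this (a sketch; the column sizes match the estimates below):

    create table dimProduct (
        ProductKey   integer     not null primary key,  -- surrogate key
        ProductName  varchar(40) not null,              -- business key
        rowValidFrom date        not null,
        rowValidTo   date        not null,  -- far-future date for current rows
        rowIsCurrent varchar(3)  not null,  -- 'yes' / 'no'
        rowVersion   integer     not null
    );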

    There seems to be a concern about the space used by having extra fields -- so, I will estimate here the cost of using some extra space in a dimension table, if for no other reason than convenience.

    Suppose I use all four of the row-prefixed fields.

    rowValidFrom date       = 3 bytes
    rowValidTo   date       = 3 bytes
    rowIsCurrent varchar(3) = 5 bytes
    rowVersion   integer    = 4 bytes
    

    This totals 15 bytes. One may argue that this is 9 or even 12 bytes too many -- OK.

    For 10 million rows this amounts to 150,000,000 bytes ~ 0.14 GB.

    I looked up prices on a Dell site.

    Memory ~ $38/GB
    Disk   ~ $80/TB = 0.078 $/GB 
    

    I will assume RAID 5 here (three drives) and, to keep the estimate conservative, triple the raw disk price: 0.078 $/GB * 3 = 0.23 $/GB

    So, for 10 million rows, storing these 4 fields on disk costs 0.23 $/GB * 0.14 GB = 0.032 $. If the whole dimension table is to be cached in memory, the price of these fields would be 38 $/GB * 0.14 GB = 5.32 $ per 10 million rows. In comparison, a beer in my local pub costs ~ $7.

    The year is 2010, and I do expect my next laptop to have 16GB memory. Things and (best) practices change with time.

    EDIT:

    Did some searching: over the last 15 years, the disk capacity of an average computer has increased about 1000 times, and the memory about 250 times.
