Why are NULL values mapped to 0 in fact tables?

深忆病人 2020-12-21 00:40

Why are NULL values in the measure fields of fact tables (in dimensionally modeled data warehouses) usually mapped to 0?

4 Answers
  • 2020-12-21 01:16

    It depends upon what you're modeling, but in general it's to avoid complications with performing aggregates. And in many scenarios it makes sense to treat NULL as 0 for those purposes.

    For example, a customer with NULL orders for a given period of time, or a salesperson with NULL sales revenue (shame on him!), can reasonably be recorded as having 0 orders or 0 revenue.
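
    One concrete complication is row-level arithmetic: any expression involving NULL evaluates to NULL, and the aggregate then skips that row. Here is a minimal sketch using Python's built-in sqlite3 module; the table fact_sales and its columns are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical fact table: the discount measure is missing (NULL) for order 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_id INTEGER, revenue REAL, discount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 100.0, 10.0), (2, 200.0, None), (3, 50.0, 0.0)],
)

# revenue - discount is NULL for order 2, so SUM silently skips that row.
net_with_null = conn.execute(
    "SELECT SUM(revenue - discount) FROM fact_sales"
).fetchone()[0]

# Treating the missing discount as 0 keeps order 2 in the total.
net_with_zero = conn.execute(
    "SELECT SUM(revenue - COALESCE(discount, 0)) FROM fact_sales"
).fetchone()[0]

print(net_with_null)  # 140.0
print(net_with_zero)  # 340.0
```

    Whether 340 is the right total depends on whether a missing discount really means "no discount"; the sketch only shows why a NULL in one measure can drop a whole row from a derived aggregate.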

  • 2020-12-21 01:18

    Use NULL instead of 0 if you intend to compute an average over your fact column. I believe this is the only case where NULLs are acceptable in a DWH fact or dimension table.

    If a fact value is unknown or late-arriving, leaving it as NULL is best.

    Aggregate functions such as MIN and MAX simply ignore NULLs.

    (For the record, one of Ralph Kimball's sidekicks said this in a course I attended.)

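    -- unknown value stored as NULL: avg is 2.5 (5 / 2 known values), min is 1
    with goodf as
    (
    select 1 as x
    union all
    select null
    union all
    select 4
    )
    select sum(x) as sumx, min(x) as minx, max(x) as maxx, avg(cast(x as float)) as avgx
    from goodf;

    -- unknown value stored as 0: avg drops to about 1.67 (5 / 3 rows), min becomes 0
    with badf as
    (
    select 1 as x
    union all
    select 0 /* unknown */
    union all
    select 4
    )
    select sum(x) as sumx, min(x) as minx, max(x) as maxx, avg(cast(x as float)) as avgx
    from badf;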
    

    In badf above, the average comes out wrong because the 0 standing in for the unknown value is counted as a literal 0.

  • 2020-12-21 01:25

    The main reason is that the database treats nulls differently from blanks or zeros, even though they look like blanks or zeros to the human eye.
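
    To make "treats nulls differently" concrete, here is a small sketch run through Python's built-in sqlite3 module. The table fact_m and column x are made up for the example. A NULL matches neither x = 0 nor x <> 0, and COUNT(x) skips it while COUNT(*) does not.

```python
import sqlite3

# Hypothetical measure column holding a real zero and a NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_m (x INTEGER)")
conn.executemany("INSERT INTO fact_m VALUES (?)", [(1,), (0,), (None,), (4,)])

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

zero_rows = scalar("SELECT COUNT(*) FROM fact_m WHERE x = 0")      # only the real zero
nonzero_rows = scalar("SELECT COUNT(*) FROM fact_m WHERE x <> 0")  # the NULL row is not here either
null_rows = scalar("SELECT COUNT(*) FROM fact_m WHERE x IS NULL")  # NULL needs IS NULL
total_rows = scalar("SELECT COUNT(*) FROM fact_m")                 # counts every row
counted_values = scalar("SELECT COUNT(x) FROM fact_m")             # skips the NULL

print(zero_rows, nonzero_rows, null_rows)  # 1 2 1
print(total_rows, counted_values)          # 4 3
```

    Note that the rows matching x = 0 and x <> 0 add up to 3, not 4, which is the kind of surprise report users run into when measures contain NULLs.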

    Here is a link to an old design tip by Ralph Kimball on the same topic.

    This blogpost talks about avoiding nulls in measures and gives a couple of suggestions.

  • 2020-12-21 01:31

    Although you've already accepted another answer, I would say that using NULL is actually a better choice, for a couple of reasons.

    The first reason is that aggregates return the 'correct' answer (i.e. the one that users tend to expect) when NULL is present but give the 'wrong' answer when you use zero. Consider the results from AVG() in these two queries:

    -- with zero: AVG gives 1.5 (6 / 4 rows)
    select SUM(measure), AVG(measure)
    from
    (
    select 1.0 as measure
    union all
    select 2.0
    union all
    select 3.0
    union all
    select 0
    ) dt;

    -- with null: AVG gives 2 (6 / 3 non-NULL rows)
    select SUM(measure), AVG(measure)
    from
    (
    select 1.0 as measure
    union all
    select 2.0
    union all
    select 3.0
    union all
    select null
    ) dt;
    

    If we assume that the measure here is "number of days to manufacture item" and NULL represents an item that is still being produced then zero gives the wrong answer. The same reasoning applies to MIN() and MAX() too.

    The second issue is that if zero is a default value, then how do you distinguish between zero as a default and zero as a real value? For example, consider a measure of "shipping charges in EUR" where NULL means that the customer picked up the order himself so there were no shipping charges and zero means the order was shipped to the customer for free. You can't use zero to replace NULL without completely changing the meaning of the data. You can obviously argue that the distinction should be clear from other dimensions (e.g. shipping method) but that adds more complexity to reports and understanding the data.
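
    The shipping example can be sketched as follows with Python's built-in sqlite3 module. The table fact_orders, the column shipping_eur, and the sample rows are assumptions made up for illustration: NULL stands for a customer pickup and 0 for free shipping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, shipping_eur REAL)")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?)",
    [(1, 4.5), (2, 0.0), (3, None), (4, 7.5), (5, None)],
)

QUERY = """
SELECT
  SUM(CASE WHEN shipping_eur IS NULL THEN 1 ELSE 0 END) AS picked_up,
  SUM(CASE WHEN shipping_eur = 0 THEN 1 ELSE 0 END)     AS shipped_free,
  AVG(shipping_eur)                                     AS avg_charge
FROM fact_orders
"""

# With NULLs kept, pickups and free shipments stay distinguishable.
before = conn.execute(QUERY).fetchone()

# Replacing NULL with 0 merges the two cases and pulls the average down.
conn.execute("UPDATE fact_orders SET shipping_eur = COALESCE(shipping_eur, 0)")
after = conn.execute(QUERY).fetchone()

print(before)  # (2, 1, 4.0)
print(after)   # (0, 3, 2.4)
```

    After the update there is no way to recover from this column alone which of the three zero rows were pickups, which is the loss of meaning described above.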
