What is best practice for representing time intervals in a data warehouse?

甜味超标 2021-02-03 12:50

In particular I am dealing with a Type 2 Slowly Changing Dimension and need to represent the time interval a particular record was active for, i.e. for each record I have a

3 answers

情歌与酒 (original poster)

2021-02-03 13:28

Generally I agree with David's answer (voted), so I won't repeat that info. Further to that:

Did you really mean half open ([StartDate,EndDate])?

Even in that "half-open", there are two errors. One is a straight Normalisation error: it stores duplicate data, which you identify in the discussion, which is available as derived data, and which should be removed.

To me, Half Open is (StartDate)

EndDate is derived from the next row.

it is best practice

it is not common usage, because (a) common implementors are unaware these days and (b) they are too lazy, or don't know how, to code the necessary simple subquery

it is based on experience, in large banking databases
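The "simple subquery" mentioned above can be sketched roughly as follows. This is only an illustration, not the data model from the linked question: the table CustomerHistory and its columns are hypothetical, SQLite is used purely so the example runs on its own, and dates are stored as ISO-8601 text so that they compare in date order. Only StartDate is stored; EndDate is derived as the StartDate of the next row for the same key.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
# Each row stores StartDate only; no EndDate column exists.
def load_history(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CustomerHistory ("
        " CustomerCode TEXT NOT NULL,"
        " StartDate    TEXT NOT NULL,"  # ISO-8601 text sorts in date order
        " Status       TEXT NOT NULL,"
        " PRIMARY KEY (CustomerCode, StartDate))"
    )
    conn.executemany("INSERT INTO CustomerHistory VALUES (?, ?, ?)", rows)
    return conn

# EndDate is derived: the earliest later StartDate for the same key.
# The current (latest) row has no next row, so its EndDate is NULL.
DERIVED_END_DATE = """
SELECT h.CustomerCode,
       h.StartDate,
       (SELECT MIN(n.StartDate)
          FROM CustomerHistory AS n
         WHERE n.CustomerCode = h.CustomerCode
           AND n.StartDate > h.StartDate) AS EndDate,
       h.Status
  FROM CustomerHistory AS h
 ORDER BY h.CustomerCode, h.StartDate
"""

def intervals(conn):
    return conn.execute(DERIVED_END_DATE).fetchall()

def active_as_of(conn, customer_code, as_of):
    # The active row is the latest one that started on or before as_of.
    return conn.execute(
        "SELECT StartDate, Status FROM CustomerHistory"
        " WHERE CustomerCode = ? AND StartDate <= ?"
        " ORDER BY StartDate DESC LIMIT 1",
        (customer_code, as_of),
    ).fetchone()

rows = [
    ("C1", "2021-01-01", "Bronze"),
    ("C1", "2021-03-15", "Silver"),
    ("C2", "2021-02-01", "Bronze"),
]
conn = load_history(rows)
result = intervals(conn)
for row in result:
    print(row)
```

Because the interval is start-inclusive and end-exclusive, a lookup for exactly 2021-03-15 lands on the row that starts that day, and there is no stored EndDate to keep in step when a new row is inserted.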

Refer to this for details:

Link to Recent Very Similar Question & Data Model

Responses to Comments

You seem to clearly favour normalised designs with natural, meaningful keys. Is it ever warranted to deviate from this in a reporting data warehouse? My understanding is that the extra space devoted to surrogate keys and duplicate columns (e.g. EndDate) is a trade-off for increased query performance. However, some of your comments about cache utilisation and increased disk IO make me question this. I would be very interested in your input on this.

Yes. Absolutely. Any sane person (who is not learning Computer Science from wiki) should question that. It simply defies the laws of physics.

Can you understand that many people, without understanding Normalisation or databases (you need 5NF), produce Unnormalised slow data heaps, and their famous excuse (written up by "gurus") is "denormalised for performance"? Now you know that is excreta.

Those same people, without understanding Normalisation or data warehouses (you need 6NF), (a) create a copy of the database and (b) build all manner of weird and wonderful structures to "enhance" queries, including (c) even more duplication. And guess what their excuse is? "Denormalised for performance".

It is criminal, and the "gurus" are no better, they validate it.

I would say those "gurus" are only "gurus" because they provide a pseudo scientific basis that justifies the non-science of the majority.

False information does not get any truer by repeating it, and God knows they repeat it ad infinitum.

The simple truth (not complex enough for people who justify data warehouses with (1) (2) (3)) is that 6NF, executed properly, is the data warehouse. I provide both database and data warehouse from the same data, at warehouse speeds. No second system; no second platform; no copies; no ETL; no keeping copies synchronised; no users having to go to two sources. Sure, it takes skill and an understanding of performance, and a bit of special code to overcome the limitations of SQL (you cannot specify 6NF in DDL; you need to implement a catalogue).

Why implement a StarSchema or a SnowFlake, when the pure Normalised structure already has full Dimension-Fact capability?

Even if you did not do that, if you just did the traditional thing and ETLed that database onto a separate datawarehouse system, within it, if you eliminated duplication, reduced row size, reduced Indices, of course it would run faster. Otherwise, it defies the laws of physics: fat people would run faster than thin people; a cow would run faster than a horse.
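As a rough back-of-envelope illustration of the row-size point, the sketch below estimates how many pages a full scan has to read for a wide row versus a narrower one. All figures are hypothetical (8192-byte pages, 10 million rows, 200-byte versus 120-byte rows), and the calculation ignores page headers, fill factor, compression and indexes.

```python
import math

def pages_to_scan(row_count, row_bytes, page_bytes=8192):
    # Whole rows that fit on one page (rows are assumed not to span pages).
    rows_per_page = page_bytes // row_bytes
    # Pages needed to hold every row, hence pages read by a full scan.
    return math.ceil(row_count / rows_per_page)

# Hypothetical figures, for illustration only.
wide = pages_to_scan(10_000_000, 200)    # row carrying duplicated columns
narrow = pages_to_scan(10_000_000, 120)  # same row with duplication removed

print(wide, narrow)
```

With these assumed numbers the narrower row needs 147,059 pages against 250,000, so the same scan reads roughly 40% fewer pages and more of the table fits in cache.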

Fair enough, if you don't have a Normalised structure, then anything, please, to help. So they come up with StarSchemas, SnowFlakes and all manner of Dimension-Fact designs.

And please understand, only unqualified, inexperienced people believe all these myths and magic. Educated, experienced people have their hard-earned truths; they do not hire witch doctors. Those "gurus" only validate that the fat person doesn't win the race because of the weather, or the stars; anything but the thing that will solve the problem. A few people get their knickers in a knot because I am direct, I tell the fat person to shed weight; but the real reason they get upset is that I puncture their cherished myths, which keep them justified in being fat. People do not like to change.

One thing: "Is it ever warranted to deviate?" The rules are not black-or-white; they are not single rules in isolation. A thinking person has to consider all of them together and prioritise them for the context. You will find neither all Idiot keys nor zero Idiot keys in my databases, but every Id key has been carefully considered and justified.

By all means, use the shortest possible keys, but use meaningful Relational ones over Surrogates; and use Surrogates when the key becomes too large to carry.

But never start out with Surrogates. This seriously hampers your ability to understand the data; Normalise; model the data.

Here is one ▶question/answer◀ (of many!) where the person was stuck in the process, unable to identify even the basic Entities and Relations, because he had stuck Idiot keys on everything at the start. Problem solved without discussion, in the first iteration.

OK, another thing. Learn this subject, get experience, and further yourself. But do not try to teach it or convert others, even if the lights went on and you are eager. Especially if you are enthusiastic. Why? Because when you question a witch doctor's advice, the whole village will lynch you, because you are attacking their cherished myths, their comfort; and you need my kind of experience to nail witch doctors (just check for evidence of this in the comments!). Give it a few years, get your real hard-won experience, and then take them on.

If you are interested, follow this ▶question/answer◀ for a few days, it will be a great example of how to follow IDEF1X methodology, how to expose and distil those Identifiers.
