Question
Let's say there is an SCD2 dimension table, location. The natural key is country, state and city combined. Since it is an SCD2 table, the effective date is also part of the key.
Is it better to have the surrogate key as usavirginarichmond20110101, or to create an actual numerical key using row_number() in Hive?
Why is one approach better than the other?
Answer 1:
(Note on terminology: a combination of natural keys is called a "composite key", not a surrogate key, and it is still a natural key. A surrogate key (also called a synthetic key) is a sequential integer that has no business meaning.)
Short answer: since your dimension is SCD2, definitely use surrogate/synthetic keys. Handling SCD with natural/composite keys is a pain.
Longer answer: surrogate key (SK) vs natural key (NK) design is an ongoing debate. Each has pros and cons. My approach is to always use surrogate keys in a data warehouse (DW). It means some extra ETL work, but that's an acceptable cost because surrogate keys have some important advantages:
SCD handling is much easier. If you have SCDs, using natural keys is rather cumbersome and ugly. Synthetic keys don't have the problem;
System-wide consistency: because of SCD, it's highly likely that you will have to use SKs in your Data Warehouse at least in some tables. It makes sense then to consistently use them in all tables. Mixing SK and NK designs is ugly;
Composite NKs can often be large and complex alpha-numeric strings. It means that they might substantially increase table sizes, and joins might be slower. SK is a simple integer, with predictable size and consistent join speed;
NKs can be a source of bugs and instability in a DW. For example, some source databases re-use their natural keys, and as a result the meaning of a key might change over time. In a DW that relies on NKs, that's a potential disaster. Also, NKs might come from a wide variety of sources and lead to integration conflicts.
There are other considerations, but in my experience, systematically using surrogate keys makes DW design more reliable and efficient.
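As a rough illustration of the "extra ETL work" mentioned above, here is a minimal Python sketch of resolving a natural key plus an event date to a surrogate key in an SCD2 dimension. The names `LocationRow`, `DIM_LOCATION`, `lookup_location_sk`, `eff_from` and `eff_to` are hypothetical and not from the question; the half-open validity interval is also an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LocationRow:
    location_sk: int          # surrogate key: sequential integer, no business meaning
    country: str              # country, state, city form the composite natural key
    state: str
    city: str
    eff_from: date            # start of validity (inclusive)
    eff_to: Optional[date]    # end of validity (exclusive); None marks the current version


# Two versions of the same natural key, each with its own surrogate key.
DIM_LOCATION = [
    LocationRow(1, "usa", "virginia", "richmond", date(2011, 1, 1), date(2015, 1, 1)),
    LocationRow(2, "usa", "virginia", "richmond", date(2015, 1, 1), None),
]


def lookup_location_sk(country, state, city, event_date, dim=DIM_LOCATION):
    """Return the surrogate key of the version valid on event_date, or None."""
    for row in dim:
        if (row.country, row.state, row.city) != (country, state, city):
            continue
        if row.eff_from <= event_date and (row.eff_to is None or event_date < row.eff_to):
            return row.location_sk
    return None


# The ETL step resolves the key once; fact rows then store only the integer.
sk_2012 = lookup_location_sk("usa", "virginia", "richmond", date(2012, 6, 1))
sk_2020 = lookup_location_sk("usa", "virginia", "richmond", date(2020, 6, 1))
```

Once the fact table carries the integer, later joins to the dimension are plain equality joins on one column and need no date-range condition.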
Answer 2:
You can partition by effective_date, so that filters and joins on the effective date read only the matching partitions.
And what will a surrogate key like usavirginarichmond20110101 give you? Full scans, because any filter on it has to go through substr. So keep country, state, city and effective_date as separate key columns and partition by effective_date.
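The difference can be sketched in plain Python, used here only as a toy model of the two layouts and not as Hive behaviour. The sample rows and the function names `find_by_date_concat` and `find_by_date_partitioned` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (country, state, city, effective_date as yyyymmdd).
rows = [
    ("usa", "virginia", "richmond", "20110101"),
    ("usa", "virginia", "norfolk", "20110101"),
    ("usa", "texas", "austin", "20120101"),
    ("usa", "texas", "dallas", "20130101"),
]

# Layout A: one concatenated string key per row.
concat_keys = ["".join(r) for r in rows]


def find_by_date_concat(keys, eff_date):
    """Filter on the date embedded in the key: every key must be inspected."""
    scanned = 0
    hits = []
    for key in keys:
        scanned += 1
        if key[-8:] == eff_date:   # substring comparison on each row
            hits.append(key)
    return hits, scanned


# Layout B: separate columns, rows grouped (partitioned) by effective_date.
partitions = defaultdict(list)
for country, state, city, eff in rows:
    partitions[eff].append((country, state, city))


def find_by_date_partitioned(parts, eff_date):
    """Read only the partition for the requested date."""
    hits = parts.get(eff_date, [])
    return hits, len(hits)


hits_a, scanned_a = find_by_date_concat(concat_keys, "20110101")
hits_b, scanned_b = find_by_date_partitioned(partitions, "20110101")
```

Both lookups return the same two locations, but the concatenated layout inspects all four keys while the partitioned layout touches only the two rows stored under that date.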
One more important point: generating a numerical key with row_number() in Hive is not a good solution, because the generation does not run in distributed mode. It is better to use a GUID for this purpose.
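To illustrate why a GUID suits distributed generation, here is a small Python sketch using the standard `uuid` module. The helper `assign_keys` and the two "worker" lists are hypothetical stand-ins for parallel tasks, not Hive code.

```python
import uuid


def assign_keys(partition_rows):
    """Attach a random GUID to each row; no shared counter or global ordering is needed."""
    return [(str(uuid.uuid4()), row) for row in partition_rows]


# Two independent "workers" assign keys to their own slice of the data.
worker_a = assign_keys([("usa", "virginia", "richmond", "20110101")])
worker_b = assign_keys([("usa", "texas", "austin", "20120101")])

all_keys = [key for key, _ in worker_a + worker_b]
```

Each worker produces keys without coordinating with the others, whereas a sequential number requires a single global ordering of all rows. The trade-off is size: a GUID in string form is 36 characters, so it is larger than an integer key.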
Source: https://stackoverflow.com/questions/51874230/is-it-better-to-have-a-surrogate-key-or-nkeffective-time-in-dimension-tables-in