Question
So I have two dimensions in my data warehouse:
dim_machine
-------------
machine_key
machine_name
machine_type
dim_tool
------------
tool_key
tool_name
machine_type
I want to make sure that the machine_type field holds the same data in both dimensions. Should I create a third dimension and snowflake it between the two, or is there another alternative?
Answer 1:
I'm not sure exactly what problem you're trying to solve. This sounds like something you would simply build into the ETL process: for both dimensions, map your source data to the same target list of machine types. If a new value appears that has no mapping, raise an error (or set a default placeholder value and review the data later).
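As a rough illustration of that ETL step, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the mapping table `MACHINE_TYPE_MAP`, the `UNKNOWN` placeholder, the function name `conform_machine_type`, and the sample rows. The only point is that both dimension loads go through the same lookup, so the two machine_type columns cannot drift apart.

```python
# Hypothetical mapping from raw source values to the shared target list
# of machine types. In a real ETL job this would likely live in a table.
MACHINE_TYPE_MAP = {
    "cnc": "CNC",
    "cnc mill": "CNC",
    "lathe": "LATHE",
    "drill press": "DRILL",
}

UNKNOWN = "UNKNOWN"  # assumed placeholder for values to review later


def conform_machine_type(raw, strict=True):
    """Map a raw source value to the conformed machine type.

    strict=True raises on an unmapped value; strict=False returns the
    placeholder instead so the load can continue.
    """
    key = (raw or "").strip().lower()
    if key in MACHINE_TYPE_MAP:
        return MACHINE_TYPE_MAP[key]
    if strict:
        raise ValueError(f"unmapped machine type: {raw!r}")
    return UNKNOWN


# Made-up source rows for the two dimensions.
machine_rows = [{"machine_name": "M-01", "machine_type": " CNC Mill "}]
tool_rows = [{"tool_name": "T-17", "machine_type": "cnc"}]

# Both loads call the same function, so both end up with the same values.
dim_machine = [
    dict(r, machine_type=conform_machine_type(r["machine_type"]))
    for r in machine_rows
]
dim_tool = [
    dict(r, machine_type=conform_machine_type(r["machine_type"]))
    for r in tool_rows
]
```

Whether to raise or fall back to the placeholder is a judgment call: raising stops bad data at the door, while the placeholder keeps the load running and leaves the cleanup for later.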
A completely different option would be a "mini-dimension" (Kimball's term) that holds all possible machine/tool combinations. If two dimensions are closely related and often used together in searches, then it can be a useful way to consolidate and simplify them. But even then, I assume you will be checking and cleaning the source data to conform the machine types.
Answer 2:
Keep in mind that a data warehouse is a denormalized structure, so it is normal for data to repeat in dimensions. The integrity should be provided by the operational system and the ETL process. Suppose we have something like the model below.
The business process that dispenses tools has to know which tool can be installed on which machine. Suppose a wrong tool is somehow installed on a machine. It is better to import data to match that fact and run a report that will discover a bug in the business process, than to break the ETL process because the tool and machine types do not match.
For example, a query (report) like this would discover a mismatch and would prove quite useful.
select
    'tool-machine mismatch' as alarm
  , d.full_date
  , m.machine_name
  , m.machine_type
  , t.tool_name
  , t.matching_machine_type
  , e.employee_full_name
from fact_installed_tools as f
join dim_machine  as m on m.machine_key  = f.machine_key
join dim_tool     as t on t.tool_key     = f.installed_tool_key
join dim_date     as d on d.date_key     = f.date_key
join dim_employee as e on e.employee_key = f.employee_key
where m.machine_type != t.matching_machine_type ;
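To try the report out, here is a small self-contained sketch using Python's built-in sqlite3 module. The table layouts are only inferred from the columns the query uses, and the column types and all sample rows are invented for the example, so treat it as an illustration and not as the actual model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed minimal schema, inferred from the columns the report references.
conn.executescript("""
create table dim_machine  (machine_key integer primary key,
                           machine_name text, machine_type text);
create table dim_tool     (tool_key integer primary key,
                           tool_name text, matching_machine_type text);
create table dim_date     (date_key integer primary key, full_date text);
create table dim_employee (employee_key integer primary key,
                           employee_full_name text);
create table fact_installed_tools (date_key integer, machine_key integer,
                                   installed_tool_key integer,
                                   employee_key integer);

-- Made-up sample data.
insert into dim_machine  values (1, 'Mill A', 'CNC'), (2, 'Lathe B', 'LATHE');
insert into dim_tool     values (10, 'End mill', 'CNC'),
                                (11, 'Boring bar', 'LATHE');
insert into dim_date     values (20101011, '2010-10-11');
insert into dim_employee values (100, 'Pat Example');

-- The second fact row installs a LATHE tool on a CNC machine.
insert into fact_installed_tools values
    (20101011, 1, 10, 100),
    (20101011, 1, 11, 100),
    (20101011, 2, 11, 100);
""")

# The mismatch report, with every column qualified by its table alias.
mismatches = conn.execute("""
select
    'tool-machine mismatch' as alarm
  , d.full_date
  , m.machine_name
  , m.machine_type
  , t.tool_name
  , t.matching_machine_type
  , e.employee_full_name
from fact_installed_tools as f
join dim_machine  as m on m.machine_key  = f.machine_key
join dim_tool     as t on t.tool_key     = f.installed_tool_key
join dim_date     as d on d.date_key     = f.date_key
join dim_employee as e on e.employee_key = f.employee_key
where m.machine_type != t.matching_machine_type
""").fetchall()

for row in mismatches:
    print(row)
```

The bad installation is loaded like any other fact, and only the report flags it. That is the trade-off described above: the ETL keeps running and the mismatch surfaces as a business-process problem.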
Source: https://stackoverflow.com/questions/3906275/how-do-i-dimensionally-model-this-relationship-in-a-kimball-style-data-warehouse