问题
Let's say that we have D_PROCESS
, D_WORKER
and D_STATUS
as dimensions, and the fact F_EVENT
that links a process (what) with a worker (who's in charge) and the "current" status.
The process status changes over time. Shoud we store in F_EVENT
one line per process/status/worker, or one line per process/worker, and "somewhere else" one line per status change for a given process/worker?
I'm new to Datawarehouse and it's hard to find best practices/tutorial related to data modelization.
回答1:
Read The Data Warehouse Toolkit by Ralph Kimball for a good introduction to dimensional modeling.
It sounds like you are storing a process change event in F_EVENT. If this process has a defined beginning and end, I would build a snapshot fact table which would let you track the process over time (simply updating the row each time the process moves from one step to another).
EDIT:
I'll try to make a general case using your dimensions as examples.
For D_PROCESS, modeling a "process" isn't usually modeled as a dimension, and you called it a "what", so I'm going to rename this to "D_ACCOUNT".
The basic data model will be for a "tax processing system" in which WORKERS are processing ACCOUNTS, and each ACCOUNT/WORKER combination has several possible "STATUSES" of where this process currently stands.
D_ACCOUNT
ACCOUNT_NUMBER
ACCOUNT_TYPE
D_WORKER
WORKER_ID
FIRST_NAME
LAST_NAME
BADGE_NUMBER
SHIFT
D_STATUS
STATUS_ID
STATUS_NAME
Now if I want to report on all "events" that have happened to an Account, performed by a worker, I can build a Transaction-level fact table F_EVENT:
F_EVENT
ACCOUNT_ID
WORKER_ID
STATUS_ID
EVENT_TIME_ID
Metrics taken at time of the measurement (Cost, Worker time spent, etc)
We call the unique combination of dimensions that identifies a row the Granularity or Grain of the fact table.
The grain of this table is Account, Worker, Status, and Time. It answer questions like "How much time did my workers on shift 3 spend processing accounts on Wednesday?" or "How many events occured that changed the processing status to "CLOSED"?
I'm not sure how much this type of table would help.
Instead, say you are interested in tracking the process itself as it moves through various statuses. I'm going to assume that the status always moves forward in time, from "NOT STARTED" to "IN PROCESS" to "CLOSED".
I'll build what Kimball calls an "Accumulating Snapshot Fact table.
F_TAXPROCESSING
ACCOUNT_ID
WORKER_ID
CURRENT_STATUS_ID
NOT_STARTED_DTTM
NOT_STARTED_FLAG
IN_PROCESS_DTTM
IN_PROCESS_FLAG
CLOSED_DTTM
CLOSED_FLAG
This table's grain is Account, Worker. This table keeps track of the "process" by updating the date/time of the change to the status, and a flag when that status has been reached.
This allows you to track the process over time, allowing you to see how many accounts have reacted the "IN PROCESS" status, how long it took to get there, et cetera.
来源:https://stackoverflow.com/questions/11114988/how-to-model-process-and-status-history-in-a-data-warehouse