I need to store a set of entities, of which there are several specialized versions. They have some common properties, but the specialized ones contain properties specific fo
This technique can have some issues with proper design and performance, but you seem to need a compromise with flexibility.
Instead of having to create specific tables or even a new field for each property you could have a table (And it will be quite large) for unique properties:
Unique_Property_ID
, FK_To_Some_Entity (Not sure what entity these link to: customers, bills, etc.)
, Property_Type Not the data type but a link to a table describing this entity)
, Property_Value (Difficult to determine if all of your values will be of the same type: string, int, date, etc.)
Example: A used car dealership needing to track add ons for different makes and models (They never know what they're going to get). A few records may look like:
VehicleID Property Type Property Value
3 Sound System Bose
3 Hybrid System Battery
7 Starter Type Hand Crank
7 Passenger Seat Rumble
9 Starter Type Kick Start
9 Passenger Seat Side Car
Each value for each Set | property type would have its own record.
Another problem is when you do want each property to be represented as a column, you'll need to transpose this table.
This is a classic example of the gen-spec design pattern. Gen-spec is covered in every tutorial on object modeling, because it is handled by inheritance. It's frequently skipped over in tutorials on relational data modeling. But it's well understood.
Do a web sreach on "generalization specialization relational modeling". You'll see several articles on how to set up a single table for the general class and a table for each specialized class. The articles will help you with foreign key design. In particular, the primary key of each specialized table does double duty. It's also a foreign key to the generalized table.
This won't look very familiar to you if you are used to object modeling. But you'll find it works well. And the solution offered by most of the articles is not dynamic, so you'll have to do some DDL every time a new specialized subclass is discovered.
Anders, do not give up any integrity or hardness, eg type safety.
(Response coming).
@Anders. No, not at all, subtyping is fine (the question is which form you use and what are the dis/advantages). Do not give up any strength or Integrity or type safety or checks or DRI. The form you choose will demand additional Checks and maybe a bit of code (depends on your platform).
This subject is coming up frequently, but the seeker always has a narrow perspective; I keep making the same statements (a subset) from an unchanging set. The idea is to evaluate all the options. So I am writing a doc. Unfortunately it is taking longer. Maybe 4 pages. Not ready to post. But the diagrams are finished, I think you are on the ball, and you can use it right away.
Warning: Experienced Project Construction Engineers Only
Road not suitable for caravans or readers with a high Eek factor
Link to ▶Four Alternative Data Models◀ in Document Under Construction. Apologies for the mess on the floor; I will clean up soon.
▶Link to IDEF1X Notation◀ for anyone who is unfamiliar with the Standard for modelling Relational databases.
They are all Relational, with full integrity.
The 6NF options. Relational today (SQL) does not provide support for 6NF; it does not disallow it, it just does not provide the 5NF➔6NF structures. Therefore you need to build a small catalogue, what some people call "metadata". Really, it is just an extension of the standard SQL catalogue (sys tables). The level of control required is modelled in each option.
Essentially EAV done properly, with full control and integrity (type safety, Declarative Referential Integrity, etc) rather than the mess it usually is.
You may be interested in these related question/answers (in particular, look at the Data Models):
Multiple Fixed vs Abstract Flexible
Database Schema-Related Problem
"Simple" Database Design Problem
... That way, we can easily grab "Comment" rows associated with a given specialized type instance. Is this the way to do that, or will I regret that decision later? Is there any other pattern we're missing?
Not sure what you mean. Comments, Notes, Addresses, end up being used (columns resident in) in many tables, so the correct method is to Normalise them; provide One Table for Comment; that is referenced from any table that requires it. Here is a generic Comment table. It is used in Product (the supertype) because you stated any Product. It can just as easily be used in some of the Product subtypes, and not others; in which case the FK will be in said Product Subtypes.
Your Data Model
What is the purpose of the ProductType table in your Product 5NF/subtype example? Does it contain a row corresponding to each specialized Product, e.g., ProductCPU? I assume it indicates which specialization the base product is.
(Small critical mistake in the diagram, corrected.)
Yes, exactly.
In Standard Relational terms (not the uncontrolled messes passing off as databases), the ProductType is the Discriminator; it identifies which of the Product Subtypes apply to this Product. Tells you which Product Subtype table you need to join with. The pair together make a logical Product. Do not forget to produce the Views, one for each ProductType.
(Do evaluate how ProductType changes, exactly what role it plays, for each of the four Data Models.)
"Generalisation-specialisation" is all mumbo jumbo, OO terminology; without crossing the line and learning what Relational has been capable of for 30 years. If you learn a little about Relational, you will have the full power; otherwise you are limited to the very limited OO approach to everything (Ambler and Fowler have a lot to answer for). Please read this post, from 11 Dec 10 onwards. Relational databases model Entities, not objects; not classes.
For example, when adding a new product you'll want to provide, say, a dropdown selection of which product types it is possible to add. Based on this selection, it can be deduced which tables to put the data in. Correct? I'm sorry for talking about application code, but I just need to put it into perspective
Yes. And what page (with fields) to provide next, for the user to enter data.
No problem talking about the app code that will use the Rdb, they go together like husband and wife (not husband and slave).
For your OO classes, map the Class tree to the Rdb, once you have finished modelling the Rdb, independent of any app that will use it. Not the other way around. And not dependent on one app.
Forget about "persisting", it has many problems (Lost Updates; damaged data integrity; problematic debugging; massive contention; etc). All updates to the Rdb should be in Transactions, with ACID compliance, available for 30 years, but Fowler and Ambler have not read about it yet. Usually that means one stored proc pre xact.
The discriminant is a FK to a Type-table as we established earlier. It denotes which
spec.sub type the base type adheres to. But what does the discriminant table contain in detail?
Is that not clear from the data model ? ProducType CHAR(1)
or (2). Name Char(30)
.
Could be a display-friendly text stating the type for UI-purposes,
Yes, among other things, such as the control, contraint, etc, elimination of ambiguity when coding or reporting.
but does it also contain the exact table name which contains the specialized type?
No. That would be a little too physical to be placed in data. Disallowed on principle.
But it is not necessary.
Say I'm interested in the Product with ID = 1. It has a discriminant indicating that it is a ProductCPU. How would you go about retrieving this ProductCPU from your app code?
That will be easy if you take the provided model, and implement it (all the tables) as classes, correctly, etc. The example you request will not use Views (which are for lists, and more generic use). The pseudo-code would be:
ProductId
(Subtype unknown, therefore your should not be sitting a a Subtype-specific window), load the Product
supertype onlybased on the Discriminator Product.ProductType
, set indicators, etc, and load the applicable subtype, one of ProductCPU; ProductMemory; ProductDisk; ProductTape
; etc.
I have seen (and do not agree with) OO methods that load all subtypes for the given ProductId
at once: one subtype is valid; and the rest are invalid. The code still has to constrain itself to the valid class for the Product
based on Product.ProductType
.
Alternately, eg. where the context is, the user is sitting in a Subtype-specific window, eg. ProductCPU
, with that class set up, and requests ProductId
xxx. Then use the ProductCPU
View. If it returns zero rows, it does not exist.
ProductDisk
xxx, but not a ProductCPU
xxx. How you handle that, whether you indicate there is a Product`xxx but it isn't a CPU, or not, that depends on the app requirements.For lists, where the app fills in a grid, without regard to the ProductId
, use the views (one each) to load each grid. That SQL is based on the join, and does not need to refer to ProductType
.
Here are five examples from the SO. Which one to use would depend on the actual problem you are solving and you preference.
In general I would advise against serializing data into DB fields.
If there are relatively few specialized versions of an entity you could simply use sub-typing like in these examples: example one, example two.
For large values of properties, or if the properties are to be defined dynamically (no schema changes), look into an implementation of the observation pattern like in these examples: example three, example four ; or into the so-called 6th normal form as described in: example five. Note that the "observation pattern" is a simplified version of the 6th NF.
I would go for the "create a table for the common properties and then a table for each of the specialized versions" method, personally.
Reason: you say that your implementation will be done in a RDBMS and this is non-negotiable. Fine. Dumping unstructured, blob-like stuff like a serialized hashtable in a DB field goes against the design philosophy of RDBMS though, so you will have a severe hit on efficiency unless you are ok with the idea of treating the *extended_properties* field as an opaque blob, just like a gif or another binary object.
In other words, forget querying (efficiently) for "all the objects having extended property COLOR=RED".
The problem you have (describing OO taxonomies in a RDBMS) is definitely not new. Have a look at this, for a in-depth description of the options.