When would a database design be described as overnormalized? Is this characterization an absolute one? Or is it dependent on the way it is used in the application? Thanks.
Third Normal Form (3NF) is considered the optimal level of normalization for many a rational database application. This is a state in which, as Bill Kent once summarized, every "non-key field [in every table within a particular a relational database management system, or RDBMS] must provide a fact about the key, the whole key, and nothing but the key." 3NF is a term that was introduced by E.F. Codd, inventor of the relational model for database management. Generally, the data that a software application is dependent on, especially an application used for an Online Transaction Processing System (OLTP), will fare well in 3NF. This normal form by definition reduces database size by calling for a minimum repetition of row/column data, and maximizes query efficiency and ease of application maintenance. 3NF achieves that by requiring that a database's tables (i.e., its schema) be broken down into separate tables related by primary/foreign keys--basically until Kent's rule holds true (well, I've stated it this way for ease of reading but the actual definition of 3NF is much more detailed than that). In contrast, overnormalization implies increasing the number of joins required in a query between related tables. This comes as a result of breaking down the database schema into a much more granular level than 3NF. However, though normalization past the 3rd degree can often be considered overnormalization, the negative connotation of the term "overnormalization" can sometimes be unwarranted. Overnormalization may be desirable in some applications which by design require 4NF (and beyond) due to the complexity and versatility of the application software. An example of that is a highly customizable and extensible commercial database program for some industry in which it is sold to end users requiring an open API. But then the reverse can be desirable as well--that is, denormalization--most notably, when designing an Online Analytical Processing (OLAP) database used strictly to summarize data from an OLTP database just for querying/reporting--such as a data warehouse. In this case the data must by necessity reside in a highly denormalized format (i.e, 1NF, or 2NF). It's often under these constraints--when there are high demands for efficient querying and reporting--that we find database and application programmers calling a database, "overnormalized". But as Redgate's Tony Davis once said--taking into account today's much more advanced and efficient database software and storage systems--"the performance hit from multiple joins in a query is negligible. If your database is slow, it isn’t because it is ‘over-normalized’!" So in conclusion, this characterization--overnormalization--isn't an absolute one, and it is dependent on the way it is used in the application. In Kent's words, "The normalization rules are designed to prevent update anomalies and data inconsistencies. . . [but] there is no obligation to fully normalize all records when actual performance requirements are taken into account. . . The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. . . [Thus,] the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications."
If performance is affected by too many joins, creating de-normalized tables for reporting purposes can speed things up. By copying the data into new tables, it may be possible to run reports with no joins at all.
..or hitting limits on the number of joins your RDBMS will do.
My take on this:
Always normalize as much as you are able to do. I usually go crazy on normalization, and try to design something that could handle every thinkable future extensions. What I end up with is a database design that is extremely flexible... and impossible to implement.
Then the real job starts: De-normalization. Here you solve what you know would be problematic to implement and/or would slow the queries down because of too many joins.
This way you know what you scarify for make the design usable.
Edit: Documentations! I forgot to mention that documenting the de-normalization is very important. It is extremely helpful when you take over a project to know the reason behind the choices.
It's always a question of the application domain. It's usually a question of correctness, but occasionally a question of performance.
There's one case where I can think of a prima facie case of overnormalization: say you have an order + orderitem, and the orderitem references productID, and leaves pricing to the product.price. Since that introduces temporal coupling, you've incorrectly normalized because the overnormalization affects already shipped orders, unless prices absolutely never change. You can certainly argue that this is simply a modeling error (as in the comments), but I see under-normalization as a modeling error in most cases, too.
The other category is performance related. In principle, I think there are generally better solutions to performance than denormalizing data, such as materialized views, but if your application suffers from the performance consequences of many joins, it may be worth assessing whether denormalizing can help you. I think these cases are often over-emphasized, because people sometimes reach for denormalization before they properly profile their application.
People also often forget about alternatives, like keeping a canonical form of the database and using warehousing or other strategies for frequently-read, but infrequently changed data.