How do you determine how far to normalize a database?

You want to start by designing a normalized database, up to 3rd normal form. As you develop the business logic layer you may decide you have to denormalize a bit, but never, never go below 3rd normal form. Always stay compliant with 1st and 2nd normal form. You want to denormalize for simplicity of code, not for performance. Use indexes and stored procedures for that :)

The reason not to "normalize as you go" is that you would have to modify the code you have already written nearly every time you modify the database design.
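
To make that concrete, here is a minimal sketch of a schema held at 3rd normal form. It uses Python's built-in sqlite3 module, and the customers/orders tables are invented purely for illustration.

    import sqlite3

    # A minimal 3NF sketch: every non-key column depends on the key,
    # the whole key, and nothing but the key.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            city        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            ordered_on  TEXT NOT NULL
            -- no customer name or city here: those facts live once
            -- in customers, so they never have to be updated twice
        );
    """)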

Here is a good article on the subject:

http://www.agiledata.org/essays/dataNormalization.html

@GrizzlyGuru A wise man once told me "normalize till it hurts, denormalize till it works".

It hasn't failed me yet :)

I disagree about starting with it in un-normalized form, however. In my experience it's been easier to adapt your application to deal with a less normalized database than a more-normalized one. It could also lead to situations where it's working "well enough" so you never get around to normalizing it (until it's too late!)

Normalization means eliminating redundant data. In other words, an un-normalized or de-normalized database is one where the same information is repeated in multiple different places. This means you have to write more complex update statements to ensure you update the same data everywhere; otherwise you get inconsistent data, which in turn means the output of queries is unreliable.

This is a pretty huge problem, so I would say denormalization hurts, not the other way around.
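
A small sqlite3 sketch of that update problem, with an invented, denormalized orders table:

    import sqlite3

    # Denormalized: the customer's city is repeated on every order row
    # instead of being stored once.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders_denorm (
            order_id      INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL,
            customer_city TEXT NOT NULL
        );
        INSERT INTO orders_denorm VALUES (1, 'Alice', 'Boston'),
                                         (2, 'Alice', 'Boston');
    """)

    # The customer moves, but the update misses one of the copies ...
    conn.execute("UPDATE orders_denorm SET customer_city = 'Chicago' "
                 "WHERE order_id = 1")

    # ... and now the same question has two different answers.
    print(conn.execute(
        "SELECT DISTINCT customer_city FROM orders_denorm "
        "WHERE customer_name = 'Alice' ORDER BY customer_city").fetchall())
    # [('Boston',), ('Chicago',)]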

In some cases you may deliberately decide to denormalize specific parts of a database, if you judge that the benefit outweighs the extra work in updating data and the risk of data corruption. For example, with data warehouses, data is aggregated for performance reasons and is often not updated after the initial entry, which reduces the risk of inconsistencies.

But in general, be wary of denormalizing for performance. For example, the performance benefit of a denormalized join can typically be achieved by using a materialized view (also called an indexed view), which will be as fast as querying a denormalized table but still protects the consistency of the data.
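
As a rough sketch of that idea: sqlite only has ordinary views, so the view below merely stands in for a true materialized/indexed view of the kind SQL Server or Oracle provide, and the tables are invented for illustration.

    import sqlite3

    # Base tables stay normalized; queries read a pre-joined shape
    # through a view. In engines with materialized/indexed views the
    # same definition could be persisted and indexed for speed.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            total       REAL NOT NULL
        );
        CREATE VIEW order_summary AS
            SELECT o.order_id, c.name AS customer_name, o.total
            FROM orders o JOIN customers c USING (customer_id);
    """)
    # Reads hit the convenient shape; writes still go to exactly one place.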

Jeff has a pretty good overview of his philosophy on his blog: Maybe normalization isn't normal. The main thing is: don't overdo normalization. But I think an even bigger point to take away is that it probably doesn't matter too much. Unless you're running the next Google, you probably won't notice much of a difference until your application grows.

Database normalization, I feel, is an art form.

You don't want to over-normalize your database, because you will have too many tables and it will cause queries of even simple objects to take longer than they should.

A good rule of thumb I follow is to normalize out information that is repeated over and over again.

For example, if you are creating a contact management application, it would make sense to have Address (Street, City, State, Zip, etc.) as its own table.

However, if you have only two types of contacts, business or personal, do you need a contact type table when you know you are only going to have two? For me, no.
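
A hypothetical sketch of that rule of thumb, with invented tables and columns: the repeated address data gets its own table, the two-value contact type stays as a plain column, and a single join puts the pieces back together for the application.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE addresses (
            address_id INTEGER PRIMARY KEY,
            street     TEXT NOT NULL,
            city       TEXT NOT NULL,
            state      TEXT NOT NULL,
            zip        TEXT NOT NULL
        );
        CREATE TABLE contacts (
            contact_id   INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            -- only two known values: a CHECK instead of a lookup table
            contact_type TEXT NOT NULL
                         CHECK (contact_type IN ('business', 'personal')),
            address_id   INTEGER REFERENCES addresses(address_id)
        );
    """)

    # One well-written join reassembles the split data.
    rows = conn.execute("""
        SELECT c.name, c.contact_type, a.street, a.city, a.state, a.zip
        FROM contacts c LEFT JOIN addresses a USING (address_id)
    """).fetchall()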

I would start by first figuring out the data types you need. Use a modeling program such as Visio to help. You don't want to start with a non-normalized database, because you will eventually have to normalize. Start by putting objects into their logical groupings; as you see data repeated, move that data into a new table. I would keep up that process until you feel the database is designed.

Let testing tell you if you need to combine tables. A well-written query can cover any over-normalization.

I believe starting with an un-normalized database and moving toward normalized as you progress is usually easiest to get started. As to how far to normalize, my philosophy is to normalize until it starts to hurt. That may sound a little flippant, but it generally is a good way to gauge how far to take it.

Having a normalized database will give you the most flexibility and the easiest maintenance. I always start with a normalized database and then un-normalize only when there is a real-life problem that needs addressing.

I view this similarly to code performance i.e. write maintainable, flexible code and make compromises for performance when you know that there is a performance problem.

The original poster never described the situation in which the database will be used. If it's going to be any type of data warehousing project where at some point you will need cubes (OLAP) processing data for some front end, it would be wiser to start off with a star schema (fact tables + dimensions) rather than looking into normalization. The Kimball books will be of great help in this case.
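
In that case a minimal star schema might look like the sketch below (again written with Python's sqlite3 module; the fact and dimension tables are invented for illustration).

    import sqlite3

    # Star-schema sketch in the Kimball style: one fact table of
    # measures, keyed by surrogate keys into flat dimension tables.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key  INTEGER PRIMARY KEY,
            full_date TEXT, year INTEGER, month INTEGER, day INTEGER
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            name TEXT, category TEXT
        );
        CREATE TABLE fact_sales (
            date_key    INTEGER NOT NULL REFERENCES dim_date(date_key),
            product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
            quantity    INTEGER NOT NULL,
            amount      REAL NOT NULL
        );
    """)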

I agree that it is typically better to start out with a normalized DB and then denormalize to solve very specific problems, but I'd probably start at Boyce-Codd Normal Form instead of 3rd Normal Form.

The truth is that "it depends." It depends on a lot of factors including:

  • Code (Hand-coded or Tool driven (like ETL packages))
  • Primary Application (Transaction Processing, Data Warehousing, Reporting)
  • Type of Database (MySQL, DB/2, Oracle, Netezza, etc.)
  • Database Architecture (Tabular, Columnar)
  • DBA Quality (proactive, reactive, inactive)
  • Expected Data Quality (do you want to enforce data quality at the application level or the database level?)

I agree that you should normalise as much as possible and only denormalise if absolutely necessary for performance. And with materialised views or caching schemes this is often not necessary.

The thing to bear in mind is that by normalising your model you are giving the database more information on how to constrain your data, so that you can remove the risk of update anomalies that can occur in incompletely normalised models.

If you denormalise, then you either need to live with the fact that you may get update anomalies, or you need to implement the constraint validation yourself in your application code. This takes away a lot of the benefit of using a DBMS, which lets you define these constraints declaratively.

So assuming the same quality of code, denormalising may not actually give you better performance.
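
As a small sketch of what that declarative enforcement buys you (sqlite3 again, with invented departments/employees tables): the DBMS itself rejects a row that breaks the rules, without any validation code in the application.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE departments (
            department_id INTEGER PRIMARY KEY,
            name          TEXT NOT NULL UNIQUE
        );
        CREATE TABLE employees (
            employee_id   INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            department_id INTEGER NOT NULL
                          REFERENCES departments(department_id)
        );
    """)

    try:
        # No department 99 exists, so the database refuses the row.
        conn.execute("INSERT INTO employees VALUES (1, 'Alice', 99)")
    except sqlite3.IntegrityError as exc:
        print("rejected by the DBMS:", exc)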

Another thing to mention is that hardware is cheap these days, so throwing extra processing power at the problem is often more cost-effective than accepting the potential costs of cleaning up corrupted data.

Often if you normalize as far as your other software will let you, you'll be done.

For example, when using object-relational mapping technology, you'll have a rich set of semantics for various many-to-one and many-to-many relationships. Under the hood that will give you join tables with, effectively, two primary key columns. While relatively rare, true normalization often gives you relations with three or more primary key columns. In cases like this, I prefer to stick with the O/R mapper and roll my own code to avoid the various DB anomalies.
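
As a hypothetical illustration of that point, here is the usual many-to-many shape in SQLAlchemy (just one example of an O/R mapper; the Post/Tag model is invented): the mapping implies an association table whose primary key is exactly the two foreign keys.

    from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    # The join table the O/R mapper manages: effectively a two-column
    # composite primary key made of the two foreign keys.
    post_tags = Table(
        "post_tags", Base.metadata,
        Column("post_id", ForeignKey("posts.id"), primary_key=True),
        Column("tag_id", ForeignKey("tags.id"), primary_key=True),
    )

    class Post(Base):
        __tablename__ = "posts"
        id = Column(Integer, primary_key=True)
        title = Column(String)
        tags = relationship("Tag", secondary=post_tags, back_populates="posts")

    class Tag(Base):
        __tablename__ = "tags"
        id = Column(Integer, primary_key=True)
        name = Column(String)
        posts = relationship("Post", secondary=post_tags, back_populates="tags")

    Base.metadata.create_all(create_engine("sqlite://"))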

Just try to use common sense.

Also, some say (and I have to agree with them) that if you find yourself joining six (the magic number) tables together in most of your queries, not counting reporting-related ones, then you might consider denormalizing a bit.

Don't forget The mother of all database normalization debates on Coding Horror (summarized on the High Scalability blog).
