I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia p
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Normalization is a set of rules that used to design tables that connected through relationships.
It helps in avoiding repetitive entries, reducing required storage space, preventing the need to restructure existing tables to accommodate new data, increasing speed of queries.
First Normal Form: Data should be broken up in the smallest units. Tables should not contain repetitive groups of columns. Each row is identified with one or more primary key. For example, There is a column named 'Name' in a 'Custom' table, it should be broken to 'First Name' and 'Last Name'. Also, 'Custom' should have a column named 'CustiomID' to identify a particular custom.
Second Normal Form: Each non-key column should be directly related to the entire primary key. For example, if a 'Custom' table has a column named 'City', the city should has a separate table with primary key and city name defined, in the 'Custom' table, replace the 'City' column with 'CityID' and make 'CityID' the foreign key in the tale.
Third normal form: Each non-key column should not depend on other non-key columns. For example, In an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed out in the cell and will not increase the file size, but Access will reserve that space until that point in time that you will actually use the field. So even if it is blank, then it will still be using up space and explain to them that it also slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype in values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. We instead, will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space because we are not typing the customer's name, address, etc multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will setup a drop-down list that will give you a list of customer, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, whereas 1 customer will have many different orders.
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarfication/ breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
When in a truly normalized database, I have found that in situations it's been easier to write queries that are slow because of bad join operations, bad indexing on the tables, and plain bad design on the tables themselves.
Bluntly, it's easier to write bad queries in high level normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out a few tables, without having to join to a ton of tables to get one data set.
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId
is a foreign key to Friends.Id
)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.) A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The belltower" is repeated twice, which is redundant, and makes it difficult to query and change data and (the worst) possible to introduce logical inconsistencies.
Eg. if Bob moves I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. E.g. if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress)
is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.