When designing tables, I\'ve developed a habit of having one column that is unique and that I make the primary key. This is achieved in three ways depending on requirements
Here are my own rule of thumbs I have settled on after 25+ years of development experience.
The primary key is used by the database for optimization purposes and should not be used by your application for anything more than identifying a particular entity or relating to a particular entity.
Always having a single value primary key makes performing UPSERTs very straightforward.
Use additional indices to support multi-column keys which have meaning in your application.
I look for natural primary keys and use them where I can.
If no natural keys can be found, I prefer a GUID to a INT++ because SQL Server use trees, and it is bad to always add keys to the end in trees.
On tables that are many-to-many couplings I use a compound primary key of the foreign keys.
Because I'm lucky enough to use SQL Server I can study execution plans and statistics with the profiler and the query analyzer and find out how my keys are performing very easily.
I'll be up-front about my preference for natural keys - use them where possible, as they'll make your life of database administration a lot easier. I established a standard in our company that all tables have the following columns:
SUSER_SNAME()
in T-SQL))Row ID has a unique key on it per table, and in any case is auto-generated per row (and permissions prevent anyone editing it), and is reasonably guaranteed to be unique across all tables and databases. If any ORM systems need a single ID key, this is the one to use.
Meanwhile, the actual PK is, if possible, a natural key. My internal rules are something like:
EventId, AttendeeId
)So ideally you end up with a natural, human-readable and memorable PK, and an ORM-friendly one-ID-per-table GUID.
Caveat: the databases I maintain tend to the 100,000s of records rather than millions or billions, so if you have experience of larger systems which contraindicates my advice, feel free to ignore me!
What is the purpose of a table in a schema? What is the purpose of a key of a table? What is special about the primary key? The discussions around primary keys seem to miss the point that the primary key is part of a table, and that table is part of a schema. What is best for the table and table relationships should drive the key that is used.
Tables (and table relationships) contain facts about information you wish to record. These facts should be self-contained, meaningful, easily understood, and non-contradictory. From a design perspective, other tables added or removed from a schema should not impact on the table in question. There must be a purpose for storing the data related only to the information itself. Understanding what is stored in a table should not require undergoing a scientific research project. No fact stored for the same purpose should be stored more than once. Keys are a whole or part of the information being recorded which is unique, and the primary key is the specially designated key that is to be the primary access point to the table (i.e. it should be chosen for data consistency and usage, not just insert performance).
It was said that primary keys should be as small as necessary. I would says that keys should be only as large as necessary. Randomly adding meaningless fields to a table should be avoided. It is even worse to make a key out of a randomly added meaningless field, especially when it destroys the join dependency from another table to the non-primary key. This is only reasonable if there are no good candidate keys in the table, but this occurrence is surely a sign of a poor schema design if used for all tables.
It was also said that primary keys should never change as updating a primary key should always be out of the question. But update is the same as delete followed by insert. By this logic, you should never delete a record from a table with one key and then add another record with a second key. Adding the surrogate primary key does not remove the fact that the other key in the table exists. Updating a non-primary key of a table can destroy the meaning of the data if other tables have a dependency on that meaning through a surrogate key (e.g. a status table with a surrogate key having the status description changed from ‘Processed’ to ‘Cancelled’ would definitely corrupt the data). What should always be out of the question is destroying data meaning.
Having said this, I am grateful for the many poorly designed databases that exist in businesses today (meaningless-surrogate-keyed-data-corrupted-1NF behemoths), because that means there is an endless amount of work for people that understand proper database design. But on the sad side, it does sometimes make me feel like Sisyphus, but I bet he had one heck of a 401k (before the crash). Stay away from blogs and websites for important database design questions. If you are designing databases, look up CJ Date. You can also reference Celko for SQL Server, but only if you hold your nose first. On the Oracle side, reference Tom Kyte.
I follow a few rules:
On surrogate vs natural key, I refer to the rules above. If the natural key is small and will never change it can be used as a primary key. If the natural key is large or likely to change I use surrogate keys. If there is no primary key I still make a surrogate key because experience shows you will always add tables to your schema and wish you'd put a primary key in place.
All tables should have a primary key. Otherwise, what you have is a HEAP - this, in some situations, might be what you want (heavy insert load when the data is then replicated via a service broker to another database or table for instance).
For lookup tables with a low volume of rows, you can use a 3 CHAR code as the primary key as this takes less room than an INT, but the performance difference is negligible. Other than that, I would always use an INT unless you have a reference table that perhaps has a composite primary key made up from foreign keys from associated tables.