How to set up Primary Keys in a Relation?

问题

I wish to know how to correctly set up Primary Keys in a Relation. E.g. we have ER-diagram which contain elements:

Key attributes
Weak key attributes
Identifying relationships
Associative entities

In order to translate it into Relational Model we should do some tricks. All elements above deal with Primary Keys of relations but they all are Natural Keys - so we can leave them as is or replace with Surrogate Keys.
Consider some cases.

Case 1

Key Attribute is a name - so it must be of type CHAR or VARCHAR. Generally names become Key Attributes.

Case 2

Two (or more) Identifying Relationships become a Composite Primary Key of a relation (which is made of Foreign Keys).

Case 3

Identifying Relationship(s) with Weak Key Attribute(s) also become a Composite Primary Key.

Case 4

Associative entities usually have two or more Identifying Relationships so they are to be Junction Relations (Junction Tables).

How to set up primary keys for Relations in order to handle all above cases (perhaps some more cases which I did not mention)?
How to avoid using surrogate keys and in which cases are they necessary?
How to set up datatypes for primary keys?
If a composite primary key has to be passed into child relation, shall it be replaced with a surrogate?

Advantages and disadvantages of using surrogate keys in my view:

Advantages

They're compact (usually of type INT) and are sometimes good replacement for Composite Keys
They're illustrative when they're in Foreign Keys
They're painlessly indexed

Disadvantages

They're numbers and meaningless. E.g. I wish to fill up Junction Table in my Interface Application - so I will be left no other choice but to relate just numbers
They're redundant
They're confusing

As for setting up datatypes - there must be more tricks as well as setting up primary keys as whole.

Update

I should have given an example initially, but I did not. So here's an example. Consider we have two main entities which interact with each other (still don't know how to illustrate such things as diagrams here - so I'll show them as tables which are to demonstrate International Space Station crew rotation system):

SpaceShip

╔════════════════╤════════════════╗
║ ShipName       │ ShipType       ║ ShipName - Primary Key
╟────────────────┼────────────────╢ ShipType - Foreign Key (but it is
║ Soyuz TMA-14   │ Soyuz          ║   not being considered here)
║ Endeavour      │ Space Shuttle  ║
║ Soyuz TMA-15M  │ Soyuz          ║
║ Atlantis       │ Space Shuttle  ║
║ Soyuz TM-31    │ Soyuz          ║
║ ...            │ ...            ║
╚════════════════╧════════════════╝

And the Crew

╔════════╤══════════╗
║ CrewId │ SallSign ║ CrewId - Primary Key (used Id 'case crew is usually
╟────────┼──────────╢   shown as crew members - it has no particular
║ 4243   │ Astreus  ║   name)
║ 4344   │ Altair   ║ CallSign - attribute (it may not be assigned or
║ 4445   │ ...      ║   explicitly shown - i.e. it can be NULL)
║ ...    │ ...      ║
╚════════╧══════════╝

These two entities interact via Flight. Each flight delivers to the ISS one crew and returns another or the same crew. Obviously relationship between the Flight and Crew is many-to-many and it needs junction relation (table). But we can not just relate the SpaceShip and the Crew because of spaceships - spaceship can be reusable (returnable) such as Space Shuttles were.

So the Flight should look like:

╔═══════════════╤════════════╤═══════════════╤═════╗
║ ShipName      │ FlightName │ ShipFlightNum │ ... ║ ShipName, FlightName
╟───────────────┼────────────┼───────────────┼─────╢   are composite PK
║ Soyuz TM-31   │ NULL       │ 1             │ ... ║ ShipFlightNum
║ Atlantis      │ STS-117    │ 28            │ ... ║   depends on whole
║ Soyuz TMA-14  │ NULL       │ 1             │ ... ║   Composite PK
║ Endeavour     │ STS-126    │ 22            │ ... ║ ... - other
║ Soyuz TMA-15M │ NULL       │ 1             │ ... ║   attributes which
║ Endeavour     │ STS-111    │ 18            │ ... ║   depend on PK
║ Atlantis      │ STS-122    │ 29            │ ... ║
║ ...           │ ...        │ ...           │ ... ║
╚═══════════════╧════════════╧═══════════════╧═════╝

So Flight has Composite Primary Key (flight name for Soyuz vehicle the same as the spacecraft's name but it differs for reusable spacecrafts such as Space Shuttle) and it needs to be related with Crew as many-to-many. Here is the part of my complex question - if this composite Primary Natural Key should be replaced with Surrogate one?
And if we're going to work with Natural Keys further then new Junction Relation (Associative Entity) should look like:

Designation (Crew is Designed to the Flight)

╔═══════════════╤════════════╤════════╤══════════╗
║ ShipName      │ FlightName │ CrewId │ CrewType ║
╟───────────────┼────────────┼────────┼──────────╢
║ Soyuz TMA-15M │ NULL       │ 4243   │ Deliver  ║
║ Soyuz TMA-15M │ NULL       │ 4243   │ Return   ║
║ Soyuz TMA-15M │ NULL       │ 4445   │ Backup   ║
║ Soyuz TMA-16M │ NULL       │ 4344   │ Deliver  ║
║ Soyuz TMA-17M │ NULL       │ 4445   │ Deliver  ║
║ Soyuz TMA-18M │ NULL       │ 4344   │ Return   ║
║ Endeavour     │ STS-111    │ 55     │ Deliver  ║
║ Endeavour     │ STS-111    │ 44     │ Return   ║
║ Endeavour     │ STS-113    │ 55     │ Return   ║
║ ...           │ ...        │ ...    │ ...      ║
╚═══════════════╧════════════╧════════╧══════════╝

Here we have 4x Composite Primary Key which is made up of four Foreign Keys (CrewType also have FK constraint). If we use Surrogates instead of Naturals then result will be more compact but hard to fill up (in my view).

One more update

Another case for table (relation) TypeCrew:

╔══════════╗
║ CrewType ║
╟──────────╢
║ Deliver  ║
║ Return   ║
║ Backup   ║
║ ...      ║
╚══════════╝

Everyhing would be fine if only we had not to use these values in our queries (WHERE CrewType LIKE 'Backup'). If these values will be replaced with alternative meanings in other languages or even with symbols e.g. >, < and ^ for Deliver, Return and Backup respectively (WHERE CrewType LIKE '^'). Adding numerical Surrogate Key will not help much as its values may mismatch with TypeName (WHERE TypeId=2):

╔════════╤══════════╗    ╔════════╤══════════╗    ╔════════╤══════════╗
║ TypeId │ TypeName ║    ║ TypeId │ TypeName ║    ║ TypeId │ TypeName ║
╟────────┼──────────╢    ╟────────┼──────────╢    ╟────────┼──────────╢
║ 0      │ Deliver  ║    ║ 0      │ Backup   ║    ║ 0      │ >        ║
║ 1      │ Return   ║    ║ 1      │ Deliver  ║    ║ 1      │ <        ║
║ 2      │ Backup   ║    ║ 2      │ Return   ║    ║ 2      │ ^        ║
║ ...    │ ...      ║    ║ ...    │ ...      ║    ║ ...    │ ...      ║
╚════════╧══════════╝    ╚════════╧══════════╝    ╚════════╧══════════╝

Perhaps this is not a question of Relational Model? Perhaps it's just bad design? But I could not devise better.

回答1:

Position

Any practice that is not based on solid theory is not worthy of consideration. I am a strict Relational Model practitioner, with a strong grounding in the theory. The Relational Model is based on solid theory, and has never been refuted¹. There is nothing solid in what passes for "relational theory", I have taken them on, and refuted their notions in their space. Further, Relational Database design is a science, not magic, not art², therefore I can provide evidence for any of the propositions or charges that I make. My answers are from that position.

^{1. The are non-science articles, and masses of opinions from those who do not understand the science, yes, but no scientific refutation. Much like pygmies arguing that man cannot fly, it is "true" for them, but not true for mankind, it is based on a complete inability to understand the principle of flight.}

^{2. There is some art in the presentations of high-end practitioners, yes, but that does not make the science an art. It is a science, and only a science, and over and above that, it can be artfully delivered, in models and databases.}

"Relational Theory"

I wish to know how to correctly set up Primary Keys in a Relation. E.g. we have ER-diagram which contain elements:

If it was an ERD, then you wouldn't be looking at "relations", you would be looking at entities (if the diagram was early) or tables (if it were progressed). "Relations" are a wonderful abstraction which have nothing to do with an implementation. An ERD or a Data Model means an implementation (non-abstract, real) is intended, the intention to the physical leaves the abstract world of theory behind, and enters the physical world, where idiotic abstractions get destroyed.

Further the "theoreticians" who allege to be serving the database space cannot differentiate between base relations and derived relations: while that might be acceptable in the abstract context, it is dead wrong in the implementation context. Eg. base relations are tables, and they need to be Normalised; derived relations are, well, derived, views, of base relations, which by definition are flattened views (not "denormalised", which means something slightly different) of base relations. As such, they need not be Normalised.

But the "theoreticians" try to "normalise" derived relations. And the most damaged two are trying to have the definition of 1NF, that we have had for forty five years, that is fundamental and rock solid, that they themselves have supported, changed, so that their derived relations, which do not need "normalisation", can be classified as "normalised". It would be hilarious if it were not so sad.

One marvellous quality of objective truth, of science, is that it does not change. Subjective "truth", non-science, changes all the time. One can be relied upon, it must be understood before a practice is undertaken, the other is not worth reading about.

Isolation

They live in a world of their own, isolated from the reality of Relational Databases, specifically the Relational Model, and the industry that they allege to serve. In forty five years since the RM came out, they have done nothing to progress the RM or Relational databases.

Mind you, they have been progressing all sorts of notions, which are outside the Relational Model.
The progress of the RM (completion of what the Neanderthals suggest was "incomplete") has happened solely due to the standardisation (R Brown and others working with Codd, resulting in the IDEF1X Standard for Modelling Relational Databases), and the efforts of high-end SQL vendors and their customers.
That is the commercial RDBMS vendors, who were already established in the 1980's, not the Non-sql freeware/shareware/vapourware groups of the last decade, who pass off their wares as "sql", which gets you good and glued to their "platform", non-portable.

The worst part is, they publish books about their non-relational concepts, and fraudulently label them as "relational". And "professors" blindly "teach" this nonsense, like parrots, without ever understanding either the nonsense, or the Relational Model that it is supposed to explore.

If you are trying to find answers to some "educational" project, sorry, I cannot provide that, because the "education", as you can see, is totally confused, and has non-relational requirements.
I can however, provide direct answers to the question, governed by science, the Relational Model, the laws of physics, etc.

The point to take from this is, while Relational Theory and Practice were very close after Dr E F Codd published his seminal work, and during the time that the SQL Platforms were developed by the vendors, in the post-Codd era, what passes for "relational theory" is completely divorced from that original Relational Theory.

I can enumerate the differences, but not here. Note that if you read my posts that touch on this subject, you can gather those particulars, and enumerate them yourself. Or else ask a new question.

The Question

I wish to know how to correctly set up Primary Keys in a Relation. E.g. we have ER-diagram which contain elements:

There is no ERD to examine. Ok, in the Update you have an example. Perfect for your questions, because it is a set of user views of the data, and the modelling can now begin. But note, that is not an ERD or a Model. We rely on understanding the data; analysing it; classifying it, not on looking at the data values with a microscope. I realise that that is what you have been taught to do.

In order to translate it into Relational Model

Yes, that is the stated goal. The word "translate" is incorrect, because the RM is not merely a flat or fixed set of criteria that one "satisfies" or fits into (as it is known to the "theoreticians"), it also provides specific Methods and Rules. Therefore, we will be Modelling, according to the Relational Model.

we should do some tricks.

We don't need tricks, we use science, and only science. The "theoreticians" and the "professors" who follow them, need tricks, and practice non-science. I can't help in that regard. Further, the tricks they use, are usually to circumvent and subvert the Relational Model, so watch out for them.

Surrogate

All elements above deal with Primary Keys of relations but they all are Natural Keys - so we can leave them as is or replace with Surrogate Keys.

Well, there it is, your "teacher's" first trick is exposed.

Surrogates are physical Record (not row) pointers, they are not logical.
There is no such thing as a "surrogate key", the two words contradict each other.
- A Key has a specific definition in the RM, it has to be made up from the data. A surrogate isn't made up from the data, it is manufactured, a meaningless number generated by the system. Therefore it is not a Key or a "key".
- A Key in the RM has has a number of Relational qualities, which makes Keys very powerful. Since a surrogate is not a Key, it does not have any of those qualities, it has no Relational power.
- Therefore, "surrogate" and Key each have specific meanings, and they are quite fine as separate terms, but together, they are self-contradictory, because they are opposites.
- When people use them term "surrogate key", they naturally expect some, if not all, the qualities of a Key. But they will not obtain any of them. Therefore they are defrauded.
The Relational Model (the one that the theoreticians know nothing about) has a specific Access Path Independence Rule. As long as Relational Keys are used, this rule is maintained. It provides Relational Integrity¹.
- The use of a surrogate violates this rule. The consequence² is, Relational Integrity and Relational Navigation³ are both lost.
- The consequence of that is, many more joins are required to get at the same data (not less, as the lovers of mythology and magic keep parroting).
- Therefore surrogates are not permitted, on another, separate count.
Since you are in the modelling stage, either conceptual or logical, and Keys are Logical, and surrogates are physical, surrogates should not come into the picture. (They come into the picture, if at all, for consideration, only when the logical model is complete, and the physical model is being considered.) You are nowhere near completion of the Logical, so the introduction of a surrogate should raise a red flag.

The "teacher", and the author of the "textbook" that he is using, are frauds, on two separate counts:
- They are introducing a physical field, into the Logical exercise, which should not concern itself with physical aspects of the database.
- But in so doing, the effect they have is that they establish the surrogate, the physical thing, as a logical thing. Thus they poison the mind.

There, straight science, pure logic, uncontaminated by insane thinking, and thus immune to the frauds. No surrogates at the Logical stage.

^{1. Relational Integrity (which the Relational Model provides) is distinctly different to Referential Integrity (which SQL provides, and Record Filing Systems might have). If you do not understand this, please open a new question "What is the difference ..." and ping me.}

^{2. Breaking any rule has always has undesirable consequences, beyond the act itself.}

^{3. If you do not understand this, please open a new question "What is the Relational Navigation ..." and ping me.}

So the final answer to your question:

All elements above deal with Primary Keys of relations but they all are Natural Keys - so we can leave them as is or replace with Surrogate Keys.

In the conceptual and logical exercise, we deal with Logical Keys only. Physical concepts such as a surrogate are illegal. The replacement of a Logical Key with a physical creature, in the Logical exercise is rejected. Use the Keys you have, which are from the data, and natural.

Not a "Replacement"

There is one more point. The term "replacement" is incorrect. A surrogate is never a replacement or substitute for a Natural Key.

One of the many qualities that a natural Key provides, is row uniqueness, and that too, is demanded in the Relational Model, duplicate rows are not permitted.
Since a surrogate is not a Key to a row (it is a physical pointer to a record), it cannot provide the required row uniqueness. If you do not fully understand what I am saying, please read this Answer, from the top to False Teachers. Do test the given code exercises.
Therefore, a surrogate, even if considered, at the physical modelling stage, is always an additional column and index. It is not a replacement for a natural Relational Key.
And conversely, if the surrogate is implemented as a replacement, the consequence is duplicate rows, a non-relational file, not a Relational table.

Case 1

Key Attribute is a name - so it must be of type CHAR or VARCHAR. Generally names become Key Attributes.

Yes.

Often they are codes (users do use codes). Often Codes jump out at you (you have a very good example in your One More Update). { D | R | B } would do just as well { < | ^ | > }. This is of course towards the end of the logical model stage, when the model is stable, and one is finalising the Keys and optimising them. For any stage earlier than that, the wide Natural Keys stand.

The idea is to keep it meaningful.

Keys have meaning (surrogates have no meaning). One of the qualities of a Relational Key is, that that meaning is carried, wherever the Key is migrated as a Foreign Key.

And as per your example, wherever it is used. Including program code. Writing:

 IF CrewType = "Backup"  -- meaningful but fixes a value
 IF CrewType = 1         -- meaningless

is just plain wrong. Because (a) that is not really a Key, and (b) the user may well change the value of that datum from Backup to Reserve, etc. Never write code that addresses a data value, a descriptor. So the fact is, Backup is the projection of the Key, the exposition, and the code is the Key. That resolves to CrewType.Name, and the Key is CrewTypeCode.

     IF CrewTypeCode = "B"   -- Key, meaningful, not fixed

While we are on Keys, please note:

In the Relational Model, we have Primary Keys, Alternate Keys, and Foreign Keys (migrated Primary Keys).
We do not have "candidate keys", no such thing is defined in the RM. It is something manufactured outside the RM. It is therefore non-relational.

Worse, they are used by people who implement surrogates as "primary keys"^a.
A physical consideration ^b, but one that should be understood and applied throughout the exercise. When the data is understood and known, the columns will be fixed length. When they are unknown, they might be variable. For Keys, given that they will be indexed, at least on the Primary side, they should never be variable, because that requires unpacking on every access.

^{a. The use the SQL keyword PRIMARY KEY does not magically transform a surrogate into a PK. If one follows the RM, one (a) determines the possible Keys (no surrogates), and then (b) chooses one as Primary, which (c) means the election is over, therefore (d) the nominated candidates can no longer be called "candidates", the event is history, therefore (e) the remainder, the non-primary Keys, are Alternate Keys.}

^{"Candidate key" is a refusal to conform to the RM and nominate a PK, therefore, in and of itself, it is non-relational. Separate to the fact that they have a surrogate as "primary key", which is a second non-relational item.}

^{b. For those non-technical people who believe that no technical knowledge and foresight, no physical considerations at all, should be evaluated during the logical, that's fine, evaluate them at the physical. Since I am not addressing the physical here, I am just making a note for Umbra.}

Magicians rely on their tricks, to make bunny rabbits look like lions. Scientists do not need them.

Case 2

Two (or more) Identifying Relationships become a Composite Primary Key of a relation (which is made of Foreign Keys).

I think you have the right idea, but the wording is incorrect for the generic case.

That wording is correct for an Associative Table, which has two Foreign Keys. Yes, in that case, the two FKs form the PK, which is all that is needed for row uniqueness. Nothing can better that. The addition of a Record ID is superfluous.
For the generic case, for any table:
- An Identifying Relationship¹ causes the FK (migrated parent PK) to be part of the PK in the child. Hence the name, the parent Identifies the child.
- That makes the child a Dependent¹ table, meaning that the child rows can exist only in the context of a parent row. Such tables form the intermediate and leaf nodes in the Data Hierarchies, they are the majority of tables in a Relational database.
- If the row can exist independently, the table is Independent¹. Such tables form the top of each Data Hierarchy, there are very few in a Relational database.
- A Non-identifying Relationship¹ is one where the FK (migrated parent PK), is not used to form the child PK.
- Compound or Composite Keys are simply made up of more than one column, they are standard fare in Relational databases. Every table except the top of each Data Hierarchy will have a Compound Key. If you do not have any, the database is not Relational.

Please read my IDEF1X Introduction carefully.

^{1. The "theoreticians" do not differentiate Identifying vs Non-identifying, or Dependent vs Independent: all their files are Independent; all their "relationships" between record pointers are Non-identifying. It is a regression to the pre-1970's ISAM Record Filing Systems, devoid of Relational Integrity, power, and speed. That is all they understand, that is all they can teach. Fraudulently labelled as "relational".}

Case 3

Identifying Relationship(s) with Weak Key Attribute(s) also become a Composite Primary Key.

The term "weak" with or without a relationship to "key" is not defined in the Relational Model. It is a fiction of the "theoreticians". Thus I cannot answer that question.

I do note that some of the "theoretical" papers present strong Keys (normal English word, describing the fact that the Key has been established previously) as "weak", and weak "keys" (normal English word, describing the fact that the "key" has not been established previously) as "strong". Such is the nature of schizophrenia.
Therefore I suspect that it is part and parcel of their evidenced attempt to confuse the science with non-science, and to undermine the Relational Model. In the old days, when such people were locked up, humanity was healthly. Now they write books and teach in colleges.

Case 4

Associative entities usually have two or more Identifying Relationships

Yes. Two is correct.

If you have more than two, then that is not fully Normalised. Codd gives an explicit method to Normalise that, such that there will be two (or more) Associative entities, of two exactly Identifying relationships each.

"... therefore, all n-ary (more than two) relations ... can be ... and should be, resolved to binary (two) relations."
(paraphrased for this context)

so they are to be Junction Relations (Junction Tables).

No. "Junction" relations and "junction" tables are not defined in the Relational Model, therefore they are non-relational.

Associative Entities in the logical become Associative Tables in the physical.

Answer Too Long

The completion of the answer exceeded the limit for SO answers. Therefore I have placed the Answer in a single document, and provided a link. Splitting the Answer at this point proved to be a sin, thus the document contains the entire answer, with consistent formatting, etc:

Complete Answer

To continue from this point (ie. the SO Answer text, above), simply scroll down to the Case 4 heading.
There is a value in retaining the above SO Answer text, not only for historical purposes, but for text searches, etc.

回答2:

Your list of the advantages and disadvantages of using surrogate keys is a good one. As that list suggests, this topic is a complex one. And there is no uniform consensus among database designers about when surrogate keys are indicated or contraindicated. Even in this Q&A area, you will find wildly varying opinions on this subject.

I quibble with your listing of "meaningless" as a disadvantage to using surrogate keys. In many circumstances, the fact that they are meaningless is an advantage, not a disadvantage. In particular, many natural keys invented by people are not "atomic". That is, they contain multiple attributes encoded inside the key.

For example, is it is possible, given the VIN (Vehicle Identification Number) of a vehicle, to determine the passenger seating capacity of the vehicle. But that's the seating capacity as originally manufactured, and not necessarily the seating capacity at the present point in time. Since the VIN ought to be immutable, it can't be changed when one seat is ripped out. And it's now misleading.

So many teachers of database design advocate meaningless keys.

There are a couple of disadvantages of names used as natural keys that you did not mention. They are often not unique, and they are often mutable, as used by humans. For example, there could be two students named Mary Jones at a university. Or Mary Jones might change her name to Mary Smith, part way through a semester.

There is another disadvantage you didn't mention. It's misdirected data, including fraud. If SSN is used to identify employees, we have to guard against an employee giving us a fake SSN, and then later hiring a person that really owns that SSN. The database is in real trouble at that point.

This answer only touches on a few aspects of a very large topic. I suggest further reading, from authors like CJ Date.

来源：https://stackoverflow.com/questions/30639562/how-to-set-up-primary-keys-in-a-relation

标签

database

database-design

relational-database

database-schema

entity-relationship