Let's assume a database with three tables: Author, Articles, Comments.
Assuming the relationship is as follows:
Tables represent business/application relation(ship)s/associations, as in the relational model & entity-relationship modeling. Every query result holds the rows of values that are related by some business relationship expressed by the query expression.
Your "relationships" [sic] are FKs (foreign keys). Those are constraints--statements true in every business situation & its database state--saying that if some values are related by a certain business relationship then they are also related by a certain other one. But FKs are neither necessary nor sufficient for using the database--for interpreting it or updating it. They constrain the database state, but they don't tell you what's in it.
Your business relationships & corresponding tables are actually like:
Author authored Article
Commenter commented Comment re Article
Such a statement template denoting a business relationship is its (characteristic) predicate. To query using these it does not matter what the constraints are--if you want the authors who commented on articles authored by themselves that's
/* rows where
FOR SOME a.* & cr.*,
Author = a.Author
AND a.Author authored a.Article
AND cr.Commenter commented cr.Comment re cr.Article
AND a.Author = cr.Commenter
AND a.Article = cr.Article
*/
select distinct a.Author
from authored a join commented_re cr
on a.Author = cr.Commenter and a.Article = cr.Article
regardless of whether an author can author multiple articles, or multiple authors can author an article, or commenters can make multiple comments, or commenters can comment re multiple articles, or a comment can be re multiple articles, or authors can comment, or commenters can author, or commenters can only comment on articles they authored (a FK constraint), or authors named 'henk' can comment re at most 7 articles, or any other constraint whatsoever.
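As a rough illustration of that point, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are invented, and the tables deliberately declare no keys or FKs at all. The query also matches on Article, which is an assumption about what "articles authored by themselves" requires.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the predicates above.
# No PRIMARY KEY, UNIQUE or FOREIGN KEY constraints are declared on purpose.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authored (Author TEXT, Article TEXT);
    CREATE TABLE commented_re (Commenter TEXT, Comment TEXT, Article TEXT);

    INSERT INTO authored VALUES ('ann', 'a1'), ('bob', 'a2');
    INSERT INTO commented_re VALUES
        ('ann', 'c1', 'a1'),   -- ann comments on her own article
        ('ann', 'c2', 'a2'),   -- ann comments on bob's article
        ('cat', 'c3', 'a1');   -- cat has authored nothing
""")

# Authors who commented on an article they authored themselves.
rows = conn.execute("""
    SELECT DISTINCT a.Author
    FROM authored a
    JOIN commented_re cr
      ON a.Author = cr.Commenter
     AND a.Article = cr.Article
""").fetchall()
self_commenters = [r[0] for r in rows]
print(self_commenters)
```

The query is written purely from the two predicates; adding or removing constraints would change which rows can appear in the tables, not how the query is phrased.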
Normalization replaces a table by selects of it that join back to it, which is the same as saying it replaces a business relationship that is expressible via an AND by others that are expressible by the expressions that were ANDed. It happens that if an author can only write one article and an article can only be written by one author then the AND/join table above might (depending on other things) be a good design but otherwise it would not be a good design, and should be replaced by the separate tables. FDs & other constraints are the post-design table-based expression of corresponding business rules that follow from the chosen business relationships & what business situations can arise.
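To make "selects of it that join back to it" concrete, here is a small sketch with Python's sqlite3. The wide table article_comment and its rows are hypothetical, and it is joined on Article (not the join condition used in the query earlier); it only illustrates the general idea under the assumption that each article has one author.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical wide table with predicate:
    -- "Author authored Article AND Commenter commented Comment re Article"
    CREATE TABLE article_comment (Author TEXT, Article TEXT, Commenter TEXT, Comment TEXT);
    INSERT INTO article_comment VALUES
        ('ann', 'a1', 'bob', 'c1'),
        ('ann', 'a1', 'cat', 'c2'),
        ('bob', 'a2', 'ann', 'c3');

    -- The two selects of the wide table, one per ANDed predicate.
    CREATE TABLE authored AS
        SELECT DISTINCT Author, Article FROM article_comment;
    CREATE TABLE commented_re AS
        SELECT DISTINCT Commenter, Comment, Article FROM article_comment;
""")

original = set(conn.execute(
    "SELECT Author, Article, Commenter, Comment FROM article_comment"))

# Joining the two smaller tables back on Article reproduces the wide table.
rejoined = set(conn.execute("""
    SELECT a.Author, a.Article, cr.Commenter, cr.Comment
    FROM authored a
    JOIN commented_re cr ON a.Article = cr.Article
"""))
print(original == rejoined)
```

In the wide table the fact "ann authored a1" is repeated once per comment; in the two smaller tables it is stored once.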
So your "scientific approach" is proper relational information modeling and database design, including normalization.
Your first approach is a normalized design. It should be the default - it's more maintainable, less error-prone, and requires less code overall.
The second option is a denormalized design. If you think it through, it would require you to find the author of the article every time someone posts a comment, and increment the "comments" field; that's probably more code, and it makes writing the comment slower. It also means a simple bug in your "create comment" code could break the application logic, and you probably need a transaction for each comment "write" action so you can guarantee that the comment insert and the update to "authors.comment_count" either both succeed or both fail.
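A minimal sketch of what that extra write-side work might look like, using Python's sqlite3. Only authors.comment_count comes from the text above; the rest of the schema, the sample rows and the add_comment helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema; only authors.comment_count is named in the text.
    CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT,
                           comment_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE articles (id INTEGER PRIMARY KEY,
                           author_id INTEGER NOT NULL REFERENCES authors(id));
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           article_id INTEGER NOT NULL REFERENCES articles(id),
                           body TEXT);
    INSERT INTO authors (id, name) VALUES (1, 'ann');
    INSERT INTO articles (id, author_id) VALUES (10, 1);
""")

def add_comment(conn, article_id, body):
    """Insert a comment and bump the article author's counter as one unit."""
    # Extra lookup that the normalized design does not need.
    row = conn.execute(
        "SELECT author_id FROM articles WHERE id = ?", (article_id,)).fetchone()
    if row is None:
        raise ValueError(f"no such article: {article_id}")
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT INTO comments (article_id, body) VALUES (?, ?)",
            (article_id, body))
        conn.execute(
            "UPDATE authors SET comment_count = comment_count + 1 WHERE id = ?",
            (row[0],))

add_comment(conn, 10, "nice")
add_comment(conn, 10, "agreed")
count = conn.execute(
    "SELECT comment_count FROM authors WHERE id = 1").fetchone()[0]
print(count)
```

In the normalized design the same function would be a single INSERT into comments, with no lookup and no second statement to keep in step.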
So, the second option is definitely more complex, and slower for writing comments. It may be faster for querying, but as you'll be joining on primary keys, you will almost certainly not be able to measure that performance impact until you get to a database size of hundreds of millions of records.
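For comparison, this is roughly how the normalized design could answer "how many comments does each author have" at query time. The schema, index names and rows are hypothetical; the point is only that the count is derived through joins on key columns instead of being stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed normalized schema: no stored counter anywhere.
    CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE articles (id INTEGER PRIMARY KEY,
                           author_id INTEGER NOT NULL REFERENCES authors(id));
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           article_id INTEGER NOT NULL REFERENCES articles(id));
    -- Indexes on the joining columns keep the lookups cheap.
    CREATE INDEX articles_author_idx ON articles(author_id);
    CREATE INDEX comments_article_idx ON comments(article_id);

    INSERT INTO authors  VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO articles VALUES (10, 1), (11, 2);
    INSERT INTO comments VALUES (100, 10), (101, 10);
""")

# LEFT JOINs keep authors whose articles have no comments (count 0).
counts = conn.execute("""
    SELECT au.name, COUNT(c.id) AS comment_count
    FROM authors au
    LEFT JOIN articles ar ON ar.author_id = au.id
    LEFT JOIN comments c  ON c.article_id = ar.id
    GROUP BY au.id, au.name
    ORDER BY au.id
""").fetchall()
print(counts)
```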
In general, I recommend the following approach; take each step only if the previous steps haven't given you enough performance.
Denormalization is so far down the line because it introduces real maintenance risks, makes your code much more complex, and is nowhere near as effective as adding an extra 4GB to your server in most cases.
"But I don't remember where or why? Please link to a scientific approach to answer this question."
The "scientific approach" is the entire body of normalization theory.
The "redundant relationship" creates an additional problem in integrity enforcement. The system must make sure that the comment/author relationship specified by a user updating the database is the same as the one implied by the comment/article and article/author relationships.
That is a problem of additional complexity for the system when enforcing data integrity, and is a problem of additional complexity for the users doing the updating to ensure that they won't be specifying invalid updates.
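One way to picture that extra enforcement burden is a consistency check like the sketch below, written with Python's sqlite3. It assumes the redundant relationship is stored as a comments.author_id column that is supposed to repeat articles.author_id; that column, the schema and the rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, author_id INTEGER NOT NULL);
    -- Assumed redundant column: comments.author_id repeats articles.author_id.
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           article_id INTEGER NOT NULL REFERENCES articles(id),
                           author_id INTEGER NOT NULL);

    INSERT INTO articles VALUES (10, 1), (11, 2);
    INSERT INTO comments VALUES
        (100, 10, 1),   -- consistent: article 10 belongs to author 1
        (101, 11, 1);   -- inconsistent: article 11 belongs to author 2
""")

# Comments whose stored author disagrees with the author implied by the article.
violations = [row[0] for row in conn.execute("""
    SELECT c.id
    FROM comments c
    JOIN articles ar ON ar.id = c.article_id
    WHERE c.author_id <> ar.author_id
    ORDER BY c.id
""")]
print(violations)
```

Without the redundant column there is nothing to check, because the comment/author relationship exists only as the join of the other two.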
So your "second approach" might make querying "simpler" indeed, but only at the expense of creating additional complexities on the "updating" side.