How do I know which is the more appropriate database design? (Authors, Articles & Comments)

后端未结

关注

 3  1915

心在旅途

Let\'s assume a database with three tables: Author, Articles, Comments

Assuming the relationship is as follows:

相关标签:

3条回答

青春惊慌失措

2021-01-17 05:46
Tables represent business/application relation(ship)s/associations. As in the relational model & entity-relationship modeling. Every query result holds the rows of values that are related by some business relationship expressed by the query expression.

Your "relationships" [sic] are FKs (foreign keys). Those are constraints--statements true in every business situation & its database state--saying that if some values are related by a certain business relationship then they are also related by a certain other one. But FKs are neither necessary nor sufficient for using the database--for interpreting it or updating it. They constrain the database state, but they don't tell you what's in it.

Your business relationships & corresponding tables are actually like:
```
Author authored Article
Commenter commented Comment re Article
```
Such a statement template denoting a business relationship is its (characteristic) predicate. To query using these it does not matter what the constraints are--if you want the authors who commented on articles authored by themselves that's
```
/* rows where
FOR SOME a.* & cr.*,
        Author = a.Author
    AND a.Author authored a.Article
    AND cr.Commenter commented cr.Comment re cr.Article
    AND a.Author = cr.Commenter
*/
select Author
from authored a join commented_re cr on a.Author = cr.Commenter
```
regardless of whether an author can author multiple articles, or multiple authors can author an article, or multiple authors can author multiple articles, or commenters can comment re multiple comments, etc, or commenters can comment re multiple articles, etc, or a comment can be re multiple articles, etc, or authors can comment, or commenters can author, or commenters can only comment on articles they authored (a FK constraint) or authors named 'henk' can comment re at most 7 articles, or any constraint whatsoever.

Normalization replaces a table by selects of it that join back to it, which is the same as saying it replaces a business relationship that is expressible via an AND by others that are expressible by the expressions that were ANDed. It happens that if an author can only write one article and an article can only be written by one author then the AND/join table above might (depending on other things) be a good design but otherwise it would not be a good design, and should be replaced by the separate tables. FDs & other constraints are the post-design table-based expression of corresponding business rules that follow from the chosen business relationships & what business situations can arise.

So your "scientific approach" is proper relational information modeling and database design, including normalization.
0 讨论(0)
发布评论:

提交评论
- 加载中...
没有蜡笔的小新

2021-01-17 05:51
Your first approach is a normalized design. It should be the default - it's more maintainable, less error-prone, and requires less code overall.

The second option is a denormalized design. If you think it through, it would require you to find the author for the article every time someone posts a comment, and increment the "comments" field; that's probably more code, and makes writing the comment slower. It also means a simple bug in your "create comment" code could break the application logic, and you probably need to create a transaction for each comment "write" action so you can guarantee that both the comment and update to "authors.comment_count" succeeds or fails.

So, the second option is definitely more complex, and slower for writing comments. It may be faster for querying, but as you'll be joining on primary keys, you will almost certainly not be able to measure that performance impact until you get to a database size of hundreds of millions of records.

In general, I recommend the following approach; take each step only if the previous steps haven't given you enough performance.
- design a relational model.
- tune that relational database (indexes, etc.)
- improve the hardware - RAM, CPU, SSD disks etc.
- create a measurement rig so you can identify the performance challenges and run experiments. Create benchmarks based on current and expected data sizes; find a way to fill your test rig with dummy data until you have the data volume you need to scale to.
- run your queries on the test rig. Make sure there are no further performance tweaks from indexing or query optimization.
- introduce application-level caching. In your example, caching the number of comments for an author for 1 hour may be acceptable.
- de-normalize your schema. Use your test rig to prove it gives you the performance you expect.
- look at more exotic data solutions - sharding, data partitioning etc.
Denormalization is so far down the line because it introduces real maintenance risks, makes your code much more complex, and is nowhere near as effective as adding an extra 4GB to your server in most cases.
0 讨论(0)
发布评论:

提交评论
- 加载中...
你的背包

2021-01-17 05:57

"But I don't remember where or why? Please link to a scientific approach to answer this question."

The "scientific approach" is the entire body of normalization theory.

The "redundant relationship" creates an additional problem in integrity enforcement. The system must make sure that the comment/author relationship as specified by a user updating the db, is the same as the one implied by the comment/article and article/author relationships.

That is a problem of additional complexity for the system when enforcing data integrity, and is a problem of additional complexity for the users doing the updating to ensure that they won't be specifying invalid updates.

So your "second approach" might make querying "simpler" indeed, but only at the expense of creating additional complexities on the "updating" side.

0 讨论(0)
发布评论:

提交评论
- 加载中...