Which database design gives better performance?

前端 未结 4 360
你的背包
你的背包 2021-01-17 06:18

I want to select to retrieve person and also further make some inserts, deletes and updates.

If I want retrieve person who liv

相关标签:
4条回答
  • 2021-01-17 06:48

    Depending on what you want to achieve. In both cases you are entering redundant data. It is called denormalization. Second example of denormalization is called "Short-Circuit Keys":

    Person(id, name, profession, **id_city**)   
    cities (id, city, **id_country**)   
    countries (id, country)
    

    The result of that can produce fewer table joins when queries execute. First example is called "The star schema":

    Person(id, name, profession, **id_country**, **id_city**)   
    cities (id, city, **id_country**)  
    countries (id, country)
    

    The star schema consists of one or more fact tables referencing any number of dimension tables. Star schema join logic is generally simpler than the join logic required to retrieve data from a highly normalized transactional schemas. This example is usually seen in data warehouse database design.

    There is no impact on performance in any case, you can choose one or the other to retrieve the data that you need.

    0 讨论(0)
  • 2021-01-17 06:55

    This question is very similar to the one you made yestarday:

    Create many tables or just one

    The answer is also similar - that depends on what you want to achieve. Both solutions could work and both have pros and cons and one should do a little trade-off analysis in the light of the specific situation. Out of this context is not possible to answer your question.

    The only difference I see in both version is foreign key id_country in Person table:

    Person(id, name, profession, ****id_country****, id_city)
    cities (id, city, id_country)
    countries (id, country)

    The question is "do we need it?"

    So, the pros and cons of both solutions:

    1. Solution: With id_contry:

    • pros: easier retrival of a Person based on land (simpler query) and better performance of this query
    • cons: more complex underlaying DB and more redundancy, more chance of inreoducing inconsiscenties in the DB, harder updates

    2. Solution: Without id_country:

    • pros: simpler and cleaner model, no redundancy, easier maintenance
    • cons: slower performance and more complex query for retrival of a Person based on land (simpler query)

    So, the 1st solution effectivelly gives you easier query structure and better performance for retrieving Persons by Country (what you wanted), but it has its cost (see pros and cons). On the other side, pragmatic thinking says that country-city data are quite stabile and not often changed and this fact goes in favor of the 1st solution.

    If this denormalization and slight chance of inconsistencies something you can live with, you can take the 1st solution.

    0 讨论(0)
  • 2021-01-17 06:56

    (Your original post did not address performance but normalization, but it has has many edits and "performance" was introduced, maybe because your comments have mentioned it.)

    One of the points of the relational model is generic querying with automated implementation with automated optimization. Ignore performance initially. Just make a straightforward design. (Before which you must learn how to make one.) Ids have nothing to do with performance. Normalization has something to do with performance but since you should normalize to 5NF first that is moot. Foreign keys have something to do with performance but since you should be defining them for integrity their role in performance is moot. Proper designs enable later tuning.

    Anyway performance is a tradeoff of factors and if you do not know what sorts of things you want to do it is meaningless to discuss performance. (Or if you do not even know what those sorts of things are.) Also, performance-related properties must be measured to even hold an opinion that manual optimization intervention is appropriate. (And again you would have to understand what those factors even are.)

    When performance becomes a demonstrated issue because a particular application makes particular patterns of queries or updates then you can address performance. First via indexing and views to make those patterns perform better--always at the expense of others.

    The sorts of things you are mentioning (and not mentioning) and the way you are mentioning them suggest that you have misconceptions about performance and its relation to design. Also that your understanding of relational structure, querying and DBMSs is extremely low. Until you learn much more about basic design any advice you get about biasing for performance is misdirected. So just forget about performance. The main thing that adversely affects performance is premature concern for performance getting in the way of a straightforward design.

    A simplest design is

    person(id, name, profession, city, country)
        -- person [id] is named [name] and practises [profession] in [city], [country]  
    city (name, country) -- [name] uniquely names a city within country [country]
    country (name) -- [name] uniquely names a country
    

    This has certain keys and FKs, just declare them--which has nothing to do with performance. It is in 5NF.

    You may come to understand that the following design (you can add relevant constraints) may be better for you than the previous one--which will have nothing to do with performance. Then you can move to it and offer the previous tables as views to old users--which will have nothing to do with performance.

    person(id, name, profession, id_city)
        -- person [id] is named [name] and practises [profession] in [id_city]  
    city (id, name, id_country) -- city [id] is named [name] and is in country [id_country]
    country (id, name) -- country [id] is named [name]
    

    Here id_country in person would violate 5NF since it would be functionally dependent on a non-key, id_city.

    0 讨论(0)
  • 2021-01-17 07:01

    Generally speaking, data integrity is more important than performance, so denormalize1 only if you have performed measurements on representative amounts of data and the results indicate a strong need for better performance. More often that not, a normalized schema will perform just fine, especially if you got your physical design right (such as indexing).

    In this particular case, my hunch would be that the second (normalized) design will work just fine.

    That being said, probably the most efficient design for a query: "which persons live in a given country" would look something like this:

    enter image description here

    And then cluster the PERSON on its PK. This way, all persons belonging to the same COUNTRY_ID are stored physically close together in the database, dramatically reducing the I/O for the query above2.

    On the other hand, you no longer use simple auto-increment mechanism to generate your CITY_NO and PERSON_NO, the secondary indexes on PERSON are expensive because of clustering leading to other queries becoming slower etc. So, this is not absolutely "better" than your second design, they just both have different strengths/weaknesses and you'll have to decide which is a better trade-off in your particular case.


    1 Which prevents the database from "self-defending" itself from bad data. In your case, the first (denormalized) design would allow a person to reference a country that does not contain the city referenced by the same person.

    2 I/O tends to be the biggest cost for most queries.

    0 讨论(0)
提交回复
热议问题