How do I not normalize continuous data (INTS, FLOATS, DATETIME, …)?

Submitted by 冷暖自知 on 2019-12-31 04:48:07

Question


According to my understanding (and correct me if I'm wrong), "normalization" is the process of removing redundant data from the database design.

However, while learning about database optimization and performance tuning, I found that Mr. Rick James recommends against normalizing continuous values such as INTs, FLOATs, DATETIMEs, ...

"Normalize, but don't over-normalize." In particular, do not normalize datetimes or floats or other "continuous" values.

source

Sure purists say normalize time. That is a big mistake. Generally, "continuous" values should not be normalized because you generally want to do range queries on them. If it is normalized, performance will be orders of magnitude worse.

Normalization has several purposes; they don't really apply here:

  • Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings

  • To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevant here; each time was independently assigned, and you are not a Time Lord.

  • In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still sucks.

source

Do not normalize "continuous" values -- dates, floats, etc -- especially if you will do range queries.

source.

I tried to understand this point but couldn't. Can someone please explain it to me and give an example of the worst case, one where applying this rule improves performance?

Note: I could have asked him in a comment, but I wanted to document and highlight this point on its own, because I believe it is a very important note that affects the performance of almost my entire database.


Answer 1:


The Comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?

Let me elaborate on my 'claim' with this example... Some DBAs replace a DATE with a surrogate ID; this is likely to cause significant performance issues when a date range is used. Contrast these:

-- single table
SELECT ...
    FROM t
    WHERE x = ...
      AND date BETWEEN ... AND ...;   -- `date` is of datatype DATE/DATETIME/etc

-- extra table
SELECT ...
    FROM t
    JOIN Dates AS d  ON t.date_id = d.date_id
    WHERE t.x = ...
      AND d.date BETWEEN ... AND ...;  -- Range test is now in the other table

Moving the range test to a JOINed table causes the slowdown.

The first query is quite optimizable via

INDEX(x, date)
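
As a minimal sketch of that index in context: only `t`, `x`, and `date` come from the query above. The column types, the `id` column, the index name, and the literal values are assumptions made for illustration.

```sql
-- Hypothetical single-table schema; types, `id`, and the index name are assumed.
CREATE TABLE t (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    x      INT NOT NULL,
    `date` DATE NOT NULL,
    PRIMARY KEY (id),
    INDEX x_date (x, `date`)   -- equality column first, range column last
);

-- The equality on x narrows the search to one contiguous slice of the index;
-- the range on `date` is then a single scan within that slice.
SELECT id
    FROM t
    WHERE x = 123
      AND `date` BETWEEN '2019-01-01' AND '2019-01-31';
```

Prefixing the SELECT with EXPLAIN should show whether the optimizer actually picks the composite index for a given data set.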

In the second query, the Optimizer will (for MySQL at least) pick one of the two tables to start with, then do a somewhat tedious back-and-forth to the other table to handle the rest of the WHERE. (Other engines have other techniques, but there is still a significant cost.)
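
To make the back-and-forth concrete, here is a rough sketch of what the two-table design might look like. The table and column names `t`, `Dates`, `x`, `date_id`, and `date` come from the query above. The types, the `id` column, the index names, and the literal values are assumptions, and the schema is meant for a scratch database of its own.

```sql
-- Hypothetical lookup table that replaces each date with a surrogate id.
CREATE TABLE Dates (
    date_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `date`  DATE NOT NULL,
    PRIMARY KEY (date_id),
    UNIQUE KEY (`date`)
);

-- Hypothetical main table; it stores date_id instead of the date itself.
CREATE TABLE t (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    x       INT NOT NULL,
    date_id MEDIUMINT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    INDEX x_date_id (x, date_id)
);

-- No single index holds both x and the real date value.
-- Starting from t: each row with x = 123 needs a lookup in Dates.
-- Starting from Dates: each date in the range needs a lookup into t.
SELECT t.id
    FROM t
    JOIN Dates AS d  ON t.date_id = d.date_id
    WHERE t.x = 123
      AND d.`date` BETWEEN '2019-01-01' AND '2019-01-31';
```

Note also that surrogate ids are not guaranteed to be assigned in date order, so a range on `date` cannot in general be rewritten as a range on `date_id`.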

DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamations about it applying to any "continuous" datatypes (ints, dates, floats).

Even if you don't have a range test, there may be no performance benefit from the secondary table. I often see a 3-byte DATE being replaced by a 4-byte INT, thereby making the main table larger! A "composite" index almost always will lead to a more efficient query for the single-table approach.



Source: https://stackoverflow.com/questions/49751952/how-do-i-not-normalize-continuous-data-ints-floats-datetime
