A SQL query searching for rows that satisfy Column1 <= X <= Column2 is very slow

盖世英雄少女心 2021-01-11 16:27

I am using a MySQL DB, and have the following table:

CREATE TABLE SomeTable (
  PrimaryKeyCol BIGINT(20) NOT NULL,
  A BIGINT(20) NOT NULL,
  FirstX INT(11) N         


        
12 answers
  •  攒了一身酷
    2021-01-11 17:02

    Eran, I believe the solution you found yourself is the best in terms of minimum cost. It is normal to take the distribution of the data in the DB into account during optimization. Moreover, in large systems it is usually impossible to achieve satisfactory performance if the nature of the data is not taken into account.

    However, this solution also has drawbacks, and the need to change the configuration parameter with every data change is the least of them. The following may be more important. Suppose that one day a very large range appears in the table, for example one whose length covers half of all possible values. I do not know the nature of your data, so I cannot say whether such a range can ever appear; this is just an assumption. From the point of view of the result, that is fine: it just means that about every second query will now return one more record. But even one such interval will completely kill your optimization, because the condition FirstX <= ? AND FirstX >= ? - [MAX(LastX - FirstX)] will no longer cut off enough records.

    Therefore, if you have no assurance that overly long ranges will never appear, I would suggest keeping the same idea but approaching it from the other side. I propose that, when loading new data into the table, you break all long ranges into smaller ones whose length does not exceed a certain value. You wrote that the important columns of this table are FirstX, LastX, Y, Z and P. So you can choose some number N once, and every time you load data into the table, replace any range with LastX - FirstX > N by several rows:

    FirstX; FirstX + N
    FirstX + N; FirstX + 2N
    ...
    FirstX + kN; LastX
    

    and for each row, keep the same values of Y, Z and P.
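
The splitting step described above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `split_range` is hypothetical, the ranges are treated as inclusive integers, and adjacent pieces share an endpoint exactly as in the listing above.

```python
def split_range(first_x, last_x, n):
    """Split [first_x, last_x] into pieces whose length is at most n.

    Hypothetical helper: adjacent pieces share an endpoint, mirroring
    the FirstX + kN boundaries in the listing above.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    pieces = []
    start = first_x
    # Cut off full-length pieces while the remainder is still longer than n.
    while last_x - start > n:
        pieces.append((start, start + n))
        start += n
    # The final piece (or the whole range, if it was already short enough).
    pieces.append((start, last_x))
    return pieces


# A range of length 25 with n = 10 becomes three rows;
# Y, Z and P would be copied unchanged onto each of them.
print(split_range(0, 25, 10))
```

Because neighbouring pieces share an endpoint, a lookup value that falls exactly on a boundary matches two pieces of the same original range, so the query side may need to de-duplicate.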

    For the data prepared that way, your query will always be the same:

    SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - N AND LastX >= ?
    

    and will always be equally effective.
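
To see the bounded query at work, here is a minimal sketch that uses Python's built-in `sqlite3` module in place of MySQL. Several things are assumptions made for illustration: the simplified schema (only the five columns named above, all as integers), the index name `idx_firstx`, the value `N = 10`, the sample rows, and the added `DISTINCT`, which is there because pre-split pieces share endpoints.

```python
import sqlite3

N = 10  # hypothetical maximum piece length chosen at load time

conn = sqlite3.connect(":memory:")
# Simplified stand-in for SomeTable; the real table has more columns.
conn.execute(
    "CREATE TABLE SomeTable ("
    "FirstX INTEGER NOT NULL, LastX INTEGER NOT NULL, "
    "Y INTEGER, Z INTEGER, P INTEGER)"
)
conn.execute("CREATE INDEX idx_firstx ON SomeTable (FirstX)")

rows = [
    (0, 10, 1, 1, 100),   # original range 0..25, already split with N = 10
    (10, 20, 1, 1, 100),
    (20, 25, 1, 1, 100),
    (40, 45, 2, 2, 200),  # short range, stored as-is
]
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?, ?, ?)", rows)


def lookup(x):
    """Return the (P, Y, Z) rows whose range contains x."""
    sql = (
        "SELECT DISTINCT P, Y, Z FROM SomeTable "
        "WHERE FirstX <= ? AND FirstX >= ? - ? AND LastX >= ?"
    )
    return conn.execute(sql, (x, x, N, x)).fetchall()


print(lookup(10))  # boundary value: two pieces match, DISTINCT collapses them
print(lookup(30))  # no range contains 30
```

The two conditions on FirstX confine the index scan to a window of width N, regardless of how long the original ranges were.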

    Now, how to choose the best value for N? I would run some experiments with different values and see which works best. The optimum may well be less than the current maximum interval length of 4200000. At first this may seem surprising, because decreasing N certainly makes the table grow, so it can become much larger than 4.3 million rows. But in fact the huge size of the table is not a problem when your query uses the index well enough, and in this case, as N decreases, the index is used more and more efficiently.
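
One side of that trade-off, the table growth, can be estimated before running any timing experiments. The sketch below is an assumption-laden illustration: `rows_after_split` is a hypothetical helper, and the list of range lengths is made-up sample input, not data from the question. It assumes the shared-endpoint splitting scheme described above, where a range of length L yields ceil(L / N) rows (at least one).

```python
def rows_after_split(lengths, n):
    """Estimate the row count after splitting ranges of the given lengths.

    Hypothetical helper: a range of length L becomes max(1, ceil(L / n)) rows.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # -(-L // n) is integer ceiling division.
    return sum(max(1, -(-length // n)) for length in lengths)


# Made-up sample of LastX - FirstX values, for illustration only.
sample_lengths = [25, 4, 20, 0]

for n in (5, 10, 100):
    print(n, rows_after_split(sample_lengths, n))
```

Running this over the real distribution of LastX - FirstX for a few candidate values of N gives the table size for each, which can then be weighed against measured query times.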
