A SQL query searching for rows that satisfy Column1 <= X <= Column2 is very slow

盖世英雄少女心 2021-01-11 16:27

I am using a MySQL DB, and have the following table:

CREATE TABLE SomeTable (
  PrimaryKeyCol BIGINT(20) NOT NULL,
  A BIGINT(20) NOT NULL,
  -- the remaining columns and keys are reconstructed from the answers below
  FirstX INT(11) NOT NULL,
  LastX INT(11) NOT NULL,
  P INT(11) NOT NULL,
  Y INT(11) NOT NULL,
  Z INT(11) NOT NULL,
  PRIMARY KEY (PrimaryKeyCol),
  UNIQUE KEY FirstLastXPriority_Index (FirstX, LastX, P)
) ENGINE=InnoDB;
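
The slow query under discussion (its shape is reflected in the title and quoted in the answers below) is of the form:

SELECT P, Y, Z
FROM SomeTable
WHERE FirstX <= ? AND LastX >= ?;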


        
12 Answers
  • 2021-01-11 16:58

    So, I don't have enough data to be sure of the run time, and I believe this will only work if column P is unique. To get two indexes working, I created two indexes and used the following query...

    Index A - FirstX, P, Y, Z
    Index B - P, LastX
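
    A sketch of the corresponding DDL (index names are illustrative; asdf is the placeholder table name used in the queries below):

    CREATE INDEX index_a ON asdf (FirstX, P, Y, Z);
    CREATE INDEX index_b ON asdf (P, LastX);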
    

    This is the query

    select A.P, A.Y, A.Z 
    from 
        (select P, Y, Z from asdf A where A.firstx <= 185000 ) A
        join 
        (select P from asdf A where A.LastX >= 185000 ) B
        ON A.P = B.P
    

    For some reason this seemed faster than

    select A.P, A.Y, A.Z 
    from asdf A join asdf B on A.P = B.P
    where A.firstx <= 185000 and B.LastX >= 185000
    
  • 2021-01-11 16:59

    It seems that the only way to make the query fast is to reduce the number of fetched and compared fields. Here is the idea.

    We can declare a new indexed field (for instance UNSIGNED BIGINT) and store both values, FirstX and LastX, in it, using an offset for one of the fields.

    For example:

    FirstX     LastX      CombinedX
    100000     500000     100000500000
    150000     220000     150000220000
    180000     190000     180000190000
    550000     660000     550000660000   
    70000      90000      070000090000 
    75         85         000075000085
    

    An alternative is to declare the field as DECIMAL and store FirstX + LastX / MAX(LastX) in it. Later, look for the values satisfying the conditions by comparing against the single field CombinedX.

    APPENDED

    Then you can fetch the rows by checking only one field. For example, with param1 = 160000:

    SELECT * FROM new_table
    WHERE
      (CombinedX < (160000 + 1) * 1000000) AND  -- FirstX <= 160000
      (CombinedX % 1000000 >= 160000);          -- LastX  >= 160000
    

    Here I assume that FirstX < LastX for all rows. Of course, you can calculate param1 * offset in advance and store it in a variable against which the comparisons are done. You could also use bitwise shifts instead of decimal offsets; decimal offsets were chosen here only because they are easier for a human to read in the sample.
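
    A sketch of how such a combined column could be added and indexed, assuming the 1,000,000 offset used above (column and index names are illustrative):

    ALTER TABLE SomeTable
      ADD COLUMN CombinedX BIGINT UNSIGNED NOT NULL DEFAULT 0;

    -- CombinedX packs both bounds: FirstX shifted by the offset, plus LastX.
    UPDATE SomeTable SET CombinedX = FirstX * 1000000 + LastX;

    ALTER TABLE SomeTable ADD INDEX CombinedX_Index (CombinedX);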

  • 2021-01-11 17:02

    Eran, I believe the solution you found yourself is the best in terms of cost. It is normal to take the distribution properties of the data into account during optimization; moreover, in large systems it is usually impossible to achieve satisfactory performance if the nature of the data is not taken into account.

    However, this solution also has drawbacks, and the need to change the configuration parameter with every data change is the least of them. More important is the following. Suppose that one day a very large range appears in the table; for example, let its length cover half of all possible values. I do not know the nature of your data, so I cannot say whether such a range can ever appear, so this is just an assumption. From the point of view of the result it is fine: it just means that roughly every second query will now return one more record. But even a single such interval will completely kill your optimization, because the condition FirstX <= ? AND FirstX >= ? - MAX(LastX - FirstX) will no longer cut off enough records.

    Therefore, if you cannot be sure that such long ranges will never appear, I would suggest keeping the same idea but approaching it from the other side. When loading new data into the table, break every long range into smaller ones whose length does not exceed a certain value. You wrote that the important columns of this table are FirstX, LastX, Y, Z and P. So choose some number N once, and every time you load data into the table, replace any range with LastX - FirstX > N by several rows:

    FirstX; FirstX + N
    FirstX + N; FirstX + 2N
    ...
    FirstX + kN; LastX
    

    and for each row keep the same values of Y, Z and P.
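
    A sketch of one way to generate such split rows during loading (assumes MySQL 8+ for recursive CTEs; raw_data is a hypothetical staging table holding the original, unsplit ranges, and N is fixed at 100000 here):

    -- Expand every range longer than N into pieces of length at most N.
    WITH RECURSIVE pieces AS (
      SELECT FirstX, LastX, P, Y, Z
      FROM raw_data
      UNION ALL
      SELECT FirstX + 100000, LastX, P, Y, Z   -- next piece starts N further along
      FROM pieces
      WHERE LastX - FirstX > 100000
    )
    SELECT FirstX,
           LEAST(FirstX + 100000, LastX) AS LastX,  -- cap each piece at N
           P, Y, Z
    FROM pieces;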

    For the data prepared that way, your query will always be the same:

    SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - N AND LastX >= ?
    

    and will always be equally effective.

    Now, how to choose the best value for N? I would run experiments with different values and see which performs best. It is quite possible that the optimum is smaller than the current maximum interval length of 4200000. At first this may be surprising, because decreasing N makes the table grow, so it can become much larger than 4.3 million rows. But in fact a huge table is not a problem as long as your query uses the index well, and in this case, the smaller N is, the more efficiently the index is used.

  • 2021-01-11 17:04

    You need to add another index on LastX.

    The unique index FirstLastXPriority_Index (FirstX, LastX, P) is ordered by FirstX first, so it cannot be used to seek on the 'AND LastX >= ?' part of your WHERE clause.
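
    A minimal sketch of the suggested index (the index name is illustrative):

    ALTER TABLE SomeTable ADD INDEX LastX_Index (LastX);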

  • 2021-01-11 17:06

    WHERE col1 < ... AND ... < col2 is virtually impossible to optimize.

    Any useful query will involve a "range" on either col1 or col2. Two ranges (on two different columns) cannot be used in a single INDEX.

    Therefore, any index you try risks checking a large part of the table: INDEX(col1, ...) will scan from the start of the index to where col1 hits the bound. Similarly, an index starting with col2 scans from the bound to the end.

    To add to your woes, the ranges are overlapping. So, you can't pull a fast one and add ORDER BY ... LIMIT 1 to stop quickly. And if you say LIMIT 10, but there are only 9, it won't stop until the start/end of the table.

    One simple thing you can do (but it won't speed things up by much) is to swap the PRIMARY KEY and the UNIQUE. This could help because InnoDB "clusters" the PK with the data.

    If the ranges did not overlap, I would point you at http://mysql.rjweb.org/doc.php/ipranges .

    So, what can be done?? How "even" and "small" are the ranges? If they are reasonably 'nice', then the following would take some code, but should be a lot faster. (In your example, 100000 500000 is pretty ugly, as you will see in a minute.)

    Define buckets to be, say, floor(number/100). Then build a table that correlates buckets and ranges. Samples:

    FirstX  LastX  Bucket
    123411  123488  1234
    222222  222444  2222
    222222  222444  2223
    222222  222444  2224
    222411  222477  2224
    

    Notice how some ranges 'belong' to multiple buckets.

    Then, the search is first on the bucket(s) in the query, then on the details. Looking for X=222433 would find two rows with bucket=2224, then decide that both are OK. But for X=222466, two rows have the bucket, but only one matches with firstX and lastX.

    WHERE bucket = FLOOR(X/100)
      AND firstX <= X
      AND X <= lastX
    

    with

    INDEX(bucket, firstX)
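
    A sketch of the helper table and the lookup, assuming FLOOR(X/100) buckets as above (the table name RangeBuckets is illustrative):

    CREATE TABLE RangeBuckets (
      bucket INT NOT NULL,
      FirstX INT NOT NULL,
      LastX  INT NOT NULL,
      P      INT NOT NULL,
      Y      INT NOT NULL,
      Z      INT NOT NULL,
      INDEX (bucket, FirstX)
    ) ENGINE=InnoDB;

    -- Look up X = 222433: the bucket narrows the search, FirstX/LastX verify it.
    SELECT P, Y, Z
    FROM RangeBuckets
    WHERE bucket = FLOOR(222433 / 100)
      AND FirstX <= 222433
      AND 222433 <= LastX;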
    

    But... with 100000 500000, there would be 4001 rows because this range is in that many 'buckets'.

    Plan B (to tackle the wide ranges)

    Segregate the ranges into wide and narrow. Handle the wide ranges with a simple table scan and the narrow ranges via my bucket method, then UNION ALL the results together. Hopefully the "wide" table would be much smaller than the "narrow" table.
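
    A sketch of the Plan B query shape, reusing the bucketed table above and a hypothetical WideRanges table that holds only the wide ranges:

    SELECT P, Y, Z
    FROM RangeBuckets                 -- narrow ranges: seek via the bucket index
    WHERE bucket = FLOOR(222433 / 100)
      AND FirstX <= 222433 AND 222433 <= LastX
    UNION ALL
    SELECT P, Y, Z
    FROM WideRanges                   -- wide ranges: small full scan
    WHERE FirstX <= 222433 AND 222433 <= LastX;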

  • 2021-01-11 17:06

    Indexes will not help you in this scenario, except for a small percentage of all possible values of X.

    Let's say, for example, that:

    • FirstX contains values from 1 to 1000 evenly distributed
    • LastX contains values from 1 to 1042 evenly distributed

    And you have the following indexes:

    1. FirstX, LastX, <covering columns>
    2. LastX, FirstX, <covering columns>

    Now:

    • If X is 50 the clause FirstX <= 50 matches approximately 5% rows while LastX >= 50 matches approximately 95% rows. MySQL will use the first index.

    • If X is 990 the clause FirstX <= 990 matches approximately 99% rows while LastX >= 990 matches approximately 5% rows. MySQL will use the second index.

    • Any X between these two will cause MySQL to not use either index (I don't know the exact threshold but 5% worked in my tests). Even if MySQL uses the index, there are just too many matches and the index will most likely be used for covering instead of seeking.
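
    (A quick way to see which index the optimizer picks for a particular X is EXPLAIN; a sketch using the question's table and the sample value 50 from the first bullet:)

    EXPLAIN SELECT P, Y, Z FROM SomeTable WHERE FirstX <= 50 AND LastX >= 50;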

    Your solution is the best. What you are doing is defining the upper and lower bounds of a "range" search:

    WHERE FirstX <= 500      -- 500 is the middle (worst case) value
    AND   FirstX >= 500 - 42 -- range matches approximately 4.3% rows
    AND   ...
    

    In theory, this should work even if you search FirstX for values in the middle. Having said that, you got lucky with the 4200000 value, possibly because the maximum difference between FirstX and LastX is a relatively small percentage of the whole range.


    If it helps, you can do the following after loading the data:

    ALTER TABLE testdata ADD COLUMN delta INT NOT NULL;
    UPDATE testdata SET delta = LastX - FirstX;
    ALTER TABLE testdata ADD INDEX delta (delta);
    

    This makes selecting MAX(LastX - FirstX) easier.
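
    For example, the bound used in the bounded query above can then be fetched cheaply (a sketch; with the delta index, MySQL can resolve this from the index alone):

    SELECT MAX(delta) FROM testdata;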


    I tested MySQL SPATIAL INDEXES, which could be used in this scenario. Unfortunately, I found that spatial indexes were slower and had many constraints.
