A SQL query searching for rows that satisfy Column1 <= X <= Column2 is very slow


I am using a MySQL DB, and have the following table:

CREATE TABLE SomeTable (
  PrimaryKeyCol BIGINT(20) NOT NULL,
  A BIGINT(20) NOT NULL,
  FirstX INT(11) N


                      
广开言路, 2021-01-11 17:10

Another approach is to precalculate the solutions, if that number isn't too big.

CREATE TABLE SomeTableLookUp (
    X INT NOT NULL,
    PrimaryKeyCol BIGINT NOT NULL,
    PRIMARY KEY(X, PrimaryKeyCol)
);


And now you just pre-populate your constant table.

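INSERT INTO SomeTableLookUp (X, PrimaryKeyCol)
SELECT XS.X, SomeTable.PrimaryKeyCol
FROM SomeTable
JOIN (
   -- XValues is an assumed table holding the X values you will search for;
   -- SomeTable itself has no X column.
   SELECT DISTINCT X FROM XValues
) XS
ON XS.X BETWEEN SomeTable.FirstX AND SomeTable.LastX;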


And now you can SELECT your answers directly.

SELECT SomeTable.*
FROM SomeTableLookUp
JOIN SomeTable
ON SomeTableLookUp.PrimaryKeyCol = SomeTable.PrimaryKeyCol
WHERE SomeTableLookUp.X = ?
LIMIT 10;
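
The precalculation idea can be tried end to end with a small sketch. It assumes an in-memory SQLite database standing in for MySQL, a trimmed-down SomeTable with only the columns needed, and a hypothetical XValues table holding the X values to look up; none of these details come from the original question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the schema is trimmed to the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SomeTable (
        PrimaryKeyCol INTEGER PRIMARY KEY,
        FirstX INTEGER NOT NULL,
        LastX  INTEGER NOT NULL
    );
    CREATE TABLE XValues (X INTEGER NOT NULL);  -- hypothetical: the X values to look up
    CREATE TABLE SomeTableLookUp (
        X INTEGER NOT NULL,
        PrimaryKeyCol INTEGER NOT NULL,
        PRIMARY KEY (X, PrimaryKeyCol)
    );
""")
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?)",
                 [(1, 10, 20), (2, 15, 30), (3, 40, 50)])
conn.executemany("INSERT INTO XValues VALUES (?)",
                 [(12,), (18,), (18,), (45,), (99,)])

# Pre-populate: one row per (X, matching interval) pair.
conn.execute("""
    INSERT INTO SomeTableLookUp (X, PrimaryKeyCol)
    SELECT XS.X, SomeTable.PrimaryKeyCol
    FROM SomeTable
    JOIN (SELECT DISTINCT X FROM XValues) XS
      ON XS.X BETWEEN SomeTable.FirstX AND SomeTable.LastX
""")

def lookup(x):
    """Return the primary keys of the intervals containing x, via the lookup table."""
    rows = conn.execute("""
        SELECT SomeTable.PrimaryKeyCol
        FROM SomeTableLookUp
        JOIN SomeTable
          ON SomeTableLookUp.PrimaryKeyCol = SomeTable.PrimaryKeyCol
        WHERE SomeTableLookUp.X = ?
        ORDER BY SomeTable.PrimaryKeyCol
        LIMIT 10
    """, (x,)).fetchall()
    return [row[0] for row in rows]
```

The trade-off is storage: the lookup table holds one row per (X, interval) match, so it only pays off when that number stays manageable.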

                                                                        
                                                        
            
            
              
                
不思量自难忘°, 2021-01-11 17:11

I found a solution that relies on properties of the data in the table. I would rather have a more general solution that doesn't depend on the current data, but for the time being that's the best I have.

The problem with the original query:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;


is that the execution may require scanning a large percentage of the entries in the FirstX,LastX,P index when the first condition FirstX <= ? is satisfied by a large percentage of the rows.

What I did to reduce the execution time is observe that LastX-FirstX is relatively small.

I ran the query:

SELECT MAX(LastX-FirstX) FROM SomeTable;


and got 4200000.

This means that FirstX >= LastX - 4200000 for all the rows in the table.

So in order to satisfy LastX >= ?, we must also satisfy FirstX >= ? - 4200000.

So we can add a condition to the query as follows:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - 4200000 AND LastX >= ? LIMIT 10;


In the example I tested in the question, the number of index entries processed was reduced from 2104820 to 18 and the running time was reduced from 0.563 seconds to 0.0003 seconds.

I tested the new query with the same 120000 values of X. The output was identical to the old query. The time went down from over 10 hours to 5.5 minutes, which is over 100 times faster.
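
The equivalence of the two queries can be checked on synthetic data. The sketch below assumes an in-memory SQLite database in place of MySQL, random intervals at most 500 wide, and an index named idx_first_last; all of these are illustrative choices, not the original data. LIMIT is left out so the two result sets can be compared exactly.

```python
import random
import sqlite3

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SomeTable (
        P INTEGER PRIMARY KEY,
        FirstX INTEGER NOT NULL,
        LastX  INTEGER NOT NULL
    )
""")
rows = []
for p in range(1000):
    first = random.randint(0, 100_000)
    rows.append((p, first, first + random.randint(0, 500)))  # narrow intervals
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_first_last ON SomeTable (FirstX, LastX)")

# Widest interval in the current data. It has to be recomputed (or replaced
# by a safe upper bound) whenever the data changes.
max_width = conn.execute("SELECT MAX(LastX - FirstX) FROM SomeTable").fetchone()[0]

def original(x):
    """The original two-sided condition."""
    return conn.execute(
        "SELECT P FROM SomeTable WHERE FirstX <= ? AND LastX >= ? ORDER BY P",
        (x, x),
    ).fetchall()

def bounded(x):
    """Same condition plus the lower bound on FirstX derived from max_width."""
    return conn.execute(
        "SELECT P FROM SomeTable "
        "WHERE FirstX <= ? AND FirstX >= ? - ? AND LastX >= ? ORDER BY P",
        (x, x, max_width, x),
    ).fetchall()

# Both queries should return identical rows for every probe value.
same = all(original(x) == bounded(x) for x in range(0, 100_501, 250))
```

The extra condition turns the open-ended scan over FirstX <= ? into a closed range of width max_width, which is what lets the index skip most entries.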
                                                                        
                                                        
            
            
              
                
忘了有多久, 2021-01-11 17:12

Edit: Idea #2

Do you have control over the Java app? Because, honestly, 0.3 seconds for an index scan is not bad. Your problem is that you're running that query 120,000 times and expecting a reasonable total time.

If you do have control over the Java app, you could have it submit all the X values at once, so that SQL doesn't have to do an index scan 120k times. Or you could even program the logic on the application side, since it would be relatively easy to optimize.
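
As a rough illustration of doing the matching in the application, here is a minimal sketch in Python rather than Java. It assumes the intervals fit in memory and have been loaded once as (first_x, last_x, key) tuples; the function names build_index and find_matches are made up for this example.

```python
from bisect import bisect_right

def build_index(intervals):
    """Sort (first_x, last_x, key) tuples by first_x and keep that column for bisecting."""
    ordered = sorted(intervals)
    firsts = [first_x for first_x, _, _ in ordered]
    return firsts, ordered

def find_matches(index, x, limit=10):
    """Return up to `limit` keys whose interval satisfies first_x <= x <= last_x."""
    firsts, ordered = index
    end = bisect_right(firsts, x)  # ordered[:end] all have first_x <= x
    matches = []
    for i in range(end):
        _, last_x, key = ordered[i]
        if last_x >= x:
            matches.append(key)
            if len(matches) == limit:
                break
    return matches

index = build_index([(10, 20, "a"), (15, 30, "b"), (40, 50, "c")])
```

This still walks the prefix of intervals that start at or before x, so for large data a real implementation would probably want an interval tree or a similar structure. The point is only that the table is read once instead of 120,000 times.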

Original Idea:

Have you tried creating a Multiple-Column index?

The problem with having multiple indexes is that each index is only going to narrow it down to ~50% of the records - it has to then match those ~2 million rows of Index A against ~2 million rows of Index B.

Instead, if you get both columns in the same index, the SQL engine can first do a Seek operation to get to the start of the records, and then do a single Index Scan to get the list of records it needs.  No matching one index against another.

I'd suggest not making this the Clustered Index, though. The reason for that? You're not expecting many results, so matching the Index Scan's results against the table isn't going to be time consuming. Instead, you want to make the Index as small as possible, so that the Index Scan goes as fast as possible. Clustered Indexes are the table, so a Clustered Index is going to have the same Scan speed as the table itself. Along the same lines, you probably don't want any fields other than FirstX and LastX in your index. Make that Index as tiny as you can, so that the scan flies along.

Finally, like you're doing now, you're going to need to clue the engine in that you're not expecting a large set of data back from the search. You want to make sure it's using that compact Index for its scan (instead of it saying, "Eh, I'd be better off just doing a full table scan.").
                                                                        
                                                        
            
            
              
                
情深已故, 2021-01-11 17:12

Suppose you got the execution time down to 0.1 seconds. Would the resulting 3 hours, twenty minutes be acceptable?

The simple fact is that making thousands of calls to the same query is incredibly inefficient. Quite aside from what the database has to endure, there is network traffic to think of, disk seek times and all kinds of processing overhead.

Supposing that you don't already have the 120,000 values for x in a table, that's where I would start. I would insert them into a table in batches of 500 or so at a time:

insert into xvalues (x)
select 14 union all
select 18 union all
select 42 /* and so on */;


Then, change your query to join to xvalues.

I reckon that optimisation alone will get your run-time down to minutes or seconds instead of hours (based on many such optimisations I have done through the years).
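
A minimal sketch of that approach, assuming an in-memory SQLite database in place of MySQL and a trimmed-down SomeTable (only P, FirstX and LastX). The helper insert_in_batches is a name invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SomeTable (
        P INTEGER PRIMARY KEY,
        FirstX INTEGER NOT NULL,
        LastX  INTEGER NOT NULL
    );
    CREATE TABLE xvalues (x INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?)",
                 [(1, 10, 20), (2, 15, 30), (3, 40, 50)])

def insert_in_batches(values, batch_size=500):
    """Load the x values into xvalues a batch at a time."""
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        conn.executemany("INSERT INTO xvalues (x) VALUES (?)",
                         [(v,) for v in batch])

insert_in_batches([14, 18, 42])

# One joined query replaces one round trip per x value.
matches = conn.execute("""
    SELECT xv.x, t.P
    FROM xvalues AS xv
    JOIN SomeTable AS t
      ON t.FirstX <= xv.x AND t.LastX >= xv.x
    ORDER BY xv.x, t.P
""").fetchall()
```

Note that the joined form returns every match for every x. The per-value LIMIT 10 of the original query would need extra work, for example a window function or trimming in the calling code.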

It also opens up the door for further optimisations. If the x values are likely to have at least some duplicates (say, at least 20% of values occur more than once) it may be worth investigating a solution where you only run the query for unique values and do the insert into SomeTable for every x with the matching value.
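
One reading of that suggestion is to run the lookup once per distinct x and reuse the result for every repeat. A small sketch of that idea, where run_for_unique and fake_query are hypothetical names and fake_query merely stands in for the real database call:

```python
def run_for_unique(x_values, run_query):
    """Call run_query once per distinct x and map the result back to every occurrence."""
    cache = {}
    for x in x_values:
        if x not in cache:
            cache[x] = run_query(x)
    return [(x, cache[x]) for x in x_values]

calls = []

def fake_query(x):
    """Hypothetical stand-in for the real database lookup; records each call."""
    calls.append(x)
    return x * 2

results = run_for_unique([14, 18, 14, 42, 18], fake_query)
```

With five input values but only three distinct ones, the stand-in query runs three times.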

As a rule: anything you can do in bulk is likely to far outperform anything you do row by row.

PS:

You referred to a query, but a stored procedure can also work with an input table. In some RDBMSs you can pass a table as a parameter. I don't think that works in MySQL, but you can create a temporary table that the calling code fills in and the stored procedure joins to. Or a permanent table used in the same way. The major drawback of not using a temp table is that you may need to concern yourself with session management or discarding stale data. Only you will know if that is applicable to your case.
                                                                        
                                                        
            
            
              
                
孤独总比滥情好, 2021-01-11 17:16

To optimize this query:

SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;

Here are two options you can use:

- descending indexes
- spatial indexes


Descending indexes:

One option is to use an index that is descending on FirstX and ascending on LastX (descending indexes are supported as of MySQL 8.0).

https://dev.mysql.com/doc/refman/8.0/en/descending-indexes.html

Something like:

CREATE INDEX SomeIndex ON SomeTable (FirstX DESC, LastX);

Alternatively, you could instead create the index (LastX, FirstX DESC).

Spatial indexes:

Another option is to use a SPATIAL INDEX on (FirstX, LastX). If you think of FirstX and LastX as 2D spatial coordinates, then your search selects the points in a contiguous region of that plane, delimited by the lines FirstX <= X, LastX >= X, FirstX <= LastX and FirstX >= 0.

Here's a link on spatial indexes (not specific to MySQL, but with drawings):

https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview
                                                                        
                                                        
            
            
              
                
萌比男神i, 2021-01-11 17:18

One way might be to partition the table by ranges and then query only the partition a value falls into, which makes the amount of data to check much smaller. This might not work, since the Java side may become slower, but it might put less stress on the database.
There might also be a way to avoid querying the database so many times by using a more inclusive SQL statement (you might be able to send a list of values and have the SQL send it to a different table).
                                                                        
                                                        
            
            
              
                