Does SQLite optimize a query with multiple AND conditions in the WHERE clause?

Backend · open · 4 answers · 2116 views
礼貌的吻别 · asked 2021-01-14 05:41

In SQL databases (I use Python+Sqlite), how to make sure that, if we have 1 million rows, the query

    SELECT * FROM mytable WHERE myfunction(description) < 500 AND column2 < 1000

is optimized so that the 1st condition (CPU-expensive) is only tested if the easy-to-test second condition is already True?

4 Answers
  • 2021-01-14 06:21

    One way that you can force the order of evaluation is to use a CASE expression. In general, SQL optimizers are free to rearrange operations; the one exception is CASE, whose branches are guaranteed to be evaluated in order.

    SELECT *
    FROM mytable
    WHERE (CASE WHEN column2 >= 1000  OR column2 IS NULL THEN 0
                WHEN myfunction(description) < 500 THEN 1
           END) = 1;
    

    Generally, CASE expressions are discouraged in WHERE clauses, one major reason being that they impede optimization. In this case, that is exactly what we want.
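    A quick way to verify the short-circuit is to count UDF calls from Python. The sketch below is hypothetical (an in-memory table with integer stand-ins for the description column), not the OP's actual data:

```python
import sqlite3

calls = 0

def myfunc(x):
    global calls
    calls += 1          # count how often the "expensive" UDF actually runs
    return x

db = sqlite3.connect(':memory:')
db.create_function("myfunction", 1, myfunc)
c = db.cursor()
c.execute('CREATE TABLE mytable (description INTEGER, column2 INTEGER)')
c.executemany('INSERT INTO mytable VALUES (?, ?)',
              ((i, i) for i in range(5000)))

# Rows with column2 >= 1000 hit the first WHEN branch and never
# reach the branch that calls myfunction.
c.execute("""SELECT COUNT(*) FROM mytable
             WHERE (CASE WHEN column2 >= 1000 OR column2 IS NULL THEN 0
                         WHEN myfunction(description) < 500 THEN 1
                    END) = 1""")
matched = c.fetchone()[0]
print(matched, calls)   # the UDF should run only for rows with column2 < 1000
db.close()
```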

  • 2021-01-14 06:26

    SQLite will happily reorder AND-connected expressions whenever it feels like it. So while rewriting the query to check column2 first appears to work in the current version, there is no guarantee.

    The query optimizer assumes that speed is determined mainly by disk I/O, so it estimates the cost of both conditions to be the same. Cost estimates are influenced by indexes, and by ANALYZE statistics (which work only for indexed data). So the easiest way to speed up this query (and probably most other queries you will use) is to create an index on column2:

    CREATE INDEX my_little_index ON mytable(column2);
    
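    To confirm that the index is actually picked up, you can run EXPLAIN QUERY PLAN from Python. A minimal sketch (in-memory database, column types assumed from the question):

```python
import sqlite3

cnxn = sqlite3.connect(':memory:')
crsr = cnxn.cursor()
crsr.execute('CREATE TABLE mytable (description TEXT, column2 INTEGER)')
crsr.execute('CREATE INDEX my_little_index ON mytable(column2)')

crsr.execute('EXPLAIN QUERY PLAN '
             'SELECT * FROM mytable WHERE column2 < 1000')
plan = crsr.fetchall()
for row in plan:
    print(row)   # the detail text should show a SEARCH using my_little_index
cnxn.close()
```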

    If you do not want to use an index for some reason, you have to use a construct that the query optimizer cannot optimize away. A CASE expression as shown in Gordon's answer would work just fine. In the general case, move the first condition into a subquery and prevent subquery flattening by violating one of the flattening rules listed in the SQLite query optimizer documentation; adding a dummy LIMIT clause to both queries is usually the easiest:

    SELECT *
    FROM (SELECT *
          FROM mytable
          WHERE column2 < 1000
          LIMIT -1)
    WHERE myfunction(description) < 500
    LIMIT -1;
    
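    As a sanity check that the dummy LIMIT really shields the expensive condition, one can again count UDF calls (a hypothetical in-memory setup, with a trivial Python function standing in for myfunction):

```python
import sqlite3

calls = 0

def myfunc(x):
    global calls
    calls += 1          # count UDF invocations
    return x

db = sqlite3.connect(':memory:')
db.create_function("myfunction", 1, myfunc)
c = db.cursor()
c.execute('CREATE TABLE mytable (description INTEGER, column2 INTEGER)')
c.executemany('INSERT INTO mytable VALUES (?, ?)',
              ((i, i) for i in range(10000)))

# The LIMIT -1 clauses block subquery flattening, so the outer WHERE
# only sees the rows that survive column2 < 1000.
c.execute("""SELECT COUNT(*)
             FROM (SELECT *
                   FROM mytable
                   WHERE column2 < 1000
                   LIMIT -1)
             WHERE myfunction(description) < 500
             LIMIT -1""")
n = c.fetchone()[0]
print(n, calls)   # calls should stay near 1000, not 10000
db.close()
```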
  • 2021-01-14 06:27

    (Updated answer based on comments and subsequent testing.)

    The actual answer to your question

    how to make sure that, if we have 1 million rows, the query ... is optimized so that the 1st condition (CPU-expensive) is only tested if the easy-to-test second condition is already True?

    depends on

    • the actual conditions in the WHERE clause, and
    • how clever the SQLite query optimizer is in estimating the cost of those conditions.

    A simple test should tell you whether your query would be sufficiently "optimized" for your needs. The good news is that SQLite will perform the easy (inexpensive) condition first, at least under certain circumstances.

    For a test table "mytable"

    CREATE TABLE mytable (
        description TEXT(50) NOT NULL,
        column2 INTEGER NOT NULL,
        CONSTRAINT mytable_PK PRIMARY KEY (column2)
    );
    

    containing a million rows

    description  column2
    -----------  -------
    row000000          0
    row000001          1
    row000002          2
    ...
    row999999     999999
    

    the Python test code

    import sqlite3
    import time
    
    log_file_spec = r'C:\Users\Gord\Desktop\log_file.txt'
    
    def myfunc(thing):
        with open(log_file_spec, 'a') as log:
            log.write('HODOR\n')
        return int(thing[-6:])
    
    
    with open(log_file_spec, 'w'):
        pass  # just empty the file
    cnxn = sqlite3.connect(r'C:\__tmp\SQLite\test.sqlite')
    cnxn.create_function("myfunction", 1, myfunc)
    crsr = cnxn.cursor()
    t0 = time.time()
    sql = """\
    SELECT COUNT(*) AS n FROM mytable
    WHERE myfunction(description) < 500 AND column2 < 1000
    """
    crsr.execute(sql)
    num_rows = crsr.fetchone()[0]
    print(f"{num_rows} rows found in {(time.time() - t0):.1f} seconds")
    
    cnxn.close()
    

    returns

    500 rows found in 1.2 seconds
    

    and counting the lines in log_file.txt we see

    C:\Users\Gord>find /C "HODOR" Desktop\log_file.txt
    
    ---------- DESKTOP\LOG_FILE.TXT: 1000
    

    indicating that our function was called only one thousand times, not one million times. SQLite has clearly applied the column2 < 1000 condition first, and then applied the myfunction(description) < 500 condition to the subset of rows that passed it. (Note that column2 is the INTEGER PRIMARY KEY here, so the first condition can be satisfied with a cheap rowid range scan.)


    (Original "off the cuff" answer.)

    The actual answer to your question depends on how clever the query optimizer is. A simple test should tell you whether your query would be sufficiently "optimized" for your needs.

    However, you do have a couple of options if your tests find that your original approach is too slow:

    Option 1: Try doing the simple comparison "first"

    Changing the order might affect the query plan, e.g.

    ... WHERE <easy_condition> AND <expensive_condition>
    

    might turn out to be faster than

    ... WHERE <expensive_condition> AND <easy_condition> 
    

    Option 2: Try forcing the order using a subquery

    Again, depending on the cleverness of the query optimizer

    SELECT easy.* 
    FROM 
        (SELECT * FROM mytable WHERE column2 < 1000) easy
    WHERE myfunction(easy.description) < 500
    

    might apply the inexpensive condition first, then apply the expensive condition on the resulting subset of rows. (However, a comment indicates that SQLite is too sophisticated to fall for that ploy.)

  • 2021-01-14 06:29

    Inspired by @GordThompson's answer, here is a benchmark between:

    (1)  SELECT * FROM mytable WHERE col2 < 1000 AND myfunction(col1) < 500
    

    vs.

    (2)  SELECT * FROM mytable WHERE myfunction(col1) < 500 AND col2 < 1000
    

    Test (1) (easy-to-test condition first): 1.02 seconds

    import sqlite3, time, random
    
    def myfunc(x):
        time.sleep(0.001) # wait 1 millisecond for each call of this function
        return x
    
    # Create database
    db = sqlite3.connect(':memory:')
    db.create_function("myfunction", 1, myfunc)
    c = db.cursor()
    c.execute('CREATE TABLE mytable (col1 INTEGER, col2 INTEGER)')
    for i in range(10*1000):
        a = random.randint(0, 1000)
        c.execute('INSERT INTO mytable VALUES (?, ?)', (a, i))
    
    # Do the evil query
    t0 = time.time()
    c.execute('SELECT * FROM mytable WHERE col2 < 1000 AND myfunction(col1) < 500')
    for e in c.fetchall():
        print(e)
    print("Elapsed time: %.2f" % (time.time() - t0))
    

    Result: 1.02 seconds, which means myfunc was called at most ~1000 times, i.e. not for all of the 10k rows.


    Test (2) (slow-to-compute condition first): 10.05 seconds

    Same code, but with:

    c.execute('SELECT * FROM mytable WHERE myfunction(col1) < 500 AND col2 < 1000')
    

    instead.

    Result: 10.05 seconds, which means myfunc was called ~10k times, i.e. for all 10k rows, even those for which the condition col2 < 1000 is false.


    Global conclusion: SQLite evaluates AND conditions lazily from left to right (at least in these tests), so the easy condition should be written first, like this:

    ... WHERE <easy_condition> AND <expensive_condition>
    