Fastest way to load numeric data into python/pandas/numpy array from MySQL

前端未结

关注

 2  1530

I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.

MATLAB reference:

tic;
feature accel off;


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  星月不相逢        
                
              
                            
                2021-01-31 22:54
              
            
            
                                                                       
Also check out this way of doing things using the turbodbc package. To transform your result set into an OrderedDict of NumPy arrays, just do this:

import turbodbc
connection = turbodbc.connect(dsn="My data source name")
cursor = connection.cursor()
cursor.execute("SELECT 42")
results = cursor.fetchallnumpy()


Transforming these results to a dataset should require a few additional milliseconds. I don't know the speedup for MySQL, but I have seen factor 10 for other databases.

The speedup is mainly achieved by using bulk operations instead of row-wise operations.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  抹茶落季        
                
              
                            
                2021-01-31 22:58
              
            
            
                                                                       
The "problem" seems to have been the type conversion which occurs from MySQL's decimal type to python's decimal.Decimal that MySQLdb, pymysql and pyodbc does on the data. By changing the converters.py file (at the very last lines) in MySQLdb to have:

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float


instead of decimal.Decimal seems to completely solve the problem and now the following code:

import MySQLdb
import numpy
import time

t = time.time()
conn = MySQLdb.connect(host='',...)
curs = conn.cursor()
curs.execute("select x,y from TABLENAME")
data = numpy.array(curs.fetchall(),dtype=float)
print(time.time()-t)


Runs in less than a second!
What is funny, decimal.Decimal never appeared to be the problem in the profiler.

Similar solution should work in pymysql package. pyodbc is more tricky: it is all written in C++, hence you would have to recompile the entire package.

UPDATE

Here is a solution not requiring to modify the MySQLdb source code:
Python MySQLdb returns datetime.date and decimal
The solution then to load numeric data into pandas:

import MySQLdb
import pandas.io.sql as psql
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='',user='',passwd='',db='')
sql = "select * from NUMERICTABLE"
df = psql.read_frame(sql, conn)


Beats MATLAB by a factor of ~4 in loading 200k x 9 table!
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复