Fastest way to load numeric data into python/pandas/numpy array from MySQL

囚心锁ツ 2021-01-31 22:22

I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.

MATLAB reference:

tic;
feature accel off;         


        
2 Answers
  •  抹茶落季
    2021-01-31 22:58

    The "problem" seems to have been the type conversion from MySQL's decimal type to Python's decimal.Decimal, which MySQLdb, pymysql and pyodbc all apply to the data. Changing the last lines of the converters.py file in MySQLdb to:

    conversions[FIELD_TYPE.DECIMAL] = float
    conversions[FIELD_TYPE.NEWDECIMAL] = float
    

    instead of decimal.Decimal seems to solve the problem completely, and the following code:

    import MySQLdb
    import numpy
    import time
    
    t = time.time()
    conn = MySQLdb.connect(host='',...)
    curs = conn.cursor()
    curs.execute("select x,y from TABLENAME")
    data = numpy.array(curs.fetchall(),dtype=float)
    print(time.time()-t)
    

    runs in less than a second! Oddly, decimal.Decimal never showed up as the problem in the profiler.
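The cost of the per-cell conversion can be reproduced without a database. The following is only a rough sketch: it assumes the driver receives each cell as text and applies a converter to it, and it mimics that step in pure Python. The function name `load`, the row count and the generated values are made up for illustration, and real drivers do part of this work in C, so absolute timings will differ.

```python
import time
from decimal import Decimal

import numpy as np

N_ROWS = 200_000  # roughly the table size mentioned in the question

# Hypothetical raw result set: every cell arrives as text, two columns per row
raw_rows = [(str(i / 4), str(i / 8)) for i in range(N_ROWS)]


def load(rows, converter):
    """Apply `converter` to each text cell, then build a float64 array.

    Returns the array and the elapsed time in seconds.
    """
    start = time.perf_counter()
    converted = [tuple(converter(cell) for cell in row) for row in rows]
    arr = np.array(converted, dtype=float)
    return arr, time.perf_counter() - start


from_decimal, t_decimal = load(raw_rows, Decimal)  # default driver behaviour
from_float, t_float = load(raw_rows, float)        # after patching the converters

print(f"Decimal converter: {t_decimal:.3f} s")
print(f"float converter:   {t_float:.3f} s")
```

Both calls produce the same float64 array; only the intermediate Python objects differ, which is where the time goes.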

    A similar solution should work for the pymysql package. pyodbc is trickier: it is written entirely in C++, so you would have to recompile the whole package.
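For pymysql, one possible approach is to copy the default converter mapping, override the two DECIMAL entries, and pass the copy to the connection instead of editing the package. The sketch below uses a hypothetical helper, `with_float_decimals`, and a stand-in `defaults` mapping so that it runs without a database driver; the pymysql usage is shown in comments and has not been benchmarked here.

```python
from decimal import Decimal

# MySQL protocol type codes; FIELD_TYPE.DECIMAL and FIELD_TYPE.NEWDECIMAL
# carry these values in the drivers' constants modules.
DECIMAL = 0
NEWDECIMAL = 246


def with_float_decimals(base_converters):
    """Return a copy of a converter mapping that decodes DECIMAL columns as float."""
    conv = dict(base_converters)  # leave the driver's own mapping untouched
    conv[DECIMAL] = float
    conv[NEWDECIMAL] = float
    return conv


# Stand-in for a driver's default mapping (illustrative, not the real table)
defaults = {DECIMAL: Decimal, NEWDECIMAL: Decimal}
patched = with_float_decimals(defaults)

# With pymysql installed, the intended use would be along these lines:
#   import pymysql
#   from pymysql.converters import conversions
#   conn = pymysql.connect(host='', user='', passwd='', db='',
#                          conv=with_float_decimals(conversions))
```

Working on a copy keeps the change local to one connection, so other code in the same process still receives decimal.Decimal values.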

    UPDATE

    Here is a solution that does not require modifying the MySQLdb source code, taken from the question "Python MySQLdb returns datetime.date and decimal". To load numeric data into pandas:

    import MySQLdb
    import pandas.io.sql as psql
    from MySQLdb.converters import conversions
    from MySQLdb.constants import FIELD_TYPE
    
    conversions[FIELD_TYPE.DECIMAL] = float
    conversions[FIELD_TYPE.NEWDECIMAL] = float
    conn = MySQLdb.connect(host='',user='',passwd='',db='')
    sql = "select * from NUMERICTABLE"
    # read_frame has been removed from pandas; read_sql is its replacement
    df = psql.read_sql(sql, conn)
    

    This beats MATLAB by a factor of ~4 when loading a 200k x 9 table!
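A quick way to check whether the converter change took effect is to look at the column dtypes of the resulting DataFrame. The sketch below uses made-up rows in place of a real query result; the column names `x` and `y` are placeholders. It also shows `astype(float)` as a possible fallback when the driver cannot be reconfigured, although that converts after the Decimal objects have already been created and so does not avoid their cost.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical rows, as a driver might return them without and with the float converters
decimal_rows = [(Decimal("1.5"), Decimal("2.25")), (Decimal("3.0"), Decimal("4.75"))]
float_rows = [(1.5, 2.25), (3.0, 4.75)]

df_decimal = pd.DataFrame(decimal_rows, columns=["x", "y"])
df_float = pd.DataFrame(float_rows, columns=["x", "y"])

print(df_decimal.dtypes)  # object columns: the cells are still Decimal instances
print(df_float.dtypes)    # float64 columns

# Fallback: convert an already-loaded frame to float64
df_fixed = df_decimal.astype(float)
```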
