Improve pandas' to_sql() performance with SQL Server

Question


I'm coming to you because I cannot fix an issue with the pandas.DataFrame.to_sql() method.

I've set up the connection between my script and my database and I can send queries, but it's too slow for me.

I would like to find a way to improve my script's performance here. Maybe someone has a solution?

Here is my code:

  import sqlalchemy

  engine = sqlalchemy.create_engine(con['sql']['connexion_string'])
  conn = engine.connect()
  metadata = sqlalchemy.MetaData()  # the class is MetaData, not Metadata
  try:
      if con['sql']['strategy'] == 'NEW':
          # empty the target table before reloading it
          query = sqlalchemy.Table(con['sql']['table'], metadata).delete()
          conn.execute(query)
          Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                               index=False, chunksize=1000, method='multi')
      elif con['sql']['strategy'] == 'APPEND':
          Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                               index=False, chunksize=1000, method='multi')
      else:
          pass
  except Exception as e:
      print(type(e))

It works, but is too slow, when I remove the chunksize and method parameters (almost 3 minutes for 30 thousand rows). When I put those parameters back, I get a sqlalchemy.exc.ProgrammingError...
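A likely cause of that ProgrammingError, for what it's worth: with method='multi', pandas emits one bind parameter per cell, and SQL Server rejects statements with more than 2100 parameters, so chunksize=1000 overflows the cap as soon as the table has three or more columns. A minimal sketch of a conservative chunk size, reusing Sql_to_deploy and engine from the question:

  # SQL Server allows at most 2100 parameters per statement; with
  # method='multi' each row in a chunk consumes one parameter per column.
  max_params = 2100
  safe_chunksize = max(1, max_params // len(Sql_to_deploy.columns) - 1)

  Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                       index=False, chunksize=safe_chunksize, method='multi')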

Thanks for your help!


Answer 1:


For mssql+pyodbc you will get the best performance from to_sql if you

  1. use Microsoft's ODBC Driver for SQL Server, and
  2. enable fast_executemany=True in your create_engine call.

For example, this code runs in just over 3 seconds on my network:

from time import time
import pandas as pd
import sqlalchemy as sa

# local engine, used only to pull the sample rows
ngn_local = sa.create_engine("mssql+pyodbc://mssqlLocal64")
# remote engine with fast_executemany enabled
ngn_remote = sa.create_engine(
    (
        "mssql+pyodbc://sa:_whatever_@192.168.0.199/mydb"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    ),
    fast_executemany=True,
)

df = pd.read_sql_query(
    "SELECT * FROM MillionRows WHERE ID <= 30000", ngn_local
)

# time the upload over the network
t0 = time()
df.to_sql("pd_test", ngn_remote, index=False, if_exists="replace")
print(f"{time() - t0} seconds")

whereas with fast_executemany=False (which is the default) the same process takes 143 seconds (2.4 minutes).
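As a quick sanity check for point 1, pyodbc can list the ODBC drivers installed on the machine (assuming pyodbc itself is installed):

import pyodbc

# Look for "ODBC Driver 17 for SQL Server" (or newer) in this list
print(pyodbc.drivers())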




Answer 2:


I synthesised a dataframe with 36k rows. It always inserts in < 1.5 s. A dumb select with an expensive WHERE clause and a GROUP BY gets marginally slower as the table grows, but always runs in < 0.5 s.

  1. No indexes, so inserts are fast
  2. No indexes to help selects
  3. Running on MariaDB inside a Docker container on my laptop, so not at all optimised
  4. All defaults, which is performing well

More information

  1. What indexes do you have on your table? Rule of thumb: fewer is better. More indexes, slower inserts.
  2. What timings do you see from this synthesised case?

import numpy as np
import pandas as pd
import random, time
import sqlalchemy

# build every (year, month, stock) combination, then attach a random Sharpe ratio
a = np.array(np.meshgrid([2018, 2019, 2020], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                         [f"Stock {i+1}" for i in range(1000)],
                         )).reshape(3, -1)
a = [a[0], a[1], a[2], [round(random.uniform(-1, 2.5), 1) for e in a[0]]]
df1 = pd.DataFrame({"Year": a[0], "Month": a[1], "Stock": a[2], "Sharpe": a[3]})


temptable = "tempx"

engine = sqlalchemy.create_engine('mysql+pymysql://sniffer:sniffer@127.0.0.1/sniffer')
conn = engine.connect()
try:
#     conn.execute(f"drop table {temptable}")
    pass
except sqlalchemy.exc.OperationalError:
    pass  # ignore drop error if table does not exist

# time the insert
start = time.time()
df1.to_sql(name=temptable, con=engine, index=False, if_exists='append')
curr = conn.execute(f"select count(*) as c from {temptable}")
res = [{curr.keys()[i]: v for i, v in enumerate(t)} for t in curr.fetchall()]
print(f"Time: {time.time()-start:.2f}s database count:{res[0]['c']}, dataframe count:{len(df1)}")
curr.close()

# time a select with an expensive WHERE clause and a GROUP BY
start = time.time()
curr = conn.execute(f"""select Year, count(*) as c
                        from {temptable}
                        where Month=1
                        and Sharpe between 1 and 2
                        and stock like '%%2%%'
                        group by Year""")
res = [{curr.keys()[i]: v for i, v in enumerate(t)} for t in curr.fetchall()]
print(f"Time: {time.time()-start:.2f}s database result:{res} {curr.keys()}")
curr.close()
conn.close()

Output:

Time: 1.23s database count:360000, dataframe count:36000
Time: 0.27s database result:[{'Year': '2018', 'c': 839}, {'Year': '2019', 'c': 853}, {'Year': '2020', 'c': 882}] ['Year', 'c']
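
(The database count is ten times the dataframe count presumably because the append-only table had accumulated rows from earlier runs of the same test, which is also why the select was observed to get marginally slower as the table grew.)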


Source: https://stackoverflow.com/questions/63171038/improve-pandas-to-sql-performance-with-sql-server
