Question
I come to you because I cannot fix an issue with the pandas.DataFrame.to_sql() method.
I've established the connection between my script and my database and I can send queries, but it is too slow for me.
I would like to find a way to improve my script's performance on this. Maybe someone can find a solution?
Here is my code:
engine = sqlalchemy.create_engine(con['sql']['connexion_string'])
conn = engine.connect()
metadata = sqlalchemy.MetaData()
try:
    if con['sql']['strategy'] == 'NEW':
        query = sqlalchemy.Table(con['sql']['table'], metadata).delete()
        conn.execute(query)
        Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append', index=False, chunksize=1000, method='multi')
    elif con['sql']['strategy'] == 'APPEND':
        Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append', index=False, chunksize=1000, method='multi')
    else:
        pass
except Exception as e:
    print(type(e))
It works, but it is too slow when I remove the chunksize and method parameters (almost 3 minutes for 30 thousand rows). When I put these parameters back, I get an sqlalchemy.exc.ProgrammingError...
Thanks for your help!
Answer 1:
For mssql+pyodbc, you will get the best performance from to_sql if you:
- use Microsoft's ODBC Driver for SQL Server, and
- enable fast_executemany=True in your create_engine call.
For example, this code runs in just over 3 seconds on my network:
from time import time
import pandas as pd
import sqlalchemy as sa

# local engine used only to fetch the test data
ngn_local = sa.create_engine("mssql+pyodbc://mssqlLocal64")
# remote engine that the timed insert goes through
ngn_remote = sa.create_engine(
    (
        "mssql+pyodbc://sa:_whatever_@192.168.0.199/mydb"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    ),
    fast_executemany=True,
)

df = pd.read_sql_query(
    "SELECT * FROM MillionRows WHERE ID <= 30000", ngn_local
)

t0 = time()
df.to_sql("pd_test", ngn_remote, index=False, if_exists="replace")
print(f"{time() - t0} seconds")
whereas with fast_executemany=False (which is the default) the same process takes 143 seconds (2.4 minutes).
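As for the ProgrammingError in the question: the usual cause is that SQL Server accepts at most 2100 parameters per statement, and method='multi' binds one parameter per cell, so chunksize=1000 overflows that limit as soon as the table has more than two columns. If you do want to keep method='multi', here is a minimal sketch of capping the chunk size, reusing the hypothetical connection string above and a made-up three-column frame:

import pandas as pd
import sqlalchemy as sa

# hypothetical connection string from the example above
engine = sa.create_engine(
    "mssql+pyodbc://sa:_whatever_@192.168.0.199/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# made-up stand-in for the 30k-row dataframe from the question
df = pd.DataFrame({"a": range(30000), "b": range(30000), "c": range(30000)})

# method='multi' binds one parameter per cell, and SQL Server allows at
# most 2100 parameters per statement, so cap each chunk at
# floor(2100 / n_columns) rows, minus one as a safety margin
safe_chunksize = 2100 // len(df.columns) - 1

df.to_sql("pd_test", engine, index=False, if_exists="replace",
          chunksize=safe_chunksize, method="multi")

That said, with fast_executemany=True you can usually drop method='multi' entirely: the driver batches the default row-by-row executemany, which avoids the parameter limit altogether.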
Answer 2:
I synthesised a dataframe with 36k rows. It always inserts in < 1.5 s. I also ran a dumb SELECT with an expensive WHERE clause and a GROUP BY, which gets marginally slower as the table grows but always stays < 0.5 s.
- no indexes, so inserts are fast
- no indexes to help selects
- running on MariaDB inside a Docker container on my laptop, so not at all optimised
- all defaults, which is performing well
More information
- what indexes do you have on your table? Rule of thumb: fewer is better; more indexes mean slower inserts (see the sketch after the output below)
- what timings do you see from this synthesised case?
import numpy as np
import pandas as pd
import random, time
import sqlalchemy

# synthesise a 36k-row dataframe: 3 years x 12 months x 1000 stocks
a = np.array(np.meshgrid([2018, 2019, 2020], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                         [f"Stock {i+1}" for i in range(1000)],
                         )).reshape(3, -1)
a = [a[0], a[1], a[2], [round(random.uniform(-1, 2.5), 1) for e in a[0]]]
df1 = pd.DataFrame({"Year": a[0], "Month": a[1], "Stock": a[2], "Sharpe": a[3], })

temptable = "tempx"

engine = sqlalchemy.create_engine('mysql+pymysql://sniffer:sniffer@127.0.0.1/sniffer')
conn = engine.connect()
try:
    # conn.execute(f"drop table {temptable}")
    pass
except sqlalchemy.exc.OperationalError:
    pass  # ignore drop error if table does not exist

# timed insert
start = time.time()
df1.to_sql(name=temptable, con=engine, index=False, if_exists='append')
curr = conn.execute(f"select count(*) as c from {temptable}")
res = [{curr.keys()[i]: v for i, v in enumerate(t)} for t in curr.fetchall()]
print(f"Time: {time.time()-start:.2f}s database count:{res[0]['c']}, dataframe count:{len(df1)}")
curr.close()

# timed select with an expensive where clause and a group by
start = time.time()
curr = conn.execute(f"""select Year, count(*) as c
                        from {temptable}
                        where Month=1
                        and Sharpe between 1 and 2
                        and Stock like '%%2%%'
                        group by Year""")
res = [{curr.keys()[i]: v for i, v in enumerate(t)} for t in curr.fetchall()]
print(f"Time: {time.time()-start:.2f}s database result:{res} {curr.keys()}")
curr.close()
conn.close()
Output:
Time: 1.23s database count:360000, dataframe count:36000
Time: 0.27s database result:[{'Year': '2018', 'c': 839}, {'Year': '2019', 'c': 853}, {'Year': '2020', 'c': 882}] ['Year', 'c']
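To illustrate the index rule of thumb above: if the target table did carry secondary indexes, a common pattern is to drop them before the bulk insert and rebuild them once afterwards, so the load does not have to maintain the index row by row. A minimal sketch, assuming a hypothetical secondary index idx_year on tempx and reusing the connection string from the listing above:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://sniffer:sniffer@127.0.0.1/sniffer')

# tiny stand-in for the 36k-row df1 built in the listing above
df1 = pd.DataFrame({"Year": ["2018"], "Month": ["1"], "Stock": ["Stock 1"], "Sharpe": [1.0]})

with engine.begin() as conn:
    # drop the hypothetical secondary index so the insert only maintains the base table
    conn.execute(sqlalchemy.text("DROP INDEX idx_year ON tempx"))

df1.to_sql("tempx", engine, index=False, if_exists="append")

with engine.begin() as conn:
    # rebuild the index once, after the load, instead of row by row
    conn.execute(sqlalchemy.text("CREATE INDEX idx_year ON tempx (Year)"))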
Source: https://stackoverflow.com/questions/63171038/improve-pandas-to-sql-performance-with-sql-server