Using Dask's NEW to_sql for improved efficiency (memory/speed) or alternative to get data from dask dataframe into SQL Server Table

Submitted by 喜夏-厌秋 on 2020-12-29 06:52:31

Question


My ultimate goal is to use SQL/Python together for a project with too much data for pandas to handle (at least on my machine). So, I have gone with dask to:

  1. read in data from multiple sources (mostly SQL Server Tables/Views)
  2. manipulate/merge the data into one large dask dataframe of ~10 million+ rows and 52 columns, some of which contain long unique strings
  3. write it back to SQL Server on a daily basis, so that my PowerBI report can automatically refresh the data.

Steps #1 and #2 take ~30 seconds combined to execute and use minimal memory (several SQL queries and ~200 lines of dask code manipulating a large dataset). Fast and Fun!!!
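
For reference, steps #1 and #2 boil down to something like the sketch below. The table names, index/merge columns, and connection string are placeholders, and dd.read_sql_table's argument names vary slightly between dask versions:

import dask.dataframe as dd

# placeholder connection string, in the same style as the engine used further down
uri = f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}'

# 1. lazily read each source table/view, partitioned on an indexed column (placeholder names)
calls = dd.read_sql_table('Calls', uri, index_col='CallID')
accounts = dd.read_sql_table('Accounts', uri, index_col='AccountID')

# 2. manipulate/merge into one large dask dataframe; nothing is computed yet
ddf = calls.merge(accounts, left_on='AccountID', right_index=True, how='left')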

But #3 above has been the main bottleneck. What are some efficient ways, in terms of (1) memory and (2) speed (time to execute), to accomplish #3 with dask or other alternatives? See some more background below, as well as what I have tried and some conclusions I have come to.


For #1, #2, and #3 above, this has been a task I found impossible with pandas due to memory limits and long execution times, but dask handled #1 and #2 with flying colors. I was still struggling with #3: getting the data back into a SQL table in an automated way, without writing to a .csv and then importing it into SQL Server. I tried .compute() to transform the dask dataframe into a pandas dataframe and then write it with to_sql, but that defeated the purpose of using dask for the read/data-modeling steps and was again running out of memory or taking forever to execute.
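
For context, that pandas-based attempt was essentially the sketch below (using the same placeholder connection details as the code further down); this is the version that ran out of memory or took all night:

import sqlalchemy as sa

# placeholder connection details, mirroring the engine built in the code below
engine = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}', fast_executemany=True)

pdf = ddf.compute()  # materializes the entire ~10M-row dask dataframe as a single in-memory pandas dataframe
pdf.to_sql('PowerBI_Report', con=engine, if_exists='replace', index=False)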

So, the new plan was to use to_csv to generate a new .csv daily and use a query to bulk insert the data into a table. I think this is still a viable solution; but, today, I was VERY happy to find out that dask released a new to_sql function (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_sql). Leveraging existing StackOverflow articles/blogs about this topic (e.g. from Francois Leblanc - https://leblancfg.com/benchmarks_writing_pandas_dataframe_SQL_Server.html), I tinkered with all of the parameters to find the most efficient combination that had the fastest time to execute (which matters A LOT when you are writing large datasets every single day for Reporting). This is what I found, which is similar to a lot of posts about pd.to_sql including Leblanc's:

import sqlalchemy as sa
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pbar = ProgressBar()
pbar.register()

# Windows authentication + fast_executemany=True
# server, database and driver_name are placeholders for your own connection details;
# ddf is the dask dataframe produced by steps #1 and #2 above
to_sql_uri = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}', fast_executemany=True)
# note: create_engine returns an engine object, which is passed here as to_sql's uri argument
ddf.to_sql('PowerBI_Report', uri=to_sql_uri, if_exists='replace', index=False)

Using any combination of the following non-default parameters slowed down the time-to-execute for my to_sql, once again in agreement with what Leblanc mentioned in his blog (see the sketch after this list):

  1. chunksize=40 (40 is the max I could pass for 52 columns, given the 2098 SQL Server parameter limit),
  2. method='multi',
  3. parallel=True
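
Concretely, the slower variants looked like the sketch below, reusing the engine from the code above; every combination of these keywords was slower for me than the plain call:

# same call as above, but with the non-default parameters that turned out to be slower
ddf.to_sql('PowerBI_Report', uri=to_sql_uri, if_exists='replace', index=False,
           chunksize=40, method='multi', parallel=True)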

Note: I realized that in addition to (or instead of) passing chunksize=40, I could have looped through my 33 dask dataframe partitions and written each chunk to_sql individually. This would have been more memory efficient and might also have been quicker. One partition was taking 45 seconds to 1 minute, while doing the whole dask dataframe at once took over an hour for all partitions. I will try looping through all partitions and post an update if that was faster. An hour seems like a lot, but I felt completely blocked when trying to compute with pandas, which took all night or ran out of memory, so this is a STEP UP. Honestly, I'm happy enough with this and am probably going to build an .exe with pyinstaller and have the .exe run daily so that this is fully automated, and go from there. But I thought this would be helpful for others, as I have struggled with various solutions over the past couple of weeks.
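
For completeness, the .csv fallback mentioned above would look roughly like the sketch below. The file path, table name, and CSV/BULK INSERT options are placeholders; it assumes the data contains no embedded delimiters or line breaks, and BULK INSERT needs the file to be readable by the SQL Server instance itself:

import sqlalchemy as sa

# write the dask dataframe to one csv (single_file=True concatenates the partitions into a single file)
ddf.to_csv(r'C:\exports\powerbi_report.csv', single_file=True, index=False)

# placeholder engine, same connection details as above
engine = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}')

bulk_insert = """
    TRUNCATE TABLE dbo.PowerBI_Report;
    BULK INSERT dbo.PowerBI_Report
    FROM 'C:\\exports\\powerbi_report.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a', TABLOCK);
"""
with engine.begin() as conn:  # one transaction for the truncate + reload
    conn.execute(sa.text(bulk_insert))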


Answer 1:


I tested writing the dataframe to SQL Server partition by partition, looping through them, versus writing it all at once, and the total time to complete was about the same in both cases.

import sqlalchemy as sa
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pbar = ProgressBar()
pbar.register()

# Windows authentication + fast_executemany=True (server, database and driver_name are placeholders)
to_sql_uri = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}', fast_executemany=True)

# From my question, I replaced the single commented-out line below with the loop underneath it
# to see if there was a significant increase in speed. There was not; it was about the same as the code in the question.
# ddf.to_sql('PowerBI_Report', uri=to_sql_uri, if_exists='replace', index=False)

for i in range(ddf.npartitions):
    partition = ddf.get_partition(i)
    if i == 0:
        # the first partition replaces (or creates) the target table
        partition.to_sql('CDR_PBI_Report', uri=to_sql_uri, if_exists='replace', index=False)
    else:
        # the remaining partitions append to it
        partition.to_sql('CDR_PBI_Report', uri=to_sql_uri, if_exists='append', index=False)



Answer 2:


Inserting the dask dataframe partition by partition shouldn't speed up the total time needed for the insert process.

Every time you call an insert, whether for a single partition or for the whole dataframe, .compute() is called to materialize the data in memory before it is written, and you can't optimize around that this way. I also doubt that extracting the partitions yourself is necessary; I think dask's to_sql() already uses that approach behind the scenes.
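
One way to see that dask already handles the partitions for you is to ask to_sql for its delayed tasks instead of executing them right away (a sketch that assumes the compute= keyword of dask's to_sql and reuses the placeholder engine from the question):

import dask

# compute=False returns the write as delayed tasks (one insert per partition) instead of running it now
delayed_writes = ddf.to_sql('PowerBI_Report', uri=to_sql_uri, if_exists='replace',
                            index=False, compute=False)

dask.compute(delayed_writes)  # executes the same partition-wise inserts that compute=True would run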



Source: https://stackoverflow.com/questions/62404502/using-dasks-new-to-sql-for-improved-efficiency-memory-speed-or-alternative-to
