问题
I have a piece of Python code that works, but is very slow to write a Dataframe directly to an SFTP location. I am using pysftp and pandas.to_csv()
to achieve the task of reading an Excel file from a remote location, run a few simple transformations and write it over to an SFTP location.
The code snippet is shared below which, takes 4 minutes 30 seconds precisely, to write 100 records to the SFTP location. An average Dataframe that I process has a maximum of 20 columns.
def dataframe_sftp_transfer(df,destination_path):
cnopts = CnOpts()
cnopts.hostkeys = None
sftp = Connection('sftp3.server.com'
,username= 'user'
,password = 'pwd123'
,cnopts=cnopts)
with sftp.open(destination_path,'w+') as f:
chunksize = 100
with tqdm(total=len(df)) as progbar:
df.to_csv(f,sep='~',index=False,chunksize=chunksize)
progbar.update(chunksize)
Is there a better/faster way to achieve the aforesaid? Shouldn't writing files of the stated magnitude take only a couple of minutes?
Using a tool like FileZilla to put files in the remote SFTP location works much faster but, that sadly takes away any form of automation.
回答1:
You open the remote file without buffering. That way, every time the df.to_csv
writes to the file, Paramiko/pysftp sends a request to the SFTP server and waits for a response. I do not know internals of df.to_csv
, but it's likely it does one write per line (if not more). That would explain, why the upload is so slow. Particularly, if your connection to the server has high latency.
To enable buffered writes, use bufsize
parameter of Connection.open:
with sftp.open(destination_path, 'w+', 32768) as f:
Obligatory warning: Do not set cnopts.hostkeys = None
, unless you do not care about security. For the correct solution see Verify host key with pysftp.
来源:https://stackoverflow.com/questions/58111798/writing-to-a-file-on-sftp-server-opened-using-pysftp-open-method-is-slow