Slow upload of many small files with SFTP

Submitted by 核能气质少年 on 2020-12-30 03:00:21

Question


When uploading 100 files of 100 bytes each with SFTP, it takes 17 seconds here (measured after the connection is established, so the initial connection time is not counted). That is 17 seconds to transfer only 10 KB, i.e. 0.59 KB/s!

I know that sending SSH commands to open, write, close, etc. probably creates a big overhead, but still, is there a way to speed up the process when sending many small files with SFTP?

Or is there a special mode in paramiko / pysftp that keeps all pending write operations in a memory buffer (say, all operations from the last 2 seconds) and then performs them in one grouped pass over SSH/SFTP? This would avoid waiting for the ping time between each operation.

Note:

  • I have a ~100 KB/s upload connection (measured 0.8 Mbit/s upload speed) with a 40 ms ping time to the server
  • Of course, if instead of sending 100 files of 100 bytes I send 1 file of 10 KB, it takes < 1 second
  • I don't want to have to run a binary program on the remote side; only SFTP commands are accepted

import pysftp, time, os
with pysftp.Connection('1.2.3.4', username='root', password='') as sftp:
    with sftp.cd('/tmp/'):
        t0 = time.time()
        for i in range(100):
            print(i)
            with sftp.open('test%i.txt' % i, 'wb') as f:   # even worse in a+ append mode: it takes 25 seconds
                f.write(os.urandom(100))
        print(time.time() - t0)
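
A rough estimate suggests why the loop above is bound by latency rather than bandwidth. The sketch below is only a back-of-the-envelope model: the figure of 4 round trips per file is an assumption (the real count depends on the client library), and the function names are made up for illustration. The other numbers come from the notes above.

```python
# Back-of-the-envelope latency model for the upload loop.
# ASSUMPTION: each file costs about 4 sequential SFTP round trips
# (for example open, write, close, plus one extra status/stat call).

def estimated_seconds(num_files, round_trips_per_file, rtt_seconds):
    """Time spent purely waiting for replies, ignoring bandwidth."""
    return num_files * round_trips_per_file * rtt_seconds

def transfer_seconds(total_bytes, bytes_per_second):
    """Time needed to push the payload itself through the link."""
    return total_bytes / bytes_per_second

# 100 files, 40 ms ping, ~100 KB/s upload, 100 bytes per file.
latency = estimated_seconds(num_files=100, round_trips_per_file=4, rtt_seconds=0.040)
payload = transfer_seconds(total_bytes=100 * 100, bytes_per_second=100 * 1000)

print(f"waiting on round trips: {latency:.1f} s")
print(f"sending the data:       {payload:.1f} s")
```

Under these assumptions almost all of the time goes to waiting for replies, which is in the same range as the measured 17 seconds.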

Answer 1:


I'd suggest parallelizing the upload using multiple connections from multiple threads. That's an easy and reliable solution.
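
A minimal sketch of the multi-connection approach, assuming a hypothetical `connect` factory that returns a context manager with a `put(localpath, remotepath)` method (pysftp.Connection has that shape). The helper names `split_round_robin`, `upload_chunk` and `parallel_upload` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def split_round_robin(items, parts):
    """Distribute items over `parts` lists: item k goes to list k % parts."""
    return [items[k::parts] for k in range(parts)]

def upload_chunk(connect, pairs):
    """Open one connection and upload every (local, remote) pair over it."""
    with connect() as sftp:
        for local_path, remote_path in pairs:
            sftp.put(local_path, remote_path)
    return len(pairs)

def parallel_upload(connect, pairs, workers=8):
    """Upload pairs using up to `workers` threads, one connection per thread."""
    chunks = [c for c in split_round_robin(list(pairs), workers) if c]
    if not chunks:
        return 0
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # Each worker thread gets its own connection, so no SFTP session
        # is shared between threads.
        return sum(pool.map(lambda chunk: upload_chunk(connect, chunk), chunks))
```

With pysftp, the factory could be something like lambda: pysftp.Connection('1.2.3.4', username='root', password=''). Each connection pays its own handshake cost, and the server may limit concurrent sessions, so the worker count is worth tuning.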


If you want to do it the hard way by buffering the requests, you can base your solution on the following naive example.

The example:

  • Queues 100 file open requests;
  • As it reads the responses to the open requests, it queues write requests;
  • As it reads the responses to the write requests, it queues close requests.

If I do a plain SFTPClient.put for 100 files, it takes about 10-12 seconds. Using the code below, I achieve the same result about 50-100 times faster.

But! The code is really naive:

  • It expects that the server responds to the requests in the same order in which they were sent. Indeed, the majority of SFTP servers (including the de facto standard OpenSSH) respond in the same order. But according to the SFTP specification, an SFTP server is free to respond in any order.
  • The code expects that each file is read in one go (upload.localhandle.read(32*1024)). That's true for small files only.
  • The code expects that the SFTP server can handle 100 parallel requests and 100 open files. That's not a problem for most servers, as they process the requests in order. And 100 open files should not be a problem for a regular server.
  • You cannot do that for an unlimited number of files, though. You have to queue the files somehow to keep the number of outstanding requests in check. Actually, even these 100 requests are too many.
  • The code uses non-public methods of the SFTPClient class.
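  • I do not do Python. There are definitely ways to code this more elegantly.

import paramiko
import paramiko.sftp

# File offsets must be sent as 64-bit integers. Paramiko 2.x uses `long` from
# py3compat for that; py3compat was removed in Paramiko 3.0, where
# paramiko.sftp.int64 serves the same purpose.
try:
    from paramiko.py3compat import long
except ImportError:
    from paramiko.sftp import int64 as long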
 
ssh = paramiko.SSHClient()
ssh.connect(...)
 
sftp = ssh.open_sftp()
                      
class Upload:
   def __init__(self):
       pass

uploads = []

for i in range(0, 100):
    print(f"sending open request {i}")
    upload = Upload()
    upload.i = i
    upload.localhandle = open(f"{i}.dat", "rb")  # binary mode: SFTP writes take bytes
    upload.remotepath = f"/remote/path/{i}.dat"
    imode = \
        paramiko.sftp.SFTP_FLAG_CREATE | paramiko.sftp.SFTP_FLAG_TRUNC | \
        paramiko.sftp.SFTP_FLAG_WRITE
    attrblock = paramiko.SFTPAttributes()
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_OPEN, upload.remotepath, \
            imode, attrblock)
    uploads.append(upload)

for upload in uploads:
    print(f"reading open response {upload.i}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_HANDLE:
        raise paramiko.SFTPError("Expected handle")
    upload.handle = msg.get_binary()

    print(f"sending write request {upload.i} to handle {upload.handle}")
    data = upload.localhandle.read(32*1024)
    upload.localhandle.close()  # the whole (small) file is in memory now
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_WRITE, \
            upload.handle, long(0), data)

for upload in uploads:
    print(f"reading write response {upload.i} {upload.request}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_STATUS:
        raise paramiko.SFTPError("Expected status")
    print(f"closing {upload.i} {upload.handle}")
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_CLOSE, upload.handle)

for upload in uploads:
    print(f"reading close response {upload.i} {upload.request}")
    sftp._read_response(upload.request)
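
Two of the caveats above (the single read per file and the unbounded number of outstanding requests) can be eased with small helpers. This is only a sketch: `batched` and `iter_chunks` are hypothetical names, and wiring them into the request loops above is left out.

```python
import io

def batched(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def iter_chunks(fileobj, chunk_size=32 * 1024):
    """Yield (offset, data) pairs covering the whole file, one per write request."""
    offset = 0
    while True:
        data = fileobj.read(chunk_size)
        if not data:
            break
        yield offset, data
        offset += len(data)

# 250 files would be processed as batches of 100, 100 and 50.
batch_sizes = [len(b) for b in batched(list(range(250)), 100)]
print(batch_sizes)

# A 70000-byte file needs three write requests at increasing offsets.
chunk_plan = [(off, len(d)) for off, d in iter_chunks(io.BytesIO(b"x" * 70000))]
print(chunk_plan)
```

The open/write/close phases would then run once per batch, and each (offset, data) pair would become one CMD_WRITE request with that offset instead of the fixed offset 0.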



Answer 2:


With the following method (100 asynchronous tasks), the upload finishes in ~0.5 seconds, which is a massive improvement.

import asyncio, asyncssh  # pip install asyncssh
async def main():
    async with asyncssh.connect('1.2.3.4', username='root', password='') as conn:
        async with conn.start_sftp_client() as sftp:
            print('connected')
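            # gather() accepts coroutines directly and propagates exceptions;
            # passing bare coroutines to asyncio.wait() fails on Python 3.11+.
            await asyncio.gather(*(sftp.put('files/test%i.txt' % i) for i in range(100)))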
asyncio.run(main())
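
If 100 simultaneous requests turn out to be too many for a server, the number of uploads in flight can be capped with a semaphore. The sketch below is independent of asyncssh: `gather_bounded` is a hypothetical helper, and `fake_put` only simulates an upload with a short sleep.

```python
import asyncio

async def gather_bounded(make_coros, limit):
    """Run the coroutines produced by the factories in `make_coros`,
    keeping at most `limit` of them in flight at any time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(make_coro):
        async with semaphore:
            return await make_coro()

    return await asyncio.gather(*(run_one(m) for m in make_coros))

async def demo():
    in_flight = 0
    peak = 0

    async def fake_put(name):
        # Stand-in for an upload; records how many run at the same time.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        in_flight -= 1
        return name

    names = [f"test{i}.txt" for i in range(100)]
    results = await gather_bounded([lambda n=n: fake_put(n) for n in names], limit=16)
    return results, peak

results, peak = asyncio.run(demo())
print(len(results), "uploads, peak concurrency", peak)
```

With the client from the answer, the call could look like gather_bounded([lambda i=i: sftp.put('files/test%i.txt' % i) for i in range(100)], limit=16). Factories are passed instead of coroutines so that no upload is created before a slot is free.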

I'll explore the source, but I still don't know whether it groups many operations into a few SSH transactions or just runs the commands in parallel.
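
Whichever it is, overlapping the waits is enough to explain a large speed-up. The simulation below does not use asyncssh and says nothing about its internals; it assumes three round trips per file and a made-up round-trip time, and compares waiting for each reply in turn with letting all the waits overlap.

```python
import asyncio
import time

RTT = 0.01  # simulated round-trip time in seconds (hypothetical value)

async def fake_upload():
    # ASSUMPTION: open, write and close each cost one round trip.
    for _ in range(3):
        await asyncio.sleep(RTT)

async def sequential(n):
    # One file after another: total wait is about n * 3 * RTT.
    for _ in range(n):
        await fake_upload()

async def concurrent(n):
    # All files at once: the waits overlap, total is about 3 * RTT.
    await asyncio.gather(*(fake_upload() for _ in range(n)))

def timed(coro):
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start

seq = timed(sequential(20))
con = timed(concurrent(20))
print(f"sequential: {seq:.2f} s, concurrent: {con:.2f} s")
```

The sequential time grows with the number of files, while the concurrent time stays close to the cost of a single file.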



Source: https://stackoverflow.com/questions/65106405/slow-upload-of-many-small-files-with-sftp
