Pipe a large amount of data to stdin while using subprocess.Popen

花落未央 2020-12-08 11:08

I'm kind of struggling to understand what the Pythonic way of solving this simple problem is.

My problem is quite simple. If you use the following code it will hang. T

10 answers
  • 2020-12-08 11:58

    Using aiofiles and asyncio in Python 3.5:

    It is a bit complicated, but you only need 1024 bytes of memory at a time to write to stdin!

    import asyncio
    import aiofiles
    import sys
    from os.path import dirname, join, abspath
    import subprocess as sb
    
    
    THIS_DIR = abspath(dirname(__file__))
    SAMPLE_FILE = join(THIS_DIR, '../src/hazelnut/tests/stuff/sample.mp4')
    DEST_PATH = '/home/vahid/Desktop/sample.mp4'
    
    
    async def async_file_reader(f, buffer):
        async for l in f:
            if l:
                buffer.append(l)
            else:
                break
        print('reader done')
    
    
    async def async_file_writer(source_file, target_file):
        length = 0
        while True:
            input_chunk = await source_file.read(1024)
            if input_chunk:
                length += len(input_chunk)
                target_file.write(input_chunk)
                await target_file.drain()
            else:
                target_file.write_eof()
                break
    
        print('writer done: %s' % length)
    
    
    async def main():
        dir_name = dirname(DEST_PATH)
        remote_cmd = 'ssh localhost mkdir -p %s && cat - > %s' % (dir_name, DEST_PATH)
    
        stdout, stderr = [], []
        async with aiofiles.open(SAMPLE_FILE, mode='rb') as f:
            cmd = await asyncio.create_subprocess_shell(
                remote_cmd,
                stdin=sb.PIPE,
                stdout=sb.PIPE,
                stderr=sb.PIPE,
            )
    
            await asyncio.gather(*(
                async_file_reader(cmd.stdout, stdout),
                async_file_reader(cmd.stderr, stderr),
                async_file_writer(f, cmd.stdin)
            ))
    
            print('EXIT STATUS: %s' % await cmd.wait())
    
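        # The readers collect bytes lines (each still ending in a newline),
        # so join them as bytes and decode once.
        stdout, stderr = b''.join(stdout).decode(), b''.join(stderr).decode()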
    
        if stdout:
            print(stdout)
    
        if stderr:
            print(stderr, file=sys.stderr)
    
    
    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main())
    

    Result:

    writer done: 383631
    reader done
    reader done
    EXIT STATUS: 0
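
    The same pattern can be tried without aiofiles or ssh. What follows is a minimal sketch, assuming Python 3.7+ (for asyncio.run) and a local cat as a stand-in for the remote command. The names feed, collect, and pipe_through are illustrative and do not come from the answer above.

```python
import asyncio


async def feed(writer, chunks):
    """Write each chunk to the child's stdin, waiting for the pipe to drain."""
    for chunk in chunks:
        writer.write(chunk)
        await writer.drain()  # back-pressure: wait instead of buffering everything
    writer.close()  # closing stdin signals EOF to the child


async def collect(reader, chunk_size=1024):
    """Read the child's stdout in small chunks until EOF."""
    parts = []
    while True:
        data = await reader.read(chunk_size)
        if not data:
            break
        parts.append(data)
    return b"".join(parts)


async def pipe_through(args, chunks):
    """Run args, stream chunks into it, and return (exit status, stdout bytes)."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )
    # Writer and reader run concurrently, so neither pipe can fill up and stall.
    _, output = await asyncio.gather(feed(proc.stdin, chunks), collect(proc.stdout))
    return await proc.wait(), output


if __name__ == "__main__":
    data = (b"x" * 1024 for _ in range(1000))  # about 1 MB, produced lazily
    status, out = asyncio.run(pipe_through(["cat"], data))
    print(status, len(out))
```

    Only the input side is memory-bounded here: collect keeps the whole output in a list, which is fine for a demonstration but would need to be replaced by incremental processing for very large outputs.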
    
  • 2020-12-08 12:04

    Here is an example (Python 3) of reading one record at a time from gzip using a pipe:

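    from subprocess import Popen, PIPE

    # Pass the command as an argument list; a single string would need shell=True.
    cmd = ['gzip', '-dc', 'compressed_file.gz']
    pipe = Popen(cmd, stdout=PIPE).stdout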
    
    for line in pipe:
        print(":", line.decode(), end="")
    

    I know there is a standard module for that (gzip); this is just meant as an example. You can read the whole output in one go (like shell backticks) using the communicate method, but obviously you have to be careful about memory use.
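
    For completeness, the standard gzip module can do the same line-by-line read without a subprocess. A minimal sketch; the helper name iter_gzip_lines and the sample file are placeholders made up for this illustration.

```python
import gzip
import os
import tempfile


def iter_gzip_lines(path, encoding="utf-8"):
    """Yield decoded lines from a gzip file one at a time."""
    # Mode "rt" decompresses and decodes lazily, so the whole file is
    # never held in memory at once.
    with gzip.open(path, "rt", encoding=encoding) as f:
        for line in f:
            yield line


if __name__ == "__main__":
    # Build a small sample file so the sketch is self-contained.
    path = os.path.join(tempfile.mkdtemp(), "sample.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("first\nsecond\n")

    for line in iter_gzip_lines(path):
        print(":", line, end="")
```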

    Here is an example (Python 3 again) of writing records to the lp(1) program on Linux:

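    cmd = ['lp', '-']  # argument list; a plain string would need shell=True
    proc = Popen(cmd, stdin=PIPE)
    proc.communicate(some_data.encode())  # some_data: the text to send to the printer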
    
  • 2020-12-08 12:06

    I was looking for example code that iterates over a process's output incrementally while the process consumes its input from a provided iterator (also incrementally). Basically:

    import string
    import random
    
    # That's what I consider a very useful function, though didn't
    # find any existing implementations.
    def process_line_reader(args, stdin_lines):
        # args - command to run, same as subprocess.Popen
        # stdin_lines - iterable with lines to send to process stdin
        # returns - iterable with lines received from process stdout
        pass
    
    # Returns iterable over n random strings. n is assumed to be infinity if negative.
    # Just an example of function that returns potentially unlimited number of lines.
    def random_lines(n, M=8):
        while 0 != n:
            yield "".join(random.choice(string.letters) for _ in range(M))
            if 0 < n:
                n -= 1
    
    # That's what I consider to be a very convenient use case for
    # function proposed above.
    def print_many_uniq_numbered_random_lines():
        i = 0
        for line in process_line_reader(["uniq", "-i"], random_lines(100500 * 9000)):
            # Key idea here is that `process_line_reader` will feed random lines into
            # `uniq` process stdin as lines are consumed from returned iterable.
            print "#%i: %s" % (i, line)
            i += 1
    

    Some of the solutions suggested here do this with threads (which is not always convenient) or with asyncio (which is not available in Python 2.x). Below is an example of a working Python 2 implementation that needs neither.

    import errno
    import subprocess
    import os
    import fcntl
    import select
    
    class nonblocking_io(object):
        def __init__(self, f):
            self._fd = -1
            if type(f) is int:
                self._fd = os.dup(f)
                os.close(f)
            elif type(f) is file:
                self._fd = os.dup(f.fileno())
                f.close()
            else:
                raise TypeError("Only accept file objects or integer file descriptors")
            flag = fcntl.fcntl(self._fd, fcntl.F_GETFL)
            fcntl.fcntl(self._fd, fcntl.F_SETFL, flag | os.O_NONBLOCK)
        def __enter__(self):
            return self
        def __exit__(self, type, value, traceback):
            self.close()
            return False
        def fileno(self):
            return self._fd
        def close(self):
            if 0 <= self._fd:
                os.close(self._fd)
                self._fd = -1
    
    class nonblocking_line_writer(nonblocking_io):
        def __init__(self, f, lines, autoclose=True, buffer_size=16*1024, encoding="utf-8", linesep=os.linesep):
            super(nonblocking_line_writer, self).__init__(f)
            self._lines = iter(lines)
            self._lines_ended = False
            self._autoclose = autoclose
            self._buffer_size = buffer_size
            self._buffer_offset = 0
            self._buffer = bytearray()
            self._encoding = encoding
            self._linesep = bytearray(linesep, encoding)
        # Returns False when `lines` iterable is exhausted and all pending data is written
        def continue_writing(self):
            while True:
                if self._buffer_offset < len(self._buffer):
                    n = os.write(self._fd, self._buffer[self._buffer_offset:])
                    self._buffer_offset += n
                    if self._buffer_offset < len(self._buffer):
                        return True
                if self._lines_ended:
                    if self._autoclose:
                        self.close()
                    return False
                self._buffer[:] = []
                self._buffer_offset = 0
                while len(self._buffer) < self._buffer_size:
                    line = next(self._lines, None)
                    if line is None:
                        self._lines_ended = True
                        break
                    self._buffer.extend(bytearray(line, self._encoding))
                    self._buffer.extend(self._linesep)
    
    class nonblocking_line_reader(nonblocking_io):
        def __init__(self, f, autoclose=True, buffer_size=16*1024, encoding="utf-8"):
            super(nonblocking_line_reader, self).__init__(f)
            self._autoclose = autoclose
            self._buffer_size = buffer_size
            self._encoding = encoding
            self._file_ended = False
            self._line_part = ""
        # Returns (lines, more) tuple, where lines is iterable with lines read and more will
        # be set to False after EOF.
        def continue_reading(self):
            lines = []
            while not self._file_ended:
                data = os.read(self._fd, self._buffer_size)
                if 0 == len(data):
                    self._file_ended = True
                    if self._autoclose:
                        self.close()
                    if 0 < len(self._line_part):
                        lines.append(self._line_part.decode(self._encoding))
                        self._line_part = ""
                    break
                for line in data.splitlines(True):
                    self._line_part += line
                    if self._line_part.endswith(("\n", "\r")):
                        lines.append(self._line_part.decode(self._encoding).rstrip("\n\r"))
                        self._line_part = ""
                if len(data) < self._buffer_size:
                    break
            return (lines, not self._file_ended)
    
    class process_line_reader(object):
        def __init__(self, args, stdin_lines):
            self._p = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
            self._reader = nonblocking_line_reader(self._p.stdout)
            self._writer = nonblocking_line_writer(self._p.stdin, stdin_lines)
            self._iterator = self._communicate()
        def __iter__(self):
            return self._iterator
        def __enter__(self):
            return self._iterator
        def __exit__(self, type, value, traceback):
            self.close()
            return False
        def _communicate(self):
            read_set = [self._reader]
            write_set = [self._writer]
            while read_set or write_set:
                try:
                    rlist, wlist, xlist = select.select(read_set, write_set, [])
                except select.error, e:
                    if e.args[0] == errno.EINTR:
                        continue
                    raise
                if self._reader in rlist:
                    stdout_lines, more = self._reader.continue_reading()
                    for line in stdout_lines:
                        yield line
                    if not more:
                        read_set.remove(self._reader)
                if self._writer in wlist:
                    if not self._writer.continue_writing():
                        write_set.remove(self._writer)
            self.close()
        def lines(self):
            return self._iterator
        def close(self):
            if self._iterator is not None:
                self._reader.close()
                self._writer.close()
                self._p.wait()
                self._iterator = None
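
    For comparison, the thread-based alternative mentioned above can be sketched for Python 3 in far fewer lines. This is only an illustration under stated assumptions: the name threaded_line_reader is hypothetical, lines are assumed to be str without trailing newlines, and the caller is expected to consume the iterator to the end.

```python
import subprocess
import threading


def threaded_line_reader(args, stdin_lines, encoding="utf-8"):
    """Yield stdout lines of `args` while a helper thread feeds its stdin."""
    proc = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        try:
            for line in stdin_lines:
                proc.stdin.write((line + "\n").encode(encoding))
        except BrokenPipeError:
            pass  # the child exited before consuming all of its input
        finally:
            try:
                proc.stdin.close()  # flushes, then signals EOF to the child
            except BrokenPipeError:
                pass

    # Writing happens in the background so that reading below never blocks it.
    writer = threading.Thread(target=feed, daemon=True)
    writer.start()
    try:
        for raw in proc.stdout:
            yield raw.decode(encoding).rstrip("\r\n")
    finally:
        proc.stdout.close()
        writer.join()
        proc.wait()


if __name__ == "__main__":
    numbered = enumerate(threaded_line_reader(["uniq", "-i"], ["a", "A", "b"]))
    for i, line in numbered:
        print("#%i: %s" % (i, line))
```

    Unlike the select-based version, input is pulled from the iterable as fast as the pipe accepts it rather than strictly in step with consumption of the output, which is the usual trade-off of the threaded approach.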
    
  • 2020-12-08 12:13

    Now I know this is not going to satisfy the purist in you completely, as the input will have to fit in memory, and you have no option to work interactively with input-output, but at least this works fine on your example. The communicate method optionally takes the input as an argument, and if you feed your process its input this way, it will work.

    import subprocess
    
    proc = subprocess.Popen(['cat','-'],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            )
    
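    # Encode to bytes so the same code runs on both Python 2 and Python 3,
    # and avoid shadowing the built-in input().
    data = "".join('{0:d}\n'.format(i) for i in range(100000)).encode('ascii')
    output = proc.communicate(data)[0]
    print(output.decode('ascii'))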
    

    As for the larger problem, you can subclass Popen, rewrite __init__ to accept stream-like objects as arguments to stdin, stdout, stderr, and rewrite the _communicate method (hairy for cross-platform support: you need to do it twice, see the subprocess.py source) to call read() on the stdin stream and write() the output to the stdout and stderr streams. What bothers me about this approach is that, as far as I know, it hasn't already been done. When obvious things have not been done before, there's usually a reason (it doesn't work as intended), but I can't see why it shouldn't, apart from the fact that you need the streams to be thread-safe on Windows.
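
    The stream idea can be approximated without subclassing Popen at all. The following is a rough sketch, not the subclass described above: run_with_streams is a hypothetical helper that copies a binary stream-like object into the child's stdin on a background thread while the calling thread copies stdout into another stream. Stderr is left untouched, and Python 3 is assumed.

```python
import io
import shutil
import subprocess
import threading


def run_with_streams(args, in_stream, out_stream, chunk_size=64 * 1024):
    """Copy in_stream to the child's stdin and its stdout to out_stream."""
    proc = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        try:
            # Reads chunk_size bytes at a time, so the input never has to
            # fit in memory as a whole.
            shutil.copyfileobj(in_stream, proc.stdin, chunk_size)
        except BrokenPipeError:
            pass  # the child stopped reading early
        finally:
            try:
                proc.stdin.close()  # signals EOF to the child
            except BrokenPipeError:
                pass

    writer = threading.Thread(target=feed, daemon=True)
    writer.start()
    # Drain stdout on this thread while the writer thread feeds stdin.
    shutil.copyfileobj(proc.stdout, out_stream, chunk_size)
    writer.join()
    proc.stdout.close()
    return proc.wait()


if __name__ == "__main__":
    source = io.BytesIO(b"".join(b"%d\n" % i for i in range(100000)))
    sink = io.BytesIO()
    status = run_with_streams(["cat", "-"], source, sink)
    print(status, len(sink.getvalue()))
```

    Any object with a binary read() works as in_stream and any object with write() works as out_stream, for example files opened in 'rb' and 'wb' mode.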
