I'm testing subprocess pipelines with Python. I'm aware that I can do what the programs below do in Python directly, but that's not the point. I just want to test the pipel
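Two principles need to be applied uniformly when working with large files in Python. First, any IO call can block, so each stage of the pipeline must run in its own thread or process. Second, reads and writes must be incremental, so that no stage waits for EOF before starting to make progress.
An alternative is to use nonblocking IO, though this is cumbersome in standard Python. See gevent for a lightweight threading library that implements the synchronous IO API using nonblocking primitives.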
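For comparison, here is a minimal sketch of the nonblocking alternative using only the standard library. It assumes Python 3.5+ on a POSIX system (os.set_blocking and selectors are not available in the older Python used in the rest of this answer), and iter_chunks is a hypothetical helper name, not part of any library.

```python
import os
import selectors
import subprocess
import sys


def iter_chunks(fd, chunk_size=65536):
    """Yield bytes from a file descriptor as they become available."""
    os.set_blocking(fd, False)          # reads now raise instead of blocking
    sel = selectors.DefaultSelector()
    sel.register(fd, selectors.EVENT_READ)
    try:
        while True:
            sel.select()                # sleep until fd is readable or at EOF
            try:
                data = os.read(fd, chunk_size)
            except BlockingIOError:     # woke up with nothing to read yet
                continue
            if not data:                # an empty read means EOF
                return
            yield data
    finally:
        sel.close()


# A Python child process stands in for an external pipeline stage.
child = subprocess.Popen(
    [sys.executable, "-c", "print('hello'); print('world')"],
    stdout=subprocess.PIPE,
)
# Joined here only to keep the demo short; real code would handle each chunk.
output = b"".join(iter_chunks(child.stdout.fileno()))
child.stdout.close()
child.wait()
```

With several descriptors registered on the same selector, one thread can service the whole pipeline, at the cost of doing the line splitting and buffering by hand.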
We'll construct a silly pipeline that is roughly
{cat /usr/share/dict/words} | grep -v not \
| {upcase, filtered tee to stderr} | cut -c 1-10 \
| {translate 'E' to '3'} | grep K | grep Z | {downcase}
where each stage in braces {} is implemented in Python while the others use standard external programs. TL;DR: See this gist.
We start with the expected imports.
#!/usr/bin/env python
from subprocess import Popen, PIPE
import sys, threading
All but the last Python-implemented stage of the pipeline need to go in a thread so that its IO does not block the others. These could instead run in Python subprocesses if you wanted them to actually run in parallel (avoiding the GIL).
def writer(output):
    # Feed the word list into the pipeline one line at a time.
    with open('/usr/share/dict/words') as words:
        for line in words:
            output.write(line)
    output.close()  # signals EOF to the next stage
def filter(input, output):
    for line in input:
        if 'k' in line and 'z' in line: # Selective 'tee'
            sys.stderr.write('### ' + line)
        output.write(line.upper())  # every line is upcased and passed on
    output.close()
def leeter(input, output):
    for line in input:
        output.write(line.replace('E', '3'))
    output.close()
Each of these needs to be put in its own thread, which we'll do using this convenience function.
def spawn(func, **kwargs):
    t = threading.Thread(target=func, kwargs=kwargs)
    t.start()
    return t
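To see how spawn forwards its keyword arguments, here is a small self-contained sketch that connects two Python stages through os.pipe with no external program involved. It assumes Python 3; producer, upcase, and results are hypothetical names used only for this illustration.

```python
import os
import threading


def spawn(func, **kwargs):
    t = threading.Thread(target=func, kwargs=kwargs)
    t.start()
    return t


def producer(output):
    for i in range(3):
        output.write("line %d\n" % i)
    output.close()              # closing the write end gives the reader EOF


def upcase(source, results):
    for line in source:         # iteration ends when the write end is closed
        results.append(line.upper())
    source.close()


r, w = os.pipe()
results = []
t1 = spawn(producer, output=os.fdopen(w, "w"))
t2 = spawn(upcase, source=os.fdopen(r), results=results)
t1.join()
t2.join()
```

Each keyword passed to spawn must match a parameter name of the target function, which is why the stage functions above take named input and output arguments.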
Create the external stages using Popen and the Python stages using spawn. The argument bufsize=-1 says to use the system default buffering (usually 4 kiB). This is generally faster than the default (unbuffered) or line buffering, but you'll want line buffering if you want to visually monitor the output without lags.
grepv = Popen(['grep','-v','not'], stdin=PIPE, stdout=PIPE, bufsize=-1)
cut = Popen(['cut','-c','1-10'], stdin=PIPE, stdout=PIPE, bufsize=-1)
grepk = Popen(['grep', 'K'], stdin=PIPE, stdout=PIPE, bufsize=-1)
grepz = Popen(['grep', 'Z'], stdin=grepk.stdout, stdout=PIPE, bufsize=-1)
twriter = spawn(writer, output=grepv.stdin)
tfilter = spawn(filter, input=grepv.stdout, output=cut.stdin)
tleeter = spawn(leeter, input=cut.stdout, output=grepk.stdin)
Assembled as above, all the buffers in the pipeline will fill up, but since nobody is reading from the end (grepz.stdout), they will all block. We could read the entire thing in one call to grepz.stdout.read(), but that would use a lot of memory for large files. Instead, we read incrementally.
for line in grepz.stdout:
    sys.stdout.write(line.lower())
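The same writer-thread plus incremental-read pattern can be condensed into a self-contained sketch. It assumes Python 3.5+, where pipes carry bytes, and uses a Python child process (the hypothetical CHILD script) in place of grep so that it does not depend on external tools.

```python
import subprocess
import sys
import threading

# Stand-in for an external filter such as grep: keep lines containing b"k".
CHILD = (
    "import sys\n"
    "for line in sys.stdin.buffer:\n"
    "    if b'k' in line:\n"
    "        sys.stdout.buffer.write(line)\n"
)


def writer(output, n):
    """Feed n lines into the pipe, then close it so the child sees EOF."""
    for i in range(n):
        output.write(b"k%d\n" % i if i % 2 else b"x%d\n" % i)
    output.close()


child = subprocess.Popen([sys.executable, "-c", CHILD],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# This is far more data than a pipe buffer holds, so writing it all from
# the main thread before reading anything would deadlock.
t = threading.Thread(target=writer, args=(child.stdin, 200000))
t.start()

count = 0
for line in child.stdout:       # incremental read: one line in memory at a time
    count += 1

t.join()
child.stdout.close()
child.wait()
```

Only the odd-numbered lines carry the b"k" prefix, so the child passes through half of the input while memory use in the parent stays constant.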
The threads and processes clean up once they reach EOF. We can explicitly clean up using
for t in [twriter, tfilter, tleeter]: t.join()
for p in [grepv, cut, grepk, grepz]: p.wait()
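Since wait() returns each process's exit status, the explicit cleanup is also a convenient place to detect a failed stage. The sketch below assumes Python 3 and uses Python child processes as stand-ins; wait_all is a hypothetical helper. Note that for grep a status of 1 only means that no lines matched, so a real check would need per-program rules.

```python
import subprocess
import sys


def wait_all(procs):
    """Wait for every (name, process) pair and return {name: returncode}."""
    return {name: p.wait() for name, p in procs}


ok = subprocess.Popen([sys.executable, "-c", "pass"])
bad = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])

codes = wait_all([("ok", ok), ("bad", bad)])
failed = [name for name, code in codes.items() if code != 0]
```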
Internally, subprocess.Popen calls fork, configures the pipe file descriptors, and calls exec. The child process from fork has copies of all file descriptors in the parent process, and both copies will need to be closed before the corresponding reader will get EOF. This can be fixed by manually closing the pipes (either by close_fds=True or a suitable preexec_fn argument to subprocess.Popen) or by setting the FD_CLOEXEC flag to have exec automatically close the file descriptor. This flag is set automatically in Python-2.7 and later (see issue12786). We can get the Python-2.7 behavior in earlier versions of Python by calling
p._set_cloexec_flag(p.stdin)
before passing p.stdin as an argument to a subsequent subprocess.Popen.
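The flag can also be set through the public fcntl module rather than a private Popen method. The following is a minimal sketch, assuming a POSIX system and Python 3.4+ (needed only for os.set_inheritable, which the demo uses to recreate the old inheritable default); set_cloexec is a hypothetical helper name.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark fd close-on-exec so that exec'd children do not inherit it."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


r, w = os.pipe()
# Python 3.4+ creates pipes non-inheritable; undo that to mimic older behavior.
os.set_inheritable(w, True)
before = bool(fcntl.fcntl(w, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

set_cloexec(w)
after = bool(fcntl.fcntl(w, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

os.close(r)
os.close(w)
```

With the flag set on the write end, a later child process no longer holds a stray copy of it, so the reader sees EOF as soon as the intended writer closes its end.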