Subprocess, repeatedly write to STDIN while reading from STDOUT (Windows)

问题

I want to call an external process from python. The process I'm calling reads an input string and gives tokenized result, and waits for another input (binary is MeCab tokenizer if that helps).

I need to tokenize thousands of lines of string by calling this process.

Problem is Popen.communicate() works but waits for the process to die before giving out the STDOUT result. I don't want to keep closing and opening new subprocesses for thousands of times. (And I don't want to send the whole text, it may easily grow over tens of thousands of -long- lines in future.)

from subprocess import PIPE, Popen

with Popen("mecab -O wakati".split(), stdin=PIPE,
           stdout=PIPE, stderr=PIPE, close_fds=False,
           universal_newlines=True, bufsize=1) as proc:
    output, errors = proc.communicate("foobarbaz")

print(output)

I've tried reading proc.stdout.read() instead of using communicate but it is blocked by stdin and doesn't return any results before proc.stdin.close() is called. Which, again means I need to create a new process everytime.

I've tried to implement queues and threads from a similar question as below, but it either doesn't return anything so it's stuck on While True, or when I force stdin buffer to fill by repeteadly sending strings, it outputs all the results at once.

from subprocess import PIPE, Popen
from threading import Thread
from queue import Queue, Empty

def enqueue_output(out, queue):
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

p = Popen('mecab -O wakati'.split(), stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)
q = Queue()
t = Thread(target=enqueue_output, args=(p.stdout, q))
t.daemon = True
t.start()

p.stdin.write("foobarbaz")
while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break

Also looked at the Pexpect route, but it's windows port doesn't support some important modules (pty based ones), so I couldn't apply that as well.

I know there are a lot of similar answers, and I've tried most of them. But nothing I've tried seems to work on Windows.

EDIT: some info on the binary I'm using, when I use it via command line. It runs and tokenizes sentences I give, until I'm done and forcibly close the program.

(...waits_for_input -> input_recieved -> output -> waits_for_input...)

Thanks.

回答1:

If mecab uses C FILE streams with default buffering, then piped stdout has a 4 KiB buffer. The idea here is that a program can efficiently use small, arbitrary-sized reads and writes to the buffers, and the underlying standard I/O implementation handles automatically filling and flushing the much-larger buffers. This minimizes the number of required system calls and maximizes throughput. Obviously you don't want this behavior for interactive console or terminal I/O or writing to stderr. In these cases the C runtime uses line-buffering or no buffering.

A program can override this behavior, and some do have command-line options to set the buffer size. For example, Python has the "-u" (unbuffered) option and PYTHONUNBUFFERED environment variable. If mecab doesn't have a similar option, then there isn't a generic workaround on Windows. The C runtime situation is too complicated. A Windows process can link statically or dynamically to one or several CRTs. The situation on Linux is different since a Linux process generally loads a single system CRT (e.g. GNU libc.so.6) into the global symbol table, which allows an LD_PRELOAD library to configure the C FILE streams. Linux stdbuf uses this trick, e.g. stdbuf -o0 mecab -O wakati.

One option to experiment with is to call CreateConsoleScreenBuffer and get a file descriptor for the handle from msvcrt.open_osfhandle. Then pass this as stdout instead of using a pipe. The child process will see this as a TTY and use line buffering instead of full buffering. However managing this is non-trivial. It would involve reading (i.e. ReadConsoleOutputCharacter) a sliding buffer (call GetConsoleScreenBufferInfo to track the cursor position) that's actively written to by another process. This kind of interaction isn't something that I've ever needed or even experimented with. But I have used a console screen buffer non-interactively, i.e. reading the buffer after the child has exited. This allows reading up to 9,999 lines of output from programs that write directly to the console instead of stdout, e.g. programs that call WriteConsole or open "CON" or "CONOUT$".

回答2:

Here is a workaround for Windows. This should also be adaptable to other operating systems. Download a console emulator like ConEmu (https://conemu.github.io/) Start it instead of mecab as your subprocess.

p = Popen(['conemu'] , stdout=PIPE, stdin=PIPE,
      universal_newlines=True, bufsize=1, close_fds=False)

Then send the following as the first input:

mecab -O wakafi & exit

You are letting the emulator handle the file output issues for you; the way it normally does when you manually interact with it. I am still looking into this; but already looks promising...

Only problem is conemu is a gui application; so if no other way to hook into its input and output, one might have to tweak and rebuild from sources (it's open source). I haven't found any other way; but this should work.

I have asked the question about running in some sort of console mode here; so you can check that thread also for something. The author Maximus is on SO...

来源：https://stackoverflow.com/questions/42991214/subprocess-repeatedly-write-to-stdin-while-reading-from-stdout-windows

标签

python

windows

python-3.x

subprocess

mecab