Question
I am calling a Perl script from Python 3.7.3, with subprocess. The Perl script that is called is this one:
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
And the code I am using to call it is:
import sys
import os
import subprocess
import threading
def copy_out(source, dest):
    for line in source:
        dest.write(line)

num_threads=4

args = ["perl", "tokenizer.perl",
        "-l", "en",
        "-threads", str(num_threads)
        ]

with open(os.devnull, "wb") as devnull:
    tokenizer = subprocess.Popen(args,
                                 stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=devnull)

tokenizer_thread = threading.Thread(target=copy_out, args=(tokenizer.stdout, open("outfile", "wb")))
tokenizer_thread.start()

num_lines = 100000

for _ in range(num_lines):
    tokenizer.stdin.write(b'Random line.\n')

tokenizer.stdin.close()
tokenizer_thread.join()
tokenizer.wait()
On my system, this leads to the following error:
Traceback (most recent call last):
  File "t.py", line 27, in <module>
    tokenizer.stdin.write(b'Random line.\n')
BrokenPipeError: [Errno 32] Broken pipe
I investigated this, and it turns out that if the -threads argument for the subprocess is 1, the error is not thrown. As I don't want to give up on multithreading in the child process, my question is:
What is causing this error in the first place? "Who" is to blame for it: OS / environment, my Python code, the Perl code?
I am glad to provide more information if needed.
EDIT: To respond to some comments:
- Running the Perl script is only possible if you also have this file: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
- The Perl script actually processes several thousand lines before the process fails. In my Python script above, if I make num_lines smaller, I do not get this error anymore.
- If I invoke this Perl script directly on the command line, without any Python, it works fine: no matter how many (Perl) threads or lines of input.
- My Python variable num_threads only controls the number of threads of the Perl subprocess. I never start several Python threads, just one.
EDIT 2: In my first edit, I incorrectly stated that this Perl program runs fine when called with e.g. -threads 4 from the command line: there, a different perl binary was used, one compiled with multithreading support. If I use the same perl that is invoked from Python, I get:
$ cat [file with 100000 lines] | [correct perl] tokenizer.perl -l en -threads 4
Can't locate object method "new" via package "Thread" at tokenizer.perl line 130, <STDIN> line 8000.
Which no doubt would have helped me debug this better.
Answer 1:
The problem seems to be that the Perl script crashes if perl does not support threads. You can check whether your perl supports threads by running:
perl -MConfig -E 'say "Threads supported" if $Config{useithreads}'
In my case, the output was empty, so I installed a new perl with thread support:
perlbrew install perl-5.30.0 --as=5.30.0-threads -Dusethreads
perlbrew use 5.30.0-threads
Then I ran the Python script again:
import sys
import os
import subprocess
import threading
def copy_out(source, dest):
    for line in iter(source.readline, b''):
        dest.write(line)

num_threads=4

args = ["perl", "tokenizer.perl",
        "-l", "en",
        "-threads", str(num_threads)
        ]

tokenizer = subprocess.Popen(
    args,
    bufsize=-1,  # use the default buffer size (8192 bytes)
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL)

tokenizer_thread = threading.Thread(
    target=copy_out, args=(tokenizer.stdout, open("outfile", "wb")))
tokenizer_thread.start()

num_lines = 100000

for _ in range(num_lines):
    tokenizer.stdin.write(b'Random line.\n')

tokenizer.stdin.close()
tokenizer_thread.join()
tokenizer.wait()
and it now ran to the end with no errors and produced the output file outfile with 100000 lines.
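As a side note, the same thread-support check can be run from Python before launching the tokenizer, so that a perl built without ithreads fails fast with a clear message instead of a broken pipe thousands of lines later. This is only a sketch; it assumes the perl you want to test is the one found on PATH, and the helper name perl_has_ithreads is made up for illustration:
import subprocess

def perl_has_ithreads(perl="perl"):
    # Run the same one-liner as above: it prints "Threads supported"
    # only when perl was built with interpreter threads (ithreads).
    result = subprocess.run(
        [perl, "-MConfig", "-E", 'say "Threads supported" if $Config{useithreads}'],
        capture_output=True, text=True)
    return "Threads supported" in result.stdout

if not perl_has_ithreads():
    raise RuntimeError("this perl has no thread support; "
                       "tokenizer.perl -threads > 1 will fail")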
Answer 2:
What is causing this error in the first place?
Writing to a closed pipe causes the OS to send SIGPIPE to the process calling write. This allows programs to work as generators. For example, the following won't run forever despite containing an infinite loop, because head will exit and close its STDIN after reading ten lines, leading to perl receiving a SIGPIPE.
perl -le'1 while print ++$i;' | head
If the SIGPIPE signal is being ignored, the write system call will return EPIPE (Broken pipe) instead. The following won't run forever either, because print returns the error EPIPE once head exits.
perl -le'$SIG{PIPE}="IGNORE"; 1 while print ++$i;' | head
From the fact that your Python program received an EPIPE error, we deduce two facts:
- The Python program ignores SIGPIPE signals (a quick way to verify this is sketched below), and
- All handles to the reader end of the pipe were closed.
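The first fact is not something you have to set up yourself: CPython installs a SIG_IGN handler for SIGPIPE at interpreter startup, which is why the failed write surfaces as BrokenPipeError (errno EPIPE) rather than the Python process being killed. A minimal check, assuming a POSIX platform:
import signal

# CPython starts with SIGPIPE ignored, so writes to a closed pipe raise
# BrokenPipeError (EPIPE) instead of silently killing the process.
print(signal.getsignal(signal.SIGPIPE) is signal.SIG_IGN)  # prints True on POSIX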
So we must ask ourselves: why would the Perl program close its STDIN? It's very unlikely that its STDIN was closed explicitly. By far the most likely explanation is that the child process was terminated.
"Who" is to blame for it: OS / environment, my Python code, the Perl code?
That depends on what caused the Perl program to exit. The first thing to do is figure out what exit status was returned by the child process. Depending on the exit status, we'll know whether
- the process was killed by a signal,
- the process exited with an error, or
- the process completed successfully.
If the exit code tells us the process was killed by a signal, the exit code will also tell us by which signal. This could give us some information. (This would be the hardest of the three scenarios to debug.)
If the exit code tells us the process returned an error, the error code itself might not contain any additional useful information, but an error message was surely sent to the child's STDERR to provide more information.
If the exit code tells us the process completed successfully, perhaps the arguments or input you are providing don't mean what you think they mean.
So make sure to call tokenizer.wait() to collect the exit status and store it in tokenizer.returncode. Also make sure to log what is being sent to STDERR.
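A minimal sketch of that advice (not the asker's original code): while debugging, keep stderr as a pipe instead of sending it to devnull, catch the BrokenPipeError, and then report the exit status together with whatever the child wrote to STDERR. The helper name write_lines_and_report is hypothetical:
import subprocess

def write_lines_and_report(args, lines):
    # Discard stdout for brevity, but keep stderr as a pipe so the child's
    # error message is not lost while debugging.
    proc = subprocess.Popen(args, stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    try:
        for line in lines:
            proc.stdin.write(line)
    except BrokenPipeError:
        pass  # the child already exited; find out why below
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass
    err = proc.stderr.read()   # whatever the child sent to STDERR
    proc.wait()                # fills in proc.returncode
    if proc.returncode != 0:
        # A negative returncode means the child was killed by that signal;
        # a positive one means it exited with an error.
        raise RuntimeError("tokenizer exited with %d: %s"
                           % (proc.returncode, err.decode(errors="replace")))

# Hypothetical usage with the arguments from the question:
# write_lines_and_report(["perl", "tokenizer.perl", "-l", "en", "-threads", "4"],
#                        (b"Random line.\n" for _ in range(100000)))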
Source: https://stackoverflow.com/questions/61343709/multithreaded-perl-script-leads-to-broken-pipe-if-called-as-a-python-subprocess