Run a specifiable number of commands in parallel - contrasting xargs -P, GNU parallel, and “moreutils” parallel

Posted by 邮差的信 on 2019-12-04 15:05:10
mklement0

You can use xargs's -P option to run a specifiable number of invocations in parallel:

Note that the -P option is not mandated by POSIX, but both GNU xargs and BSD/macOS xargs support it.

xargs -P 3 -n 1 mongodump -h <<<'staging production web more stuff and so on'

This runs mongodump -h staging, mongodump -h production, and mongodump -h web in parallel, waits for all 3 calls to finish, then continues with mongodump -h more, mongodump -h stuff, and mongodump -h and, and so on.

-n 1 passes a single argument from the input stream to each invocation; adjust as needed, single- or double-quoting arguments in the input if necessary.
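To see how -n groups the input into per-invocation argument lists, here's a minimal sketch that uses echo as a stand-in for mongodump (and omits -P, so the output order is deterministic):

```shell
# echo stands in for the real command; -n 2 passes 2 arguments per invocation.
printf '%s\n' staging production web more | xargs -n 2 echo got:
# got: staging production
# got: web more
```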

Note: GNU xargs - but not BSD xargs - supports -P 0, where 0 means: "run as many processes as possible simultaneously."
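For instance, with GNU xargs (echo again standing in for the real command; the output order varies between runs, since all invocations run at once):

```shell
# GNU xargs only: -P 0 places no limit on the number of parallel processes.
printf '%s\n' staging production web | xargs -P 0 -n 1 echo
```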

By default, the arguments supplied via stdin are appended to the specified command.
If you need to control where the respective arguments are placed in the resulting commands:

  • provide the arguments line by line, and
  • use -I {} to request line-based processing and to define {} as the placeholder for each input line.
xargs -P 3 -I {} mongodump -h {} after <<<$'staging\nproduction\nweb\nmore\nstuff'

Now each input argument is substituted for {}, allowing argument after to come after it.

Note, however, that each input line is invariably passed as a single argument.
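To demonstrate: with -I, even a line with embedded spaces arrives as a single argument (printf '[%s]\n' brackets each argument it receives):

```shell
# Each input LINE is a single argument - the space in 'two words' is preserved.
printf '%s\n' 'two words' single | xargs -I {} printf '[%s]\n' {}
# [two words]
# [single]
```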

BSD/macOS xargs would allow you to combine -n with -J {}, without needing to provide line-based input, but GNU xargs doesn't support -J.
In short: only BSD/macOS allows you to combine placement of the input arguments with reading multiple arguments at once.

Note that xargs does not serialize stdout output from commands in parallel, so that output from parallel processes can arrive interleaved.
Use GNU parallel to avoid this problem - see below.


Alternative: parallel

xargs has the advantage of being a standard utility, so on platforms where it supports -P, there are no prerequisites.

In the Linux world (though also on macOS via Homebrew) there are two purpose-built utilities for running commands in parallel, which, unfortunately, share the same name; typically, you must install them on demand:

  • parallel (a binary) from the moreutils package - see its home page.

  • The - much more powerful - GNU parallel (a Perl script) from the parallel package - see its home page. (Thanks, twalberg.)

If you already have a parallel utility, parallel --version will tell you which one it is (GNU parallel reports a version number and copyright information, "moreutils" parallel complains about an invalid option and shows a syntax summary).
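That check is easy to script - a rough sketch, relying on the fact that only GNU parallel exits successfully on --version:

```shell
# GNU parallel exits with status 0 on --version;
# "moreutils" parallel rejects the option and exits nonzero.
if parallel --version >/dev/null 2>&1; then
  echo "GNU parallel"
else
  echo "moreutils parallel (or no parallel at all)"
fi
```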

Using the "moreutils" parallel:

parallel -j 3 -n 1 mongodump -h -- staging production web more stuff and so on

# Using -i to control placement of the argument, via {}
# Only *1* argument at a time is supported in that case.
parallel -j 3 -i mongodump -h {} after -- staging production web more stuff and so on

Unlike xargs, this parallel implementation doesn't take the arguments to pass through from stdin; all pass-through arguments must be passed on the command line, following --.

From what I can tell, the only features this parallel implementation offers beyond what xargs can do are:

  • The -l option allows delaying further invocations until the system load average is below the specified threshold.
  • Possibly this (from the man page): "stdout and stderr is serialised through a corresponding internal pipe, in order to prevent annoying concurrent output behaviour.", though I've found this not to be the case in the version whose man page is dated 2009-07-02 - see the last section.

Using GNU parallel:

Tip of the hat to Ole Tange for his help.

parallel -P 3 -n 1 mongodump -h <<<$'staging\nproduction\nweb\nmore\nstuff\nand\nso\non'

# Alternative, using ::: followed by the target-command arguments.
parallel -P 3 -n 1 mongodump -h ::: staging production web more stuff and so on 

# Using -n 1 and {} to control placement of the argument.
# Note that using -N rather than -n would allow per-argument placement control
# with {1}, {2}, ...
parallel -P 3 -n 1 mongodump -h {} after <<<$'staging\nproduction\nweb\nmore\nstuff\nand'
  • As with xargs, pass-through arguments are supplied via stdin, but GNU parallel also supports placing them on the command line, after a configurable separator (::: by default).

  • Unlike with xargs, each input line is considered a single argument.

  • Caveat: If your command involves quoted strings, you must use -q to pass them through as distinct arguments; e.g., parallel -q sh -c 'echo hi, $0' ::: there only works with -q.

  • As with GNU xargs, you can use -P 0 to run as many invocations as possible at once, taking full advantage of the machine's capabilities, meaning, according to Ole, "until GNU Parallel hits a limit (file handles and processes)".

    • Conveniently, omitting -P doesn't just run one process at a time, as the other utilities do, but runs one process per CPU core.
  • Output from commands being executed in parallel is by default automatically serialized (grouped) on a per-process basis, to avoid interleaved output.

    • This is generally desirable, but note that it means that you'll only start to see the other commands' output once the first one that has created output has terminated.
    • Use option --line-buffer (--lb in more recent versions) to opt out of this behavior or
      -u (--ungroup) to allow even a single output line to mix output from different processes; see the manual for details.
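A small sketch to contrast the modes (assumes GNU parallel is installed; the sleep makes the difference observable):

```shell
# Default: output is grouped per process - 'start a' / 'end a' stay together.
parallel 'echo start {}; sleep 0.2; echo end {}' ::: a b

# --line-buffer: whole lines from different processes may interleave.
parallel --line-buffer 'echo start {}; sleep 0.2; echo end {}' ::: a b
```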

GNU parallel, which is designed to be a better successor to xargs, offers many more features: a notable example is the ability to perform sophisticated transformations on the pass-through arguments, optionally based on Perl regular expressions; see also: man parallel and man parallel_tutorial.


Optional reading: testing output serialization behavior

The following commands test how xargs and the two parallel implementations deal with interleaved output from commands being run in parallel - whether they show output as it arrives, or try to serialize it:

There are 2 levels of serialization, both of which introduce overhead:

  • Line-level serialization: Prevent partial lines from different processes from being mixed on a single output line.

  • Process-level serialization: Ensure that all output lines from a given process are grouped together.
    This is the most user-friendly method, but note that it means that you'll only start to see the other commands' output (in sequence) once the first one that has created output has terminated.

From what I can tell, only GNU parallel offers any serialization (despite what the "moreutils" parallel man page dated 2009-07-02 says[1]), and it supports both methods.

The commands below assume the existence of executable script ./tst with the following content:

#!/usr/bin/env bash

printf "$$: [1/2] entering with arg(s): $*"
sleep $(( $RANDOM / 16384 ))
printf " $$: [2/2] finished entering\n"
echo "  $$: stderr line" >&2
echo "$$: stdout line"
sleep $(( $RANDOM / 8192 ))
echo "    $$: exiting"

xargs (both the GNU and BSD/macOS implementations, as found on Ubuntu 16.04 and macOS 10.12):

No serialization happens: a single output line can contain output from multiple processes.

$ xargs -P 3 -n 1 ./tst <<<'one two three'
2593: [1/2] entering with arg(s): one2594: [1/2] entering with arg(s): two 2593: [2/2] finished entering
  2593: stderr line
2593: stdout line
2596: [1/2] entering with arg(s): three   2593: exiting
 2594: [2/2] finished entering
  2594: stderr line
2594: stdout line
 2596: [2/2] finished entering
  2596: stderr line
2596: stdout line
   2594: exiting
   2596: exiting

"moreutils" parallel (version whose man page is dated 2009-07-02)

No serialization happens: a single output line can contain output from multiple processes.

$ parallel -j 3 ./tst -- one two three
3940: [1/2] entering with arg(s): one3941: [1/2] entering with arg(s): two3942: [1/2] entering with arg(s): three 3941: [2/2] finished entering
  3941: stderr line
3941: stdout line
 3942: [2/2] finished entering
  3942: stderr line
3942: stdout line
 3940: [2/2] finished entering
  3940: stderr line
3940: stdout line
   3941: exiting
   3942: exiting

GNU parallel (version 20170122)

Process-level serialization (grouping) happens by default. Use --line-buffer (--lb in newer versions) to choose line-level serialization instead, or opt out of any kind of serialization with -u
(--ungroup).

Note how, in each group, stderr output comes after stdout output (whereas the man page that comes with version 20170122 claims that stderr output comes first).

$ parallel -P 3 ./tst ::: one two three
2544: [1/2] entering with arg(s): one 2544: [2/2] finished entering
2544: stdout line
   2544: exiting
  2544: stderr line
2549: [1/2] entering with arg(s): three 2549: [2/2] finished entering
2549: stdout line
   2549: exiting
  2549: stderr line
2546: [1/2] entering with arg(s): two 2546: [2/2] finished entering
2546: stdout line
   2546: exiting
  2546: stderr line

[1] "stdout and stderr is serialised through a corresponding internal pipe, in order to prevent annoying concurrent output behaviour."
Do tell me if I'm missing something.

If you background only some of the commands - i.e., leave out every 3rd & (or use a ; instead, if it's all on one line) - the shell waits at each foreground command, so the whole thing does not run in parallel at once.

Eg:

echo "Hello" & sleep 1 ;
echo "Hello Again" & sleep 1 ;
echo "Once More" & sleep 1 ;