Question
I am running an awk script that I want to parallelize with GNU parallel.
The script demultiplexes a single input file into multiple output files depending on the value of the first field of each line. The code is the following:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "\t" }
{
    # bc is the field that determines which file
    # the line will be written to
    bc = $1
    # append the line to that file
    print >> (bc ".txt")
}
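For reference, the plain serial run (a single awk process, so no races, but also no parallelism) would be:
# serial baseline; 'bigfile' stands in for the input file used below
awk -f script.awk bigfile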
I want to parallelize it with GNU parallel as follows:
parallel --line-buffer --block 1G --pipe 'awk -f script.awk'
However, I am afraid of race conditions in which two awk processes write to the same file at the same time. Is this possible, and if so, how can it be avoided without compromising parallelization?
NB. I included the --line-buffer option, although I'm not sure whether it also applies to file redirection within the awk script. Does it apply in this case too, or only to the stdout of each awk process?
Example
# Input file
bc1 line1
bc3 line2
bc1 line3
bc2 line4
# Output file bc1.txt
bc1 line1
bc1 line3
# Output file bc2.txt
bc2 line4
# Output file bc3.txt
bc3 line2
Answer 1:
You can do it by demultiplexing the output into different dirs:
stuff |
parallel --block 10M --pipe --round-robin \
'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk'
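Each jobslot then writes into its own directory (dir-1, dir-2, ...), so no two awk processes ever open the same path. A quick sanity check on the partial outputs, assuming a barcode bc1 as in the example:
# each slot dir holds its own partial copy of the output files
ls dir-*/
# see how the bc1 lines were spread across the slots
wc -l dir-*/bc1.txt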
Or, if the input is a file, you can use --pipepart, which is faster:
parallel --block -1 --pipepart -a bigfile \
'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk'
Then there is no race condition. Finish up by merging the dirs:
parallel 'cd {}; ls' ::: dir-* | sort -u |
parallel 'cat */{} > {}'
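If you prefer not to nest parallel for the merge, a plain-shell sketch of the same step (assuming GNU find and output files ending in .txt) is:
# collect every distinct basename across the slot dirs, then
# concatenate the per-slot pieces of each into one file
find dir-* -type f -name '*.txt' -printf '%f\n' | sort -u |
while IFS= read -r name; do
    find dir-* -type f -name "$name" -exec cat {} + > "$name"
done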
If merging is not acceptable (maybe you do not have disk space for 2 copies of the data), you can use fifos. But to do that you need to know the names of all the .txt files in advance, and you need a system that can run one process per name in parallel (10000 names = 10000 processes):
# Generate names-of-files.txt somehow
# Make fifos for all names in all slots
parallel 'mkdir -p dir-{2}; mkfifo dir-{2}/{1}' :::: \
names-of-files.txt <(seq $(parallel --number-of-threads))
# Run the demultiplexer in the background
parallel --block -1 --pipepart -a bigfile \
'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk' &
# Start one process per name
# If you have more than 32000 names, you will need to increase the number
# of processes on your system.
cat names-of-files.txt |
parallel -j0 --pipe -N250 -I ,, parallel -j0 'parcat */{} > {}'
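The listing leaves generating names-of-files.txt open. Since each output name is just the first tab-separated column plus ".txt", one hypothetical way to build it is a pre-pass over the input:
# derive the output file names from the first column;
# costs one extra sequential read of the input
cut -f1 bigfile | sort -u | sed 's/$/.txt/' > names-of-files.txt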
Source: https://stackoverflow.com/questions/52878292/is-it-possible-to-parallelize-awk-writing-to-multiple-files-through-gnu-parallel