Question
I am running an awk script that I want to parallelize with GNU parallel.
The script demultiplexes a single input file into multiple output files depending on the value of the first field of each line. The code is the following:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "\t" }
{
    # bc is the field that determines which file
    # the line will be written to
    bc = $1
    # append the line to that file
    print >> (bc ".txt")
}
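For reference, the plain serial run (a single awk process, so no races, but also no parallelism) would be:
# serial baseline; 'bigfile' stands in for the input file used below
awk -f script.awk bigfile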
I want to parallelize it with GNU parallel as follows:
parallel --line-buffer --block 1G --pipe 'awk -f script.awk'
However, I am afraid of race conditions in which two awk processes write to the same file at the same time. Is this possible, and if so, how can it be avoided without compromising parallelization?
NB. I included the --line-buffer option, although I'm not sure whether it also applies to file redirection within the awk script. Does it apply in this case too, or only to the stdout of each awk process?
Example
# Input file
bc1 line1
bc3 line2
bc1 line3
bc2 line4
# Output file bc1.txt
bc1 line1
bc1 line3
# Output file bc2.txt
bc2 line4
# Output file bc3.txt
bc3 line2
Answer 1:
You can do it by demultiplexing the output into different dirs:
stuff |
parallel --block 10M --pipe --round-robin \
'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk'
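Each jobslot then writes into its own directory (dir-1, dir-2, ...), so no two awk processes ever open the same path. A quick sanity check on the partial outputs, assuming a barcode bc1 as in the example:
# each slot dir holds its own partial copy of the output files
ls dir-*/
# see how the bc1 lines were spread across the slots
wc -l dir-*/bc1.txt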
Or, if the input is a file, you can use --pipepart, which is faster:
parallel --block -1 --pipepart -a bigfile \
'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk'
Then there is no race condition. Finish up by merging the dirs:
parallel 'cd {}; ls' ::: dir-* | sort -u |
parallel 'cat */{} > {}'
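If you prefer not to nest parallel for the merge, a plain-shell sketch of the same step (assuming GNU find and output files ending in .txt) is:
# collect every distinct basename across the slot dirs, then
# concatenate the per-slot pieces of each into one file
find dir-* -type f -name '*.txt' -printf '%f\n' | sort -u |
while IFS= read -r name; do
    find dir-* -type f -name "$name" -exec cat {} + > "$name"
done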
If merging is not acceptable (maybe you do not have disk space for 2 copies of the data), you can use fifos. But to do that you need to know the names of all the .txt files in advance, and you need a system that can run one process per name in parallel (10000 names = 10000 processes):
# Generate names-of-files.txt somehow
# Make fifos for all names in all slots
parallel 'mkdir -p dir-{2}; mkfifo dir-{2}/{1}' :::: \
names-of-files.txt <(seq $(parallel --number-of-threads))
# Run the demultiplexer in the background
parallel --block -1 --pipepart -a bigfile \
'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk' &
# Start one process per name
# If you have more than 32000 names, you will need to increase the number
# of processes on your system.
cat names-of-files.txt |
parallel -j0 --pipe -N250 -I ,, parallel -j0 'parcat */{} > {}'
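The listing leaves generating names-of-files.txt open. Since each output name is just the first tab-separated column plus ".txt", one hypothetical way to build it is a pre-pass over the input:
# derive the output file names from the first column;
# costs one extra sequential read of the input
cut -f1 bigfile | sort -u | sed 's/$/.txt/' > names-of-files.txt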
Source: https://stackoverflow.com/questions/52878292/is-it-possible-to-parallelize-awk-writing-to-multiple-files-through-gnu-parallel