Parallel processing in awk?

Submitted by 故事扮演 on 2021-02-07 07:00:23

Question


Awk processes files line by line. Assuming the operation on each line has no dependency on other lines, is there any way to make awk process multiple lines at a time in parallel? Is there any other text processing tool that automatically exploits parallelism and processes the data more quickly?


Answer 1:


The only awk implementation that attempted to provide parallel processing was parallel-awk, but the project appears to be dead now.

Otherwise, one way to parallelize awk is to split your input into chunks and process them in parallel. However, splitting the input is itself single-threaded, which might defeat the performance goal: the standard split command cannot split at line boundaries without reading each and every line.

If you have GNU split available, or a version that supports the -n l/* option, here is one optimized way to process your file in parallel, assuming you have 8 vCPUs:

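inputfile=input.txt
outputfile=output.txt
script=script.awk
count=8

# Split into $count chunks without breaking lines (GNU split)
split -n l/$count "$inputfile" /tmp/_pawk$$
# Run one awk per chunk in the background
for file in /tmp/_pawk$$*; do
    awk -f "$script" "$file" > "${file}.out" &
done
wait
# Chunk suffixes sort in input order, so the output order is preserved
cat /tmp/_pawk$$*.out > "$outputfile"
rm /tmp/_pawk$$*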
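
Because the chunk outputs are concatenated in suffix order, the result should match a serial run as long as the awk script keeps no state across lines. One way to check this on a small sample is sketched below. The sample data, the doubling script, and the file names are made up for illustration, and GNU split is assumed:

```shell
# Sketch only: sample data and the awk program are illustrative assumptions.
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/input.txt"
printf '{ print $1 * 2 }\n' > "$workdir/script.awk"

# Serial reference run
awk -f "$workdir/script.awk" "$workdir/input.txt" > "$workdir/serial.out"

# Parallel run over 4 line-aligned chunks (requires GNU split)
split -n l/4 "$workdir/input.txt" "$workdir/chunk."
for chunk in "$workdir"/chunk.*; do
    awk -f "$workdir/script.awk" "$chunk" > "$chunk.out" &
done
wait
cat "$workdir"/chunk.*.out > "$workdir/parallel.out"

# cmp -s exits 0 only if both outputs are byte-identical
if cmp -s "$workdir/serial.out" "$workdir/parallel.out"; then
    result=identical
else
    result=different
fi
echo "$result"
rm -rf "$workdir"
```

If the script accumulates values or has an END block, the two outputs will generally differ, and the per-chunk results need a separate merge step.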



Answer 2:


You can use GNU Parallel for this purpose.

Suppose you are summing the numbers in a big file:

cat rands20M.txt | awk '{s+=$1} END {print s}'

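With GNU Parallel you can do it in multiple processes. Each awk prints the partial sum of its own block of input, and the final awk adds the partial sums together: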

cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'
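
The two-stage pattern (partial sums per block, then a final sum) can be tried out without GNU Parallel. The sketch below swaps in GNU split and background jobs to emulate the per-block awk processes; it is not what parallel --pipe does internally, and the sample data and file names are assumptions for illustration:

```shell
# Sketch only: emulates the per-block partial sums with GNU split instead of GNU Parallel.
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/nums.txt"

# Single-process reference sum
serial_sum=$(awk '{s+=$1} END {print s}' "$workdir/nums.txt")

# Stage 1: one awk per chunk, each printing its own partial sum
split -n l/4 "$workdir/nums.txt" "$workdir/part."
for part in "$workdir"/part.*; do
    awk '{s+=$1} END {print s}' "$part" > "$part.sum" &
done
wait

# Stage 2: a final awk adds the partial sums together
parallel_sum=$(cat "$workdir"/part.*.sum | awk '{s+=$1} END {print s}')

echo "serial=$serial_sum parallel=$parallel_sum"
rm -rf "$workdir"
```

This works for a sum because addition can be applied to partial results in any order; an operation such as a median would need a different merge step.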



Source: https://stackoverflow.com/questions/20308443/parallel-processing-in-awk
