Question
GNU Parallel
GNU parallel is a shell tool for executing jobs in parallel using one or more computers.
For example, to write a multicore version of wc, I could do:
cat XXX | parallel --block 10M --pipe wc -l | awk 'BEGIN{count=0;}{count = count+ $1;} END{print count;}'
My question is: how do I sort using parallel? I know I should pipe the output of parallel to a command that merges sorted files (like the final merge in merge sort), but I don't know how to do that.
Answer 1:
There are a few ways to do this.
Let's get a simple text file to play with:
$ curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt 2>/dev/null |
tr " " "\n" | tr "[A-Z]" "[a-z]" |
sed -e 's/[[:punct:]]*//g' -e '/^[[:space:]]*$/d' > moby-dick-words.txt
$ wc -l moby-dick-words.txt
215117 moby-dick-words.txt
$ time sort moby-dick-words.txt > moby-dick-words-sorted.txt
real 0m0.260s
user 0m0.462s
sys 0m0.004s
We can sort the text in chunks, say 1000 words (lines) at a time, and defer some of the hard, serial work to the merging (sort -m) part:
$ mkdir tmp
$ time (
cd tmp;
split -l 1000 ../moby-dick-words.txt;
parallel sort {} -o {}.sorted ::: x*;
sort -m *.sorted > ../moby-dick-words-sorted-merge.txt;
rm x* )
real 0m0.787s
user 0m0.495s
sys 0m0.103s
$ diff moby-dick-words-sorted.txt moby-dick-words-sorted-merge.txt
$ uniq -c moby-dick-words-sorted-merge.txt | tail
1 zeuglodon
1 zigzag
5 zodiac
1 zogranda
4 zone
1 zone
2 zoned
3 zones
2 zoology
1 zoroaster
So this splits the text into sequential 1000-line chunks, uses parallel to sort each chunk, and then uses sort -m to merge the sorted chunks into a complete sort.
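The same split, sort, and merge steps can be sketched as a self-contained script. This is only an illustration under a few assumptions: it uses plain shell background jobs and wait in place of GNU parallel so that it runs where parallel is not installed, and the sample input, the chunk. prefix, and the variable names are all made up for the example.

```shell
# Sketch: sort fixed-size chunks concurrently, then merge them with sort -m.
# Background jobs stand in for GNU parallel here (an assumption for portability).
export LC_ALL=C                       # byte-wise collation, reproducible ordering

workdir=$(mktemp -d)

# Hypothetical sample input: 5000 zero-padded numbers in descending order.
awk 'BEGIN { for (i = 5000; i >= 1; i--) printf "%05d\n", i }' > "$workdir/input.txt"

# Split into 1000-line chunks named chunk.aa, chunk.ab, ...
mkdir "$workdir/chunks"
( cd "$workdir/chunks" && split -l 1000 ../input.txt chunk. )

# Sort every chunk as a background job, then wait for all of them.
for f in "$workdir"/chunks/chunk.*; do
    sort -o "$f.sorted" "$f" &
done
wait

# Merge the already-sorted chunks; sort -m merges without re-sorting.
sort -m "$workdir"/chunks/*.sorted > "$workdir/merged.txt"

# Compare against a plain single-process sort of the same input.
sort "$workdir/input.txt" > "$workdir/reference.txt"
merged_lines=$(wc -l < "$workdir/merged.txt" | tr -d ' ')
if cmp -s "$workdir/merged.txt" "$workdir/reference.txt"; then
    merge_status=match
else
    merge_status=differ
fi
echo "merged $merged_lines lines: $merge_status"

rm -rf "$workdir"
```

Unlike parallel, the loop starts one job per chunk with no limit on concurrency, so for many chunks a job limiter such as parallel is the better fit.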
The next approach would be to do the hard work at the split stage, rather than the merge stage, so that the partial results can be merged together by a simple cat:
$ rm tmp/*
$ letters="a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9"
$ time (
cd tmp;
parallel sed -e "/^{}/w{}.txt" ../moby-dick-words.txt ::: $letters >& /dev/null;
parallel sort {}.txt -o {}.sorted.txt ::: $letters;
cat *.sorted.txt > ../moby-dick-words-sorted-split.txt;
rm *.txt )
real 0m1.015s
user 0m2.355s
sys 0m0.510s
$ diff moby-dick-words-sorted-split.txt moby-dick-words-sorted.txt
$ uniq -c moby-dick-words-sorted-split.txt | tail
1 zeuglodon
1 zigzag
5 zodiac
1 zogranda
4 zone
1 zone
2 zoned
3 zones
2 zoology
1 zoroaster
Here we (in parallel) split the file by the first character of the line; sort those files individually; and then the merge is a simple concatenate.
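A small sketch of the same partition-then-concatenate idea follows. It is not the method used above: it assumes a single awk pass to do the partitioning instead of one sed pass per letter, uses background jobs instead of parallel, and runs on a made-up word list. It also assumes every line starts with a letter or digit, as in the cleaned word file. LC_ALL=C is set so that the order in which the shell expands *.txt agrees with the order sort uses; without that agreement, a plain cat would not yield a sorted file.

```shell
# Sketch: partition lines by first character, sort each part, concatenate.
export LC_ALL=C                       # glob order and sort order must agree

workdir=$(mktemp -d)

# Hypothetical sample input (lowercase words and digits, no punctuation).
printf '%s\n' pear apple 42 banana avocado cherry 7up plum blueberry \
    > "$workdir/input.txt"

# One pass over the input: write each line to <first character>.txt.
mkdir "$workdir/parts"
awk -v dir="$workdir/parts" \
    '{ print > (dir "/" substr($0, 1, 1) ".txt") }' "$workdir/input.txt"

# Sort each part in place as a background job (sort -o may overwrite its input).
for f in "$workdir"/parts/*.txt; do
    sort -o "$f" "$f" &
done
wait

# The parts cover disjoint, ordered key ranges, so cat is the whole merge step.
cat "$workdir"/parts/*.txt > "$workdir/sorted-split.txt"

# Compare against a plain sort of the same input.
sort "$workdir/input.txt" > "$workdir/reference.txt"
first_line=$(head -n 1 "$workdir/sorted-split.txt")
if cmp -s "$workdir/sorted-split.txt" "$workdir/reference.txt"; then
    split_status=match
else
    split_status=differ
fi
echo "first line: $first_line, result: $split_status"

rm -rf "$workdir"
```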
Note that this is really for entertainment/educational purposes only; later versions of GNU sort have parallelism built in (look at the --parallel option), which will do a much better job than this. A slicker version of the merge approach can be seen in this answer.
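For comparison, here is a minimal sketch of calling the --parallel option mentioned above. It assumes a GNU sort recent enough to accept --parallel=N and falls back to a plain sort otherwise; the thread count of 2 and the generated input are arbitrary choices for the example.

```shell
# Sketch: let sort do its own multithreading via --parallel (GNU coreutils).
export LC_ALL=C

input=$(mktemp)
output=$(mktemp)

# Hypothetical sample input: 2000 zero-padded numbers in descending order.
awk 'BEGIN { for (i = 2000; i >= 1; i--) printf "%05d\n", i }' > "$input"

# Probe whether this sort accepts --parallel; fall back to a plain sort if not.
if sort --parallel=2 /dev/null > /dev/null 2>&1; then
    sort --parallel=2 -o "$output" "$input"
    used_parallel=yes
else
    sort -o "$output" "$input"
    used_parallel=no
fi

first_sorted=$(head -n 1 "$output")
last_sorted=$(tail -n 1 "$output")
echo "parallel option used: $used_parallel; range $first_sorted..$last_sorted"

rm -f "$input" "$output"
```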
Source: https://stackoverflow.com/questions/27970953/how-to-write-multicore-sorting-using-gnu-parallel