How to write multicore sorting using GNU Parallel

问题

GNU Parallel

GNU parallel is a shell tool for executing jobs in parallel using one or more computers

For example, if I want to write a multicore version of wc I could do:

cat XXX | parallel --block 10M --pipe wc -l | awk 'BEGIN{count=0;}{count = count+ $1;} END{print count;}'

My question is how to do sorting using parallel? I know what I should do is pipe the result of parallel to a "merge sorted files" command(just like the final merge in merge sort), but I don't know how to do that.

回答1:

There's a few ways to do this.

Let's get a simple text file to play with:

$ curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt 2>/dev/null |
   tr " " "\n" | tr "[A-Z]" "[a-z]" | 
   sed -e 's/[[:punct:]]*//g' -e '/^[[:space:]]*$/d' > moby-dick-words.txt

$ wc moby-dick-words.txt

215117 moby-dick-words.txt
$ time sort moby-dick-words.txt > moby-dick-words-sorted.txt

real    0m0.260s
user    0m0.462s
sys 0m0.004s

We can do the sorting on chunks of the text, say 10000 words at a time, and defer some of the hard, serial work to the merging (sort -m) part:

$ mkdir tmp
$ time (
  cd tmp;
  split -l 1000 ../moby-dick-words.txt;
  parallel sort {} -o {}.sorted ::: x*;
  sort -m *.sorted > ../moby-dick-words-sorted-merge.txt;
  rm x* )

real    0m0.787s
user    0m0.495s
sys 0m0.103s

$ diff moby-dick-words-sorted.txt moby-dick-words-sorted-merge.txt 

$ uniq -c moby-dick-sorted-merge.txt | tail      
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

So this splits the text into sequential 10000-line chunks, uses parallel to sort each chunk, and then uses sort -m to merge the sorted chunks into a complete sort.

The next approach would be to do the hard work at the split stage, rather than the merge stage, so that the partial results can be merged together by a simple cat:

  $ rm tmp/*
  $ letters="a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9"
  $ time (
    cd tmp; 
    parallel sed -e "/^{}/w{}.txt" ../moby-dick-words.txt ::: $letters >& /dev/null;
    parallel sort {}.txt -o {}.sorted.txt ::: $letters;  
    cat *.sorted.txt > ../moby-dick-words-sorted-split.txt;
    rm *.txt )

  real  0m1.015s
  user  0m2.355s
  sys   0m0.510s
  $ diff moby-dick-words-sorted-split.txt moby-dick-words-sorted.txt
  $ uniq -c moby-dick-words-sorted-split.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

Here we (in parallel) split the file by the first character of the line; sort those files individually; and then the merge is a simple concatenate.

Note that this really for entertainment/educational purposes only; later versions of gnu sort have parallelism built in (look at the --parallel option) which will do a much better job than this. And a slicker version of the of the merge approach can be seen in this answer.

来源：https://stackoverflow.com/questions/27970953/how-to-write-multicore-sorting-using-gnu-parallel

标签

multithreading

bash

sorting

parallel-processing

gnu-parallel