Question
I would like to benefit from the full potential of the parallel
command on macOS (it seems there exist two versions, GNU Parallel and Ole Tange's version, but I am not sure).
With the following command:
parallel -j8 find {} ::: *
I will get a big performance gain if I am in a directory containing 8 subdirectories. But if all of these subdirectories have little content except for one, only a single thread will end up working, on the one "big" directory.
Is there a way to continue the parallelization inside this "big" directory? I mean, can the one remaining thread be helped by the other threads (the ones that already finished the small subdirectories)?
The ideal case would be for the parallel command to "switch automatically" once all the small subdirectories have been processed by the find command in the command line above. Maybe I am asking too much?

Another potential optimization, if it exists, for a common directory tree structure: is there a way, similar to for example the command make -j8, to assign each thread to a sub-(sub-(sub- ...)) directory, and once that directory has been explored (remember, I mostly want to use this optimization with the find command), have the thread move on to another unexplored sub-(sub-(sub- ...)) directory? Of course, the total number of running threads would never be greater than the number specified with the parallel command (parallel -j8 in my example above): if the number of tree nodes (1 node = 1 directory) is greater than the number of threads, we cannot exceed that number.

I know that parallelizing in a recursive context is tricky, but maybe I can gain a significant factor when I want to find a file in a big tree structure?
That's why I took the example of make -j8: I don't know how it is coded, but it makes me think we could do the same with the parallel/find command line at the beginning of my post.

Finally, I would like your advice on these two questions and, more generally, on what is and is not currently possible among these optimization suggestions, in order to find a file more quickly with the classical find command.
UPDATE 1: As @OleTange said, I don't know a priori the structure of the directories that I want gupdatedb to index, so it is difficult to know the maxdepth in advance. Your solution is interesting, but the first execution of find is not multithreaded; you don't use the parallel command there. I am a little surprised that a multithreaded version of gupdatedb does not exist: on paper it is feasible, but once you try to code it in the GNU gupdatedb script on macOS 10.15, it is more difficult.
If someone has other suggestions, I will take them!
Answer 1:
If you are going to parallelize find, you need to be sure that your disk can deliver the data fast enough.
For magnetic drives you will rarely see a speedup. For RAID, network drives, and SSDs you sometimes will, and for NVMe you often will.
The simplest way to parallelize find is to use */*:
parallel find ::: */*
Or */*/*:
parallel find ::: */*/*
These will search in the sub-sub dirs and in the sub-sub-sub dirs, respectively.
They will not search the top dirs, but that can be done by running a single additional find with the appropriate -maxdepth.
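Put together, the two pieces could look like the sketch below. The demo tree under /tmp is hypothetical, created only so the commands are self-contained; the parallel invocation is shown as a comment since it is the command from above.

```shell
# Hypothetical demo tree so the commands below have something to search.
mkdir -p /tmp/pfind-demo/a/x /tmp/pfind-demo/b/y
touch /tmp/pfind-demo/a/x/f1 /tmp/pfind-demo/b/y/f2
cd /tmp/pfind-demo

# Each parallel job searches the full subtree below one sub-sub dir:
#   parallel find ::: */*
# The top two levels, which those jobs are not started in, can be
# covered by one ordinary find limited with -maxdepth:
find . -mindepth 1 -maxdepth 2
```

Note that each parallel job prints its own starting directory (find prints its starting point), and the -maxdepth pass prints those same sub-sub dirs, so the combined output can contain those paths twice.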
The above solution assumes you know something about the directory structure, so it is not a general solution.
I have never heard of a general solution. It would involve a breadth first search that would start some workers in parallel. I can see how it could be programmed, but I have never seen it.
If I were to implement it, it would be something like this (lightly tested):
#!/bin/bash
# Breadth-first traversal: each round expands one level of the tree in parallel.
# mktemp is used instead of tempfile, which does not exist on macOS.
tmp=$(mktemp)

myfind() {
    # List the direct children (one level only) of "$1".
    find "$1" -mindepth 1 -maxdepth 1
}
export -f myfind

# Level 1: children of the current directory.
myfind . | tee "$tmp"

# While the previous level produced entries, expand them all in parallel.
while [ -s "$tmp" ] ; do
    tmp2=$(mktemp)
    parallel --lb myfind < "$tmp" | tee "$tmp2"
    mv "$tmp2" "$tmp"
done
rm "$tmp"
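For comparison, the same level-by-level order can be sketched sequentially, without GNU Parallel, which makes the breadth-first mechanism explicit. This is only an illustration; the /tmp demo tree and the output file name are hypothetical.

```shell
# Sequential breadth-first walk using a temp file as the level queue.
# It illustrates the order that the script above processes in parallel.
mkdir -p /tmp/bfs-demo/a/x /tmp/bfs-demo/b

queue=$(mktemp)                 # directories making up the current level
results=/tmp/bfs-demo.out       # everything found, in breadth-first order
: > "$results"
echo /tmp/bfs-demo > "$queue"

while [ -s "$queue" ]; do
  nextq=$(mktemp)
  while IFS= read -r dir; do
    # Expand one level: list the direct children of this directory.
    find "$dir" -mindepth 1 -maxdepth 1 |
      while IFS= read -r child; do
        echo "$child" >> "$results"
        [ -d "$child" ] && echo "$child" >> "$nextq"   # queue dirs only
      done
  done < "$queue"
  mv "$nextq" "$queue"          # the next level becomes the current one
done
rm "$queue"

cat "$results"
```

Because whole levels are emitted before any deeper level, all depth-1 entries appear before the single depth-2 entry in the demo output.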
(PS: I have reason to believe the parallel written by Ole Tange and GNU Parallel are one and the same).
Source: https://stackoverflow.com/questions/63332050/gnu-parallel-assign-one-thread-for-each-node-directories-and-sub-directories