I have a list of about 1,000 file names to search for under a directory and its subdirectories. There are hundreds of subdirectories containing more than 1,000,000 files. The following ...
Use xargs(1); it can be a bit faster than a while loop in bash. Like this:
xargs -a filelist.txt -I filename find /dir -name filename
Be careful if the file names in filelist.txt contain whitespace; read the second paragraph in the DESCRIPTION section of the xargs(1) manpage about this problem.
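With GNU xargs, one way to reduce the whitespace problem (a sketch, assuming the GNU-only -a and -d options are available) is to make the newline character the only delimiter:
xargs -a filelist.txt -d '\n' -I filename find /dir -name filename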
An improvement, based on some assumptions: for example, a.txt is in filelist.txt, and you can be sure there is only one a.txt in /dir. Then you can tell find(1) to exit early when it finds that instance:
xargs -a filelist.txt -I filename find /dir -name filename -print -quit
Another solution: you can pre-process filelist.txt into a find(1) argument list like the one below. This reduces the number of find(1) invocations:
find /dir -name 'a.txt' -or -name 'b.txt' -or -name 'c.txt'
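If you don't want to build that list by hand, here is a minimal sketch that generates it with sed, assuming the file names contain no whitespace, quotes or glob characters (the shell's word splitting and expansion would otherwise mangle them):
find /dir $(sed 's/^/-name /; $!s/$/ -or/' filelist.txt)
The sed command prepends -name to every line and appends -or to every line but the last, so the substitution expands to the same argument list as the example above.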
If filelist.txt is a plain list:
$ find /dir | grep -F -f filelist.txt
If filelist.txt is a pattern list:
$ find /dir | grep -f filelist.txt
If filelist.txt has a single filename per line:
find /dir | grep -f <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)
(The -f option means that grep searches for all the patterns in the given file.)
Explanation of <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):
The <( ... ) is called a process substitution, and is a little similar to $( ... ). The situation is equivalent to (but using the process substitution is neater and possibly a little faster):
sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt
The call to sed runs the commands s@^@/@, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.
s@^@/@ means put a / before each filename. (The ^ means "start of line" in a regex.)
s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line".) The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.
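For instance, you can see the effect of the anchors like this (the sample paths are made up for illustration):
$ printf '%s\n' ./a.txt ./a.txt.backup ./abba.txt | grep '/a\.txt$'
./a.txt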
s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of ., [, ] or *. Grep uses regexes and those characters are considered special, but we want them to be plain, so we need to escape them (if we didn't escape them, then a file name like a.txt would match files like abtxt).
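A quick illustration of the difference (abtxt is just a made-up string):
$ echo abtxt | grep 'a.txt'     # unescaped: . matches any character
abtxt
$ echo abtxt | grep 'a\.txt'    # escaped: . matches only a literal dot, so no output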
As an example:
$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile
$ sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$
Grep then uses each line of that output as a pattern when it is searching the output of find.
I'm not entirely sure of the question here, but I came to this page after trying to find a way to discover which 4 of 13,000 files had failed to copy. Neither of the answers did it for me, so I did this:
cp file-list file-list2
find dir/ >> file-list2
sort file-list2 | uniq -u
This resulted in a list of the 4 files I needed.
The idea is to combine the two file lists to determine the unique entries. sort is used to make duplicate entries adjacent to each other, which is the only way uniq will filter them out.
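A variant of the same idea using comm(1), sketched under the assumption that the entries in file-list are written in exactly the form that find prints (comm requires both inputs to be sorted):
find dir/ | sort > found-list
sort file-list > wanted-list
comm -23 wanted-list found-list   # keep only lines unique to wanted-list: the files that were never copied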