I have a list of about 1,000 file names to search for under a directory and its subdirectories. There are hundreds of subdirectories containing more than 1,000,000 files. The following ...
Use xargs(1); it can be a bit faster than a while loop in bash. Like this:
xargs -a filelist.txt -I filename find /dir -name filename
Be careful if the file names in filelist.txt contain whitespace; read the second paragraph in the DESCRIPTION section of the xargs(1) manpage about this problem.
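With GNU xargs, one way to reduce the whitespace problem (a sketch, assuming the GNU-only -a and -d options are available) is to make the newline character the only delimiter:
xargs -a filelist.txt -d '\n' -I filename find /dir -name filename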
An improvement, based on some assumptions: for example, a.txt is in filelist.txt, and you can be sure there is only one a.txt in /dir. Then you can tell find(1) to exit early when it finds that instance:
xargs -a filelist.txt -I filename find /dir -name filename -print -quit
Another solution: you can pre-process filelist.txt into a find(1) argument list like the one below. This reduces the number of find(1) invocations:
find /dir -name 'a.txt' -or -name 'b.txt' -or -name 'c.txt'
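If you don't want to build that list by hand, here is a minimal sketch that generates it with sed, assuming the file names contain no whitespace, quotes or glob characters (the shell's word splitting and expansion would otherwise mangle them):
find /dir $(sed 's/^/-name /; $!s/$/ -or/' filelist.txt)
The sed command prepends -name to every line and appends -or to every line but the last, so the substitution expands to the same argument list as the example above.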
If filelist.txt is a plain list:
$ find /dir | grep -F -f filelist.txt
If filelist.txt is a pattern list:
$ find /dir | grep -f filelist.txt
If filelist.txt has a single filename per line:
find /dir | grep -f <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)
(The -f option means that grep searches for all the patterns in the given file.)
Explanation of <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):
The <( ... ) is called a process substitution, and is a little similar to $( ... ). The situation is equivalent to (but using the process substitution is neater and possibly a little faster):
sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt
The call to sed runs the commands s@^@/@, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.
s@^@/@ means put a / before each filename. (The ^ means "start of line" in a regex.)
s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line".) The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.
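For instance, you can see the effect of the anchors like this (the sample paths are made up for illustration):
$ printf '%s\n' ./a.txt ./a.txt.backup ./abba.txt | grep '/a\.txt$'
./a.txt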
s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of ., [, ] or *. Grep uses regexes and those characters are considered special, but we want them to be plain, so we need to escape them (if we didn't escape them, then a file name like a.txt would match files like abtxt).
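A quick illustration of the difference (abtxt is just a made-up string):
$ echo abtxt | grep 'a.txt'     # unescaped: . matches any character
abtxt
$ echo abtxt | grep 'a\.txt'    # escaped: . matches only a literal dot, so no output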
As an example:
$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile
$ sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$
Grep then uses each line of that output as a pattern when it is searching the output of find.
I'm not entirely sure of the question here, but I came to this page after trying to find a way to discover which 4 of 13,000 files had failed to copy. Neither of the answers did it for me, so I did this:
cp file-list file-list2
find dir/ >> file-list2
sort file-list2 | uniq -u
This resulted in a list of the 4 files I needed.
The idea is to combine the two file lists to determine the unique entries. sort is used to make duplicate entries adjacent to each other, which is the only way uniq will filter them out.
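A variant of the same idea using comm(1), sketched under the assumption that the entries in file-list are written in exactly the form that find prints (comm requires both inputs to be sorted):
find dir/ | sort > found-list
sort file-list > wanted-list
comm -23 wanted-list found-list   # keep only lines unique to wanted-list: the files that were never copied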