Shell: find files in a list under a directory

伪装坚强ぢ 2020-12-03 08:26

I have a list containing about 1000 file names to search for under a directory and its subdirectories. There are hundreds of subdirs with more than 1,000,000 files. The bash while loop I have been using, which runs find once per file name, takes far too long.

4 answers
  • 2020-12-03 08:59

    Using xargs(1) instead of a bash while loop can be a bit faster.

    Like this:

    xargs -a filelist.txt -I filename find /dir -name filename
    

    Be careful if the file names in filelist.txt contain whitespace; read the second paragraph of the DESCRIPTION section of the xargs(1) manpage about this problem.
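
    With GNU xargs you can sidestep those quoting rules by treating each input line as one literal item; a minimal sketch, assuming GNU xargs and one file name per line:

    xargs -a filelist.txt -d '\n' -I{} find /dir -name {}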

    An improvement based on some assumptions: for example, a.txt is in filelist.txt, and you can be sure there is only one a.txt under /dir. Then you can tell find(1) to exit early once it finds that instance:

    xargs -a filelist.txt -I filename find /dir -name filename -print -quit
    

    Another solution: you can pre-process filelist.txt into a single find(1) argument list, which reduces the whole job to one find(1) invocation:

    find /dir -name 'a.txt' -or -name 'b.txt' -or -name 'c.txt'
    
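    Generating that argument list can be scripted; a minimal sketch in bash, assuming a non-empty filelist.txt with one plain file name per line:

    # Build: find /dir ( -name 'a.txt' -o -name 'b.txt' ... )
    args=()
    while IFS= read -r name; do
        args+=(-o -name "$name")
    done < filelist.txt
    find /dir \( "${args[@]:1}" \)   # slice off the leading -o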
  • 2020-12-03 09:14

    If filelist.txt is a plain list of file names:

    $ find /dir | grep -F -f filelist.txt
    

    If filelist.txt is a list of regular expression patterns:

    $ find /dir | grep -f filelist.txt
    
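    Note that grep matches anywhere in the path, so a.txt would also select /dir/data.txt. If you need exact basename matches instead, one possible sketch using POSIX awk (the trailing - makes awk read find's output from stdin):

    find /dir | awk -F/ 'NR==FNR { want[$0]; next } $NF in want' filelist.txt -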
  • 2020-12-03 09:17

    If filelist.txt has a single filename per line:

    find /dir | grep -f <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)
    

    (The -f option tells grep to read its patterns from the given file.)

    Explanation of <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):

    The <( ... ) is called a process substitution (a bash/ksh/zsh feature) and is a little similar to $( ... ). The situation is equivalent to the following, but using the process substitution is neater and possibly a little faster:

    sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
    find /dir | grep -f processed_filelist.txt
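
    As a standalone illustration of process substitution (a.txt and b.txt are hypothetical input files), this compares two sorted streams without creating any temporary files:

    diff <(sort a.txt) <(sort b.txt)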
    

    The call to sed runs the commands s@^@/@, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.

    • s@^@/@ means put a / before each filename. (The ^ means "start of line" in a regex)
    • s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line").

    The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.

    s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of . [ ] or *. grep treats those characters as special in a regex, but we want them matched literally, so we need to escape them (if we didn't escape them, then a file name like a.txt would also match files like abtxt).

    As an example:

    $ cat filelist.txt
    file1.txt
    file2.txt
    blah[2012].txt
    blah[2011].txt
    lastfile
    
    $ sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
    /file1\.txt$
    /file2\.txt$
    /blah\[2012\]\.txt$
    /blah\[2011\]\.txt$
    /lastfile$
    

    Grep then uses each line of that output as a pattern when it is searching the output of find.

  • 2020-12-03 09:24

    I'm not entirely sure of the question here, but I came to this page after trying to find a way to discover which 4 of 13000 files had failed to copy.

    None of the answers did it for me, so I did this:

    cp file-list file-list2
    find dir/ >> file-list2
    sort file-list2 | uniq -u
    

    Which resulted in a list of the 4 files I needed.

    The idea is to combine the two file lists to find the entries unique to either one; for this to work, the entries in file-list must use the same path format that find prints. sort is used to make duplicate entries adjacent to each other, which is the only way uniq will filter them out.
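
    A related sketch using comm(1), assuming both lists use the same path format (comm requires sorted input); this prints only the entries missing from find's output:

    sort file-list > want.sorted
    find dir/ | sort > have.sorted
    comm -23 want.sorted have.sorted   # only in file-list: the files that were not copied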
