Fast Linux file count for a large number of files

名媛妹妹 2020-12-22 17:21

I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).

When there are that many files, performing ls | wc -l takes quite a long time to execute.

17 answers
  • 2020-12-22 17:45

    I realized that, when you have a huge amount of data, avoiding in-memory processing is faster than "piping" the commands together. So I saved the result to a file and analyzed it afterwards:

    ls -1 /path/to/dir > count.txt && cat count.txt | wc -l
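
    As a side note, the cat is not strictly needed here; redirecting wc's standard input reads the saved file directly, with one fewer process (and, as with any ls-based count, filenames containing embedded newlines will inflate the result):

    ls -1 /path/to/dir > count.txt && wc -l < count.txt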
    
  • 2020-12-22 17:47

    find, ls, and perl tested against 40,000 files have the same speed (though I didn't try to clear the cache; see the sketch at the end of this answer):

    [user@server logs]$ time find . | wc -l
    42917
    
    real    0m0.054s
    user    0m0.018s
    sys     0m0.040s
    
    [user@server logs]$ time /bin/ls -f | wc -l
    42918
    
    real    0m0.059s
    user    0m0.027s
    sys     0m0.037s
    

    And Perl's opendir and readdir take the same time:

    [user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
    42918
    
    real    0m0.057s
    user    0m0.024s
    sys     0m0.033s
    

    Note: I used /bin/ls to make sure I bypassed any alias, which might slow things down a little, and -f to avoid sorting the file names. ls without -f is twice as slow as find/perl, whereas ls with -f takes about the same time:

    [user@server logs]$ time /bin/ls . | wc -l
    42916
    
    real    0m0.109s
    user    0m0.070s
    sys     0m0.044s
    

    I would also like a script that asks the file system directly, without all the unnecessary information.

    The tests were based on the answers of Peter van der Heijden, glenn jackman, and mark4o.
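
    Since the timings above were all taken without clearing the cache, here is a minimal sketch of how one could repeat them cold-cache (assumes Linux and root privileges; writing 3 to drop_caches discards the page cache plus dentries and inodes):

    sync                                  # flush dirty pages first
    echo 3 > /proc/sys/vm/drop_caches     # drop page cache, dentries, and inodes
    time find . | wc -l                   # now measures cold-cache performance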

  • 2020-12-22 17:48

    I prefer the following command to keep track of the changes in the number of files in a directory.

    watch -d -n 0.01 'ls | wc -l'
    

    The command keeps a window open that tracks the number of files in the directory. Note that watch will not refresh faster than every 0.1 seconds, so the -n 0.01 above effectively gives a refresh rate of 0.1 seconds.
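
    If watch is not available, a plain shell loop is a rough substitute (a sketch; the fractional sleep assumes GNU coreutils):

    while true; do clear; ls | wc -l; sleep 0.1; done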

  • 2020-12-22 17:56

    I came here when trying to count the files in a data set of approximately 10,000 folders with approximately 10,000 files each. The problem with many of the approaches is that they implicitly stat 100 million files, which takes ages.

    I took the liberty of extending the approach by Christopher Schultz so that it supports passing directories via arguments (his recursive approach uses stat as well).

    Put the following into file dircnt_args.c:

    #include <stdio.h>
    #include <dirent.h>

    int main(int argc, char *argv[]) {
        DIR *dir;
        struct dirent *ent;
        long count;
        long countsum = 0;
        int i;

        /* Count the directory entries of every directory passed as an
         * argument. readdir() only reads the directory itself and never
         * stat()s the entries, which is what makes this fast. Note that
         * the count includes the . and .. entries. */
        for (i = 1; i < argc; i++) {
            dir = opendir(argv[i]);
            if (!dir) {
                perror(argv[i]);   /* skip unreadable or missing paths */
                continue;
            }

            count = 0;
            while ((ent = readdir(dir)))
                ++count;

            closedir(dir);

            printf("%s contains %ld files\n", argv[i], count);
            countsum += count;
        }
        printf("sum: %ld\n", countsum);

        return 0;
    }
    

    After compiling it with gcc -o dircnt_args dircnt_args.c, you can invoke it like this:

    dircnt_args /your/directory/*
    

    On 100 million files in 10,000 folders, the above completes quite quickly (approximately 5 minutes for the first run, and approximately 23 seconds for follow-up runs on a warm cache).

    The only other approach that finished in less than an hour was ls, at about 1 minute on a warm cache: ls -f /your/directory/* | wc -l. The count is off by a couple of lines per directory, though: given multiple directory arguments, ls prints a header and a blank separator line for each one, and -f also lists the . and .. entries.
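
    As a sketch of the same idea with that noise subtracted (paths hypothetical; running ls once per directory avoids the headers, and the awk step removes the . and .. entries from each count):

    for d in /your/directory/*/; do
        ls -f "$d" | wc -l
    done | awk '{ sum += $1 - 2 } END { print sum }'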

    Contrary to what I expected, none of my attempts with find returned within an hour :-/

  • 2020-12-22 17:58

    You can change the output based on your requirements, but here is a Bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.

    dir=/tmp/count_these/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$i => $(find ${dir}${i} -type f | wc -l),"; }
    

    This recursively looks for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command can make the kind of files you count more specific, etc. (A variant that avoids parsing ls output is sketched after the sample output below.)

    It results in something like this:

    1 => 38,
    65 => 95052,
    66 => 12823,
    67 => 10572,
    69 => 67275,
    70 => 8105,
    71 => 42052,
    72 => 1184,
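
    A variant of the same idea that relies on the shell's own globbing instead of parsing ls output (a sketch; glob expansion is lexical, so purely numeric directory names are no longer guaranteed to appear in numeric order):

    dir=/tmp/count_these/
    for i in "$dir"*/; do
        i=${i%/}                # strip the trailing slash from the glob match
        echo "${i##*/} => $(find "$i" -type f | wc -l),"
    done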
    
  • 2020-12-22 17:58

    You can get a count of files and directories with the tree program.

    Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be included with the flag -a. For reference, it took tree 4.8 seconds to count my whole home directory of 24,777 directories and 238,680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.
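
    For example, to include hidden files and keep only the summary line (path hypothetical):

    tree -a /path/to/dir | tail -n 1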

    And as long as you don't have any subfolders, tree's summary is also a quick and easy way to count the files in just the current directory.

    Also, and purely for the fun of it, you can use tree | grep '^├' to only show the files/folders in the current directory - this is basically a much slower version of ls.
