Fast Linux file count for a large number of files

I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).

When the

17 answers

猫巷女王i

2020-12-22 17:45
I found that, with a huge amount of data, writing the listing to a file first was faster than piping the commands together in memory. So I saved the result to a file and counted its lines afterwards:
```
ls -1 /path/to/dir > count.txt && wc -l < count.txt
```

感动是毒

2020-12-22 17:47
find, ls, and perl, tested against about 40,000 files, all ran at the same speed (though I didn't try clearing the cache first):
```
[user@server logs]$ time find . | wc -l
42917

real    0m0.054s
user    0m0.018s
sys     0m0.040s

[user@server logs]$ time /bin/ls -f | wc -l
42918

real    0m0.059s
user    0m0.027s
sys     0m0.037s
```
And with Perl's opendir and readdir, the same time:
```
[user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
42918

real    0m0.057s
user    0m0.024s
sys     0m0.033s
```
Note: I called /bin/ls directly to bypass any alias, whose options might slow it down a little, and added -f to avoid sorting the file names. With -f, ls takes about the same time as find or perl; without -f, it is about twice as slow:
```
[user@server logs]$ time /bin/ls . | wc -l
42916

real    0m0.109s
user    0m0.070s
sys     0m0.044s
```
I would also like to have a script that asks the file system directly, without all the unnecessary information.

The tests were based on the answers of Peter van der Heijden, glenn jackman, and mark4o.
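The three totals above differ slightly because find . also prints the starting directory itself, while ls -f lists both . and .. as entries. As a rough sketch of a count that leaves those out, the function below emits one byte per entry and counts the bytes, so a file name containing a newline is not counted twice. The name count_entries is hypothetical, and -printf assumes GNU find:

```shell
#!/usr/bin/env bash

# count_entries DIR: print the number of entries directly inside DIR,
# excluding "." and "..". Hypothetical helper; -printf requires GNU find.
count_entries() {
    local dir=${1:-.}
    # -mindepth 1 skips DIR itself, -maxdepth 1 prevents recursion.
    # One "." is printed per entry and wc -c counts those bytes.
    find "$dir" -mindepth 1 -maxdepth 1 -printf '.' | wc -c
}
```

Calling count_entries /path/to/dir prints a single number; hidden files are included in it.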

情书的邮戳

2020-12-22 17:48
I prefer the following command to keep track of the changes in the number of files in a directory.
```
watch -d -n 0.01 'ls | wc -l'
```
The command keeps a window open and tracks the number of files in the directory, refreshing every 0.1 seconds (watch does not allow intervals shorter than 0.1 seconds, so a smaller value such as 0.01 is raised to that minimum).

北海茫月

2020-12-22 17:56
I came here when trying to count the files in a data set of approximately 10,000 folders with approximately 10,000 files each. The problem with many of the approaches is that they implicitly stat 100 million files, which takes ages.

I took the liberty of extending the approach by Christopher Schultz so that it accepts directories as arguments (his recursive approach uses stat as well).

Put the following into file dircnt_args.c:
```
#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count;
    long countsum = 0;
    int i;

    for(i=1; i < argc; i++) {
        dir = opendir(argv[i]);
        if(dir == NULL) {
            /* Not a directory or not readable: report it and move on. */
            perror(argv[i]);
            continue;
        }

        /* Count every entry readdir returns, including "." and "..". */
        count = 0;
        while((ent = readdir(dir)))
            ++count;

        closedir(dir);

        printf("%s contains %ld files\n", argv[i], count);
        countsum += count;
    }
    printf("sum: %ld\n", countsum);

    return 0;
}
```
After compiling it with gcc -o dircnt_args dircnt_args.c, you can invoke it like this:
```
./dircnt_args /your/directory/*
```
On 100 million files in 10,000 folders, the above completes quite quickly (approximately 5 minutes for the first run, and approximately 23 seconds on follow-up runs with a warm cache).

The only other approach that finished in less than an hour was ls, which took about 1 minute with a warm cache: ls -f /your/directory/* | wc -l. The count is off by a couple of lines per directory, though...
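The extra lines come from how ls handles several directory arguments: it prints a header line for each directory and a blank line between listings, and with -f every listing also contains . and ... A minimal sketch of one way around that is to list each directory separately and subtract the two dot entries. The name count_per_dir is hypothetical, the sketch assumes the file system reports . and .. for every directory, and a file name containing a newline would still be counted more than once:

```shell
#!/usr/bin/env bash

# count_per_dir DIR...: hypothetical helper that prints the entry count of
# each directory given, followed by the total.
count_per_dir() {
    local d n total=0
    for d in "$@"; do
        # Skip arguments that are not directories.
        [ -d "$d" ] || continue
        # With a single directory argument ls prints no header line,
        # so the only extra lines are "." and "..".
        n=$(ls -f -- "$d" | wc -l)
        n=$((n - 2))
        printf '%s contains %d entries\n' "$d" "$n"
        total=$((total + n))
    done
    printf 'sum: %d\n' "$total"
}
```

It would be called in the same way as the C program, for example count_per_dir /your/directory/*, at the cost of starting one ls process per directory.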

Contrary to my expectations, none of my attempts with find returned within an hour :-/

长情又很酷

2020-12-22 17:58
You can change the output based on your requirements, but here is a Bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.
```
dir=/tmp/count_these/ ; for i in $(ls -1 "${dir}" | sort -n) ; do echo "$i => $(find "${dir}${i}" -type f | wc -l),"; done
```
This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command, such as adding further tests, let you narrow down which kinds of files are counted.

It results in something like this:
```
1 => 38,
65 => 95052,
66 => 12823,
67 => 10572,
69 => 67275,
70 => 8105,
71 => 42052,
72 => 1184,
```
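Because the one-liner splits the output of ls on whitespace, a directory name containing a space would break it. As a hedged sketch of the same idea, the function below walks the subdirectories with a glob instead. The name count_files_by_subdir is hypothetical, -printf assumes GNU find, and unlike the one-liner it skips plain files at the top level and omits the trailing comma:

```shell
#!/usr/bin/env bash

# count_files_by_subdir BASE: hypothetical helper that prints "name => count"
# for every subdirectory of BASE, sorted numerically by name.
count_files_by_subdir() {
    local base=${1%/} sub n
    for sub in "$base"/*/; do
        # If the glob matched nothing, the pattern itself is left over.
        [ -d "$sub" ] || continue
        sub=${sub%/}
        # Recursively count regular files: one byte per file, then count bytes.
        n=$(find "$sub" -type f -printf '.' | wc -c)
        printf '%s => %d\n' "${sub##*/}" "$n"
    done | sort -n
}
```

For example, count_files_by_subdir /tmp/count_these/ would print one line per subdirectory in the "name => count" style shown above.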

孤城傲影

2020-12-22 17:58

You can get a count of files and directories with the tree program.

Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively; hidden files are excluded unless you add the -a flag. For reference, tree took 4.8 seconds on my computer to count my whole home directory, which contained 24,777 directories and 238,680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.

If the directory has no subfolders, tree is also a quick and easy way to count just the files in it.

Also, and purely for the fun of it, you can use tree | grep -E '^(├|└)' to show only the files and folders in the current directory (the last top-level entry is drawn with └ rather than ├). This is basically a much slower version of ls.
