Fast Linux file count for a large number of files

Asked by 名媛妹妹 on 2020-12-22 17:21 · 2,532 views

I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).

When the

17 Answers
  • 2020-12-22 18:04

    By default ls sorts the names, which can take a while if there are a lot of them. Also there will be no output until all of the names are read and sorted. Use the ls -f option to turn off sorting.

    ls -f | wc -l
    

    Note that this will also enable -a, so ., .., and other files starting with . will be counted.
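Since -f turns on -a, the dot entries are included in the total. One way to correct for that (a sketch, assuming a GNU userland where `ls -f` emits one name per line when piped) is simply to subtract the two dot entries:

```shell
# scratch directory with exactly three files, for illustration
dir=$(mktemp -d)
touch "$dir"/a "$dir"/b "$dir"/c

# ls -f includes "." and "..", so subtract those two entries
count=$(( $(ls -f "$dir" | wc -l) - 2 ))
echo "$count"
```

This prints 3 for the scratch directory above. Note this only corrects for `.` and `..`; other dotfiles are still counted.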

  • 2020-12-22 18:05

    The fastest way is a purpose-built program, like this:

    #include <stdio.h>
    #include <dirent.h>
    
    int main(int argc, char *argv[]) {
        DIR *dir;
        struct dirent *ent;
        long count = 0;
    
        if (argc < 2) {
            fprintf(stderr, "usage: %s <directory>\n", argv[0]);
            return 1;
        }
    
        dir = opendir(argv[1]);
        if (dir == NULL) {
            perror(argv[1]);
            return 1;
        }
    
        while ((ent = readdir(dir)))
            ++count;
    
        closedir(dir);
    
        printf("%s contains %ld files\n", argv[1], count);
    
        return 0;
    }
    

    In my testing, I ran each of these about 50 times against the same directory, over and over, so the cache was warm for all of them and cache effects wouldn't skew the comparison. I got roughly the following performance numbers (in wall-clock time):

    ls -1  | wc - 0:01.67
    ls -f1 | wc - 0:00.14
    find   | wc - 0:00.22
    dircnt | wc - 0:00.04
    

    That last one, dircnt, is the program compiled from the above source.

    EDIT 2016-09-26

    Due to popular demand, I've rewritten this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.

    Since it's clear some folks want to know how to do all this, I have added a lot of comments to the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

    As you can see, it's much more complicated than the original, and necessarily so: at least one function must exist to be called recursively, unless you want the code to become very complex (e.g. managing an explicit subdirectory stack and processing it in a single loop). Since we have to check file types, differences between operating systems and standard libraries come into play, so I have written a program that tries to be usable on any system where it will compile.

    There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (if you aren't lucky enough to have a system where dirent already contains the file type). I'm not paranoid about checking the total length of the subdir pathnames, but in theory the system shouldn't allow any path name longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.

    #include <stdio.h>
    #include <dirent.h>
    #include <string.h>
    #include <stdlib.h>
    #include <limits.h>
    #include <sys/stat.h>
    
    #if defined(WIN32) || defined(_WIN32) 
    #define PATH_SEPARATOR '\\' 
    #else
    #define PATH_SEPARATOR '/' 
    #endif
    
    /* A custom structure to hold separate file and directory counts */
    struct filecount {
      long dirs;
      long files;
    };
    
    /*
     * counts the number of files and directories in the specified directory.
     *
     * path - relative pathname of a directory whose files should be counted
     * counts - pointer to struct containing file/dir counts
     */
    void count(char *path, struct filecount *counts) {
        DIR *dir;                /* dir structure we are reading */
        struct dirent *ent;      /* directory entry currently being processed */
        char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
        /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
    #if !defined ( _DIRENT_HAVE_D_TYPE )
        struct stat statbuf;     /* buffer for stat() info */
    #endif
    
    /* fprintf(stderr, "Opening dir %s\n", path); */
        dir = opendir(path);
    
        /* opendir failed... file likely doesn't exist or isn't a directory */
        if(NULL == dir) {
            perror(path);
            return;
        }
    
        while((ent = readdir(dir))) {
      if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
          fprintf(stdout, "path too long (%zu) %s%c%s\n", (strlen(path) + 1 + strlen(ent->d_name)), path, PATH_SEPARATOR, ent->d_name);
          return;
      }
    
    /* Use dirent.d_type if present, otherwise use stat() */
    #if defined ( _DIRENT_HAVE_D_TYPE )
    /* fprintf(stderr, "Using dirent.d_type\n"); */
          if(DT_DIR == ent->d_type) {
    #else
    /* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
          sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
          if(lstat(subpath, &statbuf)) {
              perror(subpath);
              return;
          }
    
          if(S_ISDIR(statbuf.st_mode)) {
    #endif
              /* Skip "." and ".." directory entries... they are not "real" directories */
              if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
    /*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
              } else {
                  sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
                  counts->dirs++;
                  count(subpath, counts);
              }
          } else {
              counts->files++;
          }
        }
    
    /* fprintf(stderr, "Closing dir %s\n", path); */
        closedir(dir);
    }
    
    int main(int argc, char *argv[]) {
        struct filecount counts;
        counts.files = 0;
        counts.dirs = 0;
        count(argv[1], &counts);
    
        /* If we found nothing, this is probably an error which has already been printed */
        if(0 < counts.files || 0 < counts.dirs) {
            printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
        }
    
        return 0;
    }
    

    EDIT 2017-01-17

    I've incorporated two changes suggested by @FlyingCodeMonkey:

    1. Use lstat instead of stat. This will change the behavior of the program if you have symlinked directories in the directory you are scanning. The previous behavior was that the (linked) subdirectory would have its file count added to the overall count; the new behavior is that the linked directory will count as a single file, and its contents will not be counted.
    2. If the path of a file is too long, an error message will be emitted and the program will halt.

    EDIT 2017-06-29

    With any luck, this will be the last edit of this answer :)

    I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), plus it makes it easier for anyone to suggest a modification by submitting a pull-request from GitHub.

    The source is available under Apache License 2.0. Patches* welcome!


    * "patch" is what old people like me call a "pull request".
  • 2020-12-22 18:05

    You should use getdents in place of ls/find.

    Here is a very good article that describes the getdents approach:

    http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

    Here is the extract:

    ls and practically every other method of listing a directory (including Python's os.listdir and find .) rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (e.g., 500 million directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() system call directly, rather than helper methods from the C standard library.

    We can find the C code to list the files using getdents() from here:

    There are two modifications you will need to make in order to quickly list all the files in a directory.

    First, increase the buffer size from X to something like 5 megabytes.

    #define BUF_SIZE 1024*1024*5
    

    Then modify the main loop where it prints out the information about each file in the directory to skip entries with inode == 0. I did this by adding

    if (dp->d_ino != 0) printf(...);
    

    In my case I only cared about the file names in the directory, so I also rewrote the printf() statement to print just the filename.

    if (d->d_ino) printf("%s\n", (char *) d->d_name);
    

    Compile it (it doesn't need any external libraries, so it's super simple to do)

    gcc listdir.c -o listdir
    

    Now just run

    ./listdir [directory with an insane number of files]
    
  • 2020-12-22 18:07

    This approach is faster than almost everything else on this page for very large, deeply nested directories:

    https://serverfault.com/a/691372/84703

    locate -r '.' | grep -c "^$PWD"

    Note that locate reads from a prebuilt index (refreshed by updatedb), so the count can be stale: files created since the last index update will be missed.

  • 2020-12-22 18:08

    You could test whether using opendir() and readdir() in Perl is faster. For an example of those functions, look here.
