Is there a faster way than this to find all the files in a directory and all sub directories?

盖世英雄少女心 2020-11-29 17:49

I'm writing a program that needs to search a directory and all its sub directories for files that have a certain extension. This is going to be used both on a local and a network drive.

16 Answers
  • 2020-11-29 18:05

    It is horrible, and the reason file searching is so horrible on Windows platforms is that MS made a mistake that they seem unwilling to put right. You should be able to use SearchOption.AllDirectories and we would all get the speed we want. But you cannot do that, because GetDirectories needs a callback so that you can decide what to do about the directories you do not have access to. MS forgot, or did not think, to test the class on their own computers.

    So, we are all left with the nonsense recursive loops.

    Within C#/Managed C++ you have very few options; these are also the options that MS takes, because their coders haven't worked out how to get around it either.

    The main thing with display items, such as TreeViews and FileViews, is to search for and show only what the user can see. There are plenty of helpers on the controls, including triggers, that tell you when you need to fill in some data.

    In trees, start from collapsed mode and search a directory only as and when the user expands it in the tree; that is much faster than waiting for the whole tree to be filled. The same goes for FileViews: I tend towards a 10% rule. However many items fit in the display area, have another 10% ready in case the user scrolls; it stays nicely responsive.
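
    For what it's worth, a minimal sketch of that lazy-tree idea in WinForms might look like the following (the control name treeView1, the "..." placeholder text, and the helper name are illustrative assumptions, not from the original post; it needs System.IO and System.Windows.Forms, and the handler must be wired to the TreeView's BeforeExpand event):

    // Give each directory node a dummy child so the expand glyph appears,
    // then load the real subdirectories only when the user expands that node.
    // Seed with e.g. AddDirectoryNode(treeView1.Nodes, @"C:\").
    private void AddDirectoryNode(TreeNodeCollection parent, string path)
    {
        TreeNode node = parent.Add(Path.GetFileName(path));
        node.Tag = path;            // remember the full path for later
        node.Nodes.Add("...");      // placeholder child
    }

    private void treeView1_BeforeExpand(object sender, TreeViewCancelEventArgs e)
    {
        if (e.Node.Nodes.Count == 1 && e.Node.Nodes[0].Text == "...")
        {
            e.Node.Nodes.Clear();
            try
            {
                foreach (string dir in Directory.GetDirectories((string)e.Node.Tag))
                    AddDirectoryNode(e.Node.Nodes, dir);
            }
            catch (UnauthorizedAccessException)
            {
                // directory the user cannot open: leave the node empty
            }
        }
    }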

    MS do the pre-search and directory watching: a little database of directories and files. This means that when you open your trees etc. you have a good, fast starting point; it falls down a bit on refresh.

    But mix the two ideas: take your directories and files from the database, then do a refresh search as a tree node is expanded (just that node) and as a different directory is selected in the tree.

    But the better solution is to add your file-search system as a service. MS already have this, but as far as I know we do not get access to it; I suspect that is because it is immune to 'failed access to directory' errors. Just as with the MS one, if you have a service running at admin level, you need to be careful that you are not giving away your security just for the sake of a little extra speed.

  • 2020-11-29 18:06

    The short answer on how to improve the performance of that code is: you can't.

    The real performance hit you're experiencing is the actual latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is of course excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now.)

    Given the original constraints, there are several solutions already posted that more or less elegantly wrap the iteration process, reduce the number of objects created, and so on. (However, since I assume we are reading from a single hard drive, parallelism will NOT help traverse a directory tree more quickly, and may even increase that time, since you now have two or more threads fighting for data on different parts of the drive as it seeks back and forth.) Still, if we evaluate how the function will be consumed by the end developer, there are some optimizations and generalizations we can come up with.

    First, we can delay execution by returning an IEnumerable; yield return accomplishes this by compiling a state-machine enumerator into a compiler-generated class that implements IEnumerable and is returned when the method executes. Most methods in LINQ are written to delay execution until the iteration is performed, so the code in a Select or SelectMany will not run until the IEnumerable is iterated. The benefit of deferred execution is only felt if you need to take a subset of the data at a later time: for instance, if you only need the first 10 results, a query that could return several thousand results won't walk the entire result set until you ask for more than ten.

    Now, given that you want to do a subfolder search, I can also infer that it may be useful to specify the search depth. Doing so generalizes the problem, but it also necessitates a recursive solution. Then, later, when someone decides the search now needs to go two directories deep because we increased the number of files and added another layer of categorization, you can simply make a slight modification instead of rewriting the function.

    In light of all that, here is the solution I came up with; it is more general than some of the others above:

    public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
    {
        return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
    }

    // depth is how many directory levels to descend below 'directory';
    // 0 means search only that directory itself.
    public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
    {
        return depth == 0
            ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
            : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
                directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
    }
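
    Because the method returns an IEnumerable<FileInfo> built with Concat and SelectMany, the recursion into subdirectories is deferred. A hypothetical caller (the pattern and path are illustrative, not from the answer; it needs System.Linq) can therefore stop early:

    // Only descends into subdirectories if the root folder itself
    // does not already yield ten matching files.
    var firstTen = BetterFileList("*.log", @"C:\Logs").Take(10).ToList();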
    

    On a side note, something else that hasn't been mentioned by anyone so far is file permissions and security. Currently, there's no checking, handling, or permissions requests, and the code will throw file permission exceptions if it encounters a directory it doesn't have access to iterate through.
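
    One way to harden it, as a sketch rather than a definitive fix (the SafeFileList name is mine, and it assumes System.IO, System.Collections.Generic, and the same depth convention as above), is to catch the exceptions that GetFiles and GetDirectories throw for protected or vanished directories and simply skip those folders:

    public static IEnumerable<FileInfo> SafeFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
    {
        FileInfo[] files = new FileInfo[0];
        DirectoryInfo[] subDirs = new DirectoryInfo[0];
        try
        {
            files = directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
            if (depth > 0)
                subDirs = directory.GetDirectories();
        }
        catch (UnauthorizedAccessException) { /* no rights on this directory: skip it */ }
        catch (DirectoryNotFoundException) { /* vanished or broken junction: skip it */ }

        foreach (FileInfo file in files)
            yield return file;

        foreach (DirectoryInfo sub in subDirs)
            foreach (FileInfo file in SafeFileList(fileSearchPattern, sub, depth - 1))
                yield return file;
    }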

  • 2020-11-29 18:08

    Consider splitting the updated method into two iterators:

    private static IEnumerable<DirectoryInfo> GetDirs(string rootFolderPath)
    {
        DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
        yield return rootDir;

        foreach (DirectoryInfo di in rootDir.GetDirectories("*", SearchOption.AllDirectories))
        {
            yield return di;
        }
    }

    public static IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
    {
        var allDirs = GetDirs(rootFolderPath);
        foreach (DirectoryInfo di in allDirs)
        {
            var files = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
            foreach (FileInfo fi in files)
            {
                yield return fi;
            }
        }
    }
    

    Also, further to the network-specific scenario, if you were able to install a small service on that server that you could call into from a client machine, you'd get much closer to your "local folder" results, because the search could execute on the server and just return the results to you. This would be your biggest speed boost in the network folder scenario, but may not be available in your situation. I've been using a file synchronization program that includes this option -- once I installed the service on my server the program became WAY faster at identifying the files that were new, deleted, and out-of-sync.

  • 2020-11-29 18:12

    I needed to get all files from my C: partition, so I combined Marc's and Jaider's answers and got a function with no recursion and with parallel programming; it processed about 370k files in 30 seconds. Maybe this will help someone:

    IEnumerable<string> DirSearch(string path)
    {
        // Breadth-first traversal: directories still waiting to be scanned.
        ConcurrentQueue<string> pendingQueue = new ConcurrentQueue<string>();
        pendingQueue.Enqueue(path);

        ConcurrentBag<string> filesNames = new ConcurrentBag<string>();
        while (pendingQueue.TryDequeue(out path))
        {
            try
            {
                var files = Directory.GetFiles(path);
                Parallel.ForEach(files, x => filesNames.Add(x));

                var directories = Directory.GetDirectories(path);
                Parallel.ForEach(directories, x => pendingQueue.Enqueue(x));
            }
            catch (Exception)
            {
                // e.g. access denied or a path that vanished: skip it and carry on
                continue;
            }
        }

        return filesNames;
    }
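
    With the collected names returned from the method, a hypothetical call is simply (the drive root is illustrative):

    var allFiles = DirSearch(@"C:\");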
    
  • 2020-11-29 18:13

    In .NET Core you can do something like the code below. It can search all subdirectories recursively with good performance while ignoring paths without access. I also tried the other methods found in

    https://www.codeproject.com/Articles/1383832/System-IO-Directory-Alternative-using-WinAPI

    public static IEnumerable<string> ListFiles(string baseDir)
    {
        EnumerationOptions opt = new EnumerationOptions();
        opt.RecurseSubdirectories = true;
        opt.ReturnSpecialDirectories = false;   // leave out "." and ".."
        // The default skips Hidden and System entries; 0 means skip nothing.
        //opt.AttributesToSkip = FileAttributes.Hidden | FileAttributes.System;
        opt.AttributesToSkip = 0;
        opt.IgnoreInaccessible = true;          // silently skip folders we cannot open

        var tmp = Directory.EnumerateFileSystemEntries(baseDir, "*", opt);
        return tmp;
    }
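
    Since the question is specifically about files with a certain extension, the same options also work with Directory.EnumerateFiles and a search pattern. A small variation on the code above (the method name and example pattern are illustrative, not from the answer):

    public static IEnumerable<string> ListFilesByExtension(string baseDir, string pattern)
    {
        var opt = new EnumerationOptions();
        opt.RecurseSubdirectories = true;
        opt.IgnoreInaccessible = true;   // skip directories we cannot open
        opt.AttributesToSkip = 0;        // include hidden and system files as well

        // e.g. ListFilesByExtension(@"C:\data", "*.xml")
        return Directory.EnumerateFiles(baseDir, pattern, opt);
    }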
    
  • 2020-11-29 18:14

    DirectoryInfo seems to give much more information than you need; try piping a dir command and parsing the info from that.
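
    A rough sketch of that idea (the /s /b switches, the quoting, and the method name are my own assumptions, not from the answer; it needs System.Diagnostics and System.Collections.Generic, is Windows-only, and is more fragile than the managed APIs):

    // Shell out to "dir /s /b", which prints one full path per line,
    // and stream those lines back instead of building FileInfo objects.
    static IEnumerable<string> DirCommandSearch(string rootFolder, string pattern)
    {
        var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /b \"" + rootFolder + "\\" + pattern + "\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process proc = Process.Start(psi))
        {
            string line;
            while ((line = proc.StandardOutput.ReadLine()) != null)
                yield return line;   // each line is already a full path
            proc.WaitForExit();
        }
    }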
