I\'m writing a program that needs to search a directory and all its sub directories for files that have a certain extension. This is going to be used both on a local, and a
For file and directory search purpose I would want to offer use multithreading .NET library that possess a wide search opportunities. All information about library you can find on GitHub: https://github.com/VladPVS/FastSearchLibrary
If you want to download it you can do it here: https://github.com/VladPVS/FastSearchLibrary/releases
Works really fast. Check it yourself!
If you have any questions please ask them.
It is one demonstrative example how you can use it:
class Searcher
{
private static object locker = new object();
private FileSearcher searcher;
List<FileInfo> files;
public Searcher()
{
files = new List<FileInfo>(); // create list that will contain search result
}
public void Startsearch()
{
CancellationTokenSource tokenSource = new CancellationTokenSource();
// create tokenSource to get stop search process possibility
searcher = new FileSearcher(@"C:\", (f) =>
{
return Regex.IsMatch(f.Name, @".*[Dd]ragon.*.jpg$");
}, tokenSource); // give tokenSource in constructor
searcher.FilesFound += (sender, arg) => // subscribe on FilesFound event
{
lock (locker) // using a lock is obligatorily
{
arg.Files.ForEach((f) =>
{
files.Add(f); // add the next part of the received files to the results list
Console.WriteLine($"File location: {f.FullName}, \nCreation.Time: {f.CreationTime}");
});
if (files.Count >= 10) // one can choose any stopping condition
searcher.StopSearch();
}
};
searcher.SearchCompleted += (sender, arg) => // subscribe on SearchCompleted event
{
if (arg.IsCanceled) // check whether StopSearch() called
Console.WriteLine("Search stopped.");
else
Console.WriteLine("Search completed.");
Console.WriteLine($"Quantity of files: {files.Count}"); // show amount of finding files
};
searcher.StartSearchAsync();
// start search process as an asynchronous operation that doesn't block the called thread
}
}
It's another example:
***
List<string> folders = new List<string>
{
@"C:\Users\Public",
@"C:\Windows\System32",
@"D:\Program Files",
@"D:\Program Files (x86)"
}; // list of search directories
List<string> keywords = new List<string> { "word1", "word2", "word3" }; // list of search keywords
FileSearcherMultiple multipleSearcher = new FileSearcherMultiple(folders, (f) =>
{
if (f.CreationTime >= new DateTime(2015, 3, 15) &&
(f.Extension == ".cs" || f.Extension == ".sln"))
foreach (var keyword in keywords)
if (f.Name.Contains(keyword))
return true;
return false;
}, tokenSource, ExecuteHandlers.InCurrentTask, true);
***
Try this iterator block version that avoids recursion and the Info
objects:
public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
Queue<string> pending = new Queue<string>();
pending.Enqueue(rootFolderPath);
string[] tmp;
while (pending.Count > 0)
{
rootFolderPath = pending.Dequeue();
try
{
tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
}
catch (UnauthorizedAccessException)
{
continue;
}
for (int i = 0; i < tmp.Length; i++)
{
yield return tmp[i];
}
tmp = Directory.GetDirectories(rootFolderPath);
for (int i = 0; i < tmp.Length; i++)
{
pending.Enqueue(tmp[i]);
}
}
}
Note also that 4.0 has inbuilt iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; less arrays)
The BCL methods are portable so to speak. If staying 100% managed I believe the best you can do is calling GetDirectories/Folders while checking access rights (or possibly not checking the rights and having another thread ready to go when the first one takes a little too long - a sign that it's about to throw UnauthorizedAccess exception - this might be avoidable with exception filters using VB or as of today unreleased c#).
If you want faster than GetDirectories you have to call win32 (findsomethingEx etc) which provides specific flags that allow ignore possibly unnecessary IO while traversing the MFT structures. Also if the drive is a network share, there can be a great speedup by similar approach but this time avoiding also excessive network roundtrips.
Now if you have admin and use ntfs and are in a real hurry with millions of files to go through, the absolute fastest way to go through them (assuming spinning rust where the disk latency kills) is use of both mft and journaling in combination, essentially replacing the indexing service with one that's targeted for your specific need. If you only need to find filenames and not sizes (or sizes too but then you must cache them and use journal to notice changes), this approach could allow for practically instant search of tens of millions of files and folders if implemented ideally. There may be one or two paywares that have bothered with this. There's samples of both MFT (DiscUtils) and journal reading (google) in C# around. I only have about 5 million files and just using NTFSSearch is good enough for that amount as it takes about 10-20 seconds to search them. With journal reading added it would go down to <3 seconds for that amount.
I recently (2020) discovered this post because of a need to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.
This solution does not return FileInfo objects, but could be modified to do so—or potentially only return the relevant data needed from a custom FileInfo object.
This solution utilizes the Win32 API and .NET's Parallel.ForEach() to leverage the threadpool to maximize performance.
P/Invoke:
/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr FindFirstFile(
string lpFileName,
ref WIN32_FIND_DATA lpFindFileData
);
/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindNextFile(
IntPtr hFindFile,
ref WIN32_FIND_DATA lpFindFileData
);
/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
IntPtr hFindFile
);
Method:
public static Tuple<long, long> CountFilesDirectories(
string path,
CancellationToken token
)
{
if (String.IsNullOrWhiteSpace(path))
throw new ArgumentNullException("path", "The provided path is NULL or empty.");
// If the provided path doesn't end in a backslash, append one.
if (path.Last() != '\\')
path += '\\';
IntPtr hFile = IntPtr.Zero;
Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();
long files = 0;
long dirs = 0;
try
{
hFile = Win32.Kernel32.FindFirstFile(
path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
ref fd
);
// If we encounter an error, or there are no files/directories, we return no entries.
if (hFile.ToInt64() == -1)
return Tuple.Create<long, long>(0, 0);
//
// Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
//
List<string> directories = new List<string>();
do
{
// If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
// count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
(fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
fd.cFileName != "." && fd.cFileName != "..")
{
directories.Add(System.IO.Path.Combine(path, fd.cFileName));
dirs++;
}
// Otherwise, if this is a file ("archive"), increment the file count.
else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
{
files++;
}
}
while (Win32.Kernel32.FindNextFile(hFile, ref fd));
// Iterate over each discovered directory in parallel to maximize file/directory counting performance,
// calling itself recursively to traverse each directory completely.
Parallel.ForEach(
directories,
new ParallelOptions()
{
CancellationToken = token
},
directory =>
{
var count = CountFilesDirectories(
directory,
token
);
lock (directories)
{
files += count.Item1;
dirs += count.Item2;
}
});
}
catch (Exception)
{
// Handle as desired.
}
finally
{
if (hFile.ToInt64() != 0)
Win32.Kernel32.FindClose(hFile);
}
return Tuple.Create<long, long>(files, dirs);
}
On my local system, the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found that this is tremendously faster—45 minutes vs. 90 seconds to access a remote directory of ~40k files, ~40 GB in size.
This can also fairly easily be modified to include other data, like the total file size of all files counted, or rapidly recursing through and deleting empty directories, starting at the furthest branch.