I have a base directory that contains several thousand folders. Inside each of these folders there can be between 1 and 20 subfolders, each containing between 1 and 10 files. I'd
A possibly faster alternative is to use the WinAPI FindNextFile function. There is an excellent Faster Directory Enumeration Tool for this, which can be used as follows:
HashSet<FileData> GetPast60(string dir)
{
    // Files last written before this moment count as "old".
    DateTime cutoff = DateTime.Now.AddDays(-60);
    HashSet<FileData> oldFiles = new HashSet<FileData>();
    FileData[] files = FastDirectoryEnumerator.GetFiles(dir);
    for (int i = 0; i < files.Length; i++)
    {
        if (files[i].LastWriteTime < cutoff)
        {
            oldFiles.Add(files[i]);
        }
    }
    return oldFiles;
}
So, based on the comments below, I decided to benchmark the solutions suggested here as well as others I could think of. Interestingly, EnumerateFiles seemed to outperform FindNextFile in C#, while EnumerateFiles with AsParallel was by far the fastest, followed, surprisingly, by the command prompt count. Note, however, that AsParallel wasn't getting the complete file count, or was missing some files counted by the others, so you could say the command prompt method is the best.
Applicable config and three screenshots of the results: [not preserved]
I have included my test code below:
static void Main(string[] args)
{
    Console.Title = "File Enumeration Performance Comparison";

    Stopwatch watch = new Stopwatch();
    watch.Start();
    var allfiles = GetPast60("C:\\Users\\UserName\\Documents");
    watch.Stop();
    Console.WriteLine("Total time to enumerate using WINAPI =" + watch.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles);

    Stopwatch watch1 = new Stopwatch();
    watch1.Start();
    var allfiles1 = GetPast60Enum("C:\\Users\\UserName\\Documents\\");
    watch1.Stop();
    Console.WriteLine("Total time to enumerate using EnumerateFiles =" + watch1.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles1);

    Stopwatch watch2 = new Stopwatch();
    watch2.Start();
    var allfiles2 = Get1("C:\\Users\\UserName\\Documents\\");
    watch2.Stop();
    Console.WriteLine("Total time to enumerate using Get1 =" + watch2.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles2);

    Stopwatch watch3 = new Stopwatch();
    watch3.Start();
    var allfiles3 = Get2("C:\\Users\\UserName\\Documents\\");
    watch3.Stop();
    Console.WriteLine("Total time to enumerate using Get2 =" + watch3.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles3);

    Stopwatch watch4 = new Stopwatch();
    watch4.Start();
    var allfiles4 = RunCommand(@"dir /a: /b /s C:\Users\UserName\Documents");
    watch4.Stop();
    Console.WriteLine("Total time to enumerate using Command Prompt =" + watch4.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles4);

    Console.WriteLine("Press Any Key to Continue...");
    Console.ReadLine();
}
private static int RunCommand(string command)
{
    var process = new Process()
    {
        StartInfo = new ProcessStartInfo("cmd")
        {
            UseShellExecute = false,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
            Arguments = String.Format("/c \"{0}\"", command),
        }
    };
    int count = 0;
    // The last event carries null Data (end of stream); do not count it as a line.
    process.OutputDataReceived += (sender, e) => { if (e.Data != null) count++; };
    process.Start();
    process.BeginOutputReadLine();
    process.WaitForExit();
    return count;
}
static int GetPast60Enum(string dir)
{
    return new DirectoryInfo(dir).EnumerateFiles("*.*", SearchOption.AllDirectories).Count();
}

private static int Get2(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
                  .AsParallel().Count();
}

private static int Get1(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateDirectories()
                  .AsParallel()
                  .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
                  .Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}

private static int GetPast60(string dir)
{
    return FastDirectoryEnumerator.GetFiles(dir, "*.*", SearchOption.AllDirectories).Length;
}
NB: The benchmark measures only the file count, not the modified-date filter.
The Get1 method in the answer above (#itsnotalie & #Chibueze Opata) does not count the files in the root directory, so it should read:
private static int Get1(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateDirectories()
                  .AsParallel()
                  .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
                  .Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}
This is (probably) as good as it's going to get:
DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles =
    dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
           .AsParallel()
           .Where(fi => fi.CreationTime < sixtyLess).ToArray();
Changes:
Made the 60-days-ago DateTime a constant, and therefore less CPU load.
Used EnumerateFiles.
Should run in a smaller amount of time (not sure how much smaller).
Here is another solution, which might be faster or slower than the first depending on the data:
DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
// Note: files that sit directly in myBaseDirectory are not visited by this query.
FileInfo[] oldFiles =
    dirInfo.EnumerateDirectories()
           .AsParallel()
           .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories)
                               .Where(fi => fi.CreationTime < sixtyLess))
           .ToArray();
Here it moves the parallelism to the main folder enumeration. Most of the changes from above apply too.
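Since that query only walks the subdirectories, files sitting directly in the base directory are skipped. A minimal sketch of one way to include them follows; the class name OldFileFinder, the method name FindOlderThan and the days parameter are made up for illustration, and CreationTime is used only because the snippets above use it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class OldFileFinder
{
    // Hypothetical helper: returns files created more than `days` days ago
    // under `baseDirectory`, including files directly in the base directory.
    public static FileInfo[] FindOlderThan(string baseDirectory, int days)
    {
        DateTime cutoff = DateTime.Now.AddDays(-days);
        var dirInfo = new DirectoryInfo(baseDirectory);

        // Each first-level subdirectory is walked as a separate parallel work item.
        IEnumerable<FileInfo> nested = dirInfo.EnumerateDirectories()
            .AsParallel()
            .SelectMany(di => di.EnumerateFiles("*", SearchOption.AllDirectories)
                                .Where(fi => fi.CreationTime < cutoff));

        // Files directly in the base directory are filtered sequentially.
        IEnumerable<FileInfo> topLevel = dirInfo
            .EnumerateFiles("*", SearchOption.TopDirectoryOnly)
            .Where(fi => fi.CreationTime < cutoff);

        return nested.Concat(topLevel).ToArray();
    }
}
```

Usage would look like OldFileFinder.FindOlderThan(myBaseDirectory, 60). Whether this beats the single parallel query still depends on how the files are spread across the subdirectories.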
If you really want to improve performance, get your hands dirty and use the NtQueryDirectoryFile function that's internal to Windows, with a large buffer size.
FindFirstFile is already slow, and while FindFirstFileEx is a bit better, the best performance will come from calling the native function directly.
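As a rough illustration of the middle option, here is a Windows-only sketch that calls FindFirstFileEx through P/Invoke with FindExInfoBasic and FIND_FIRST_EX_LARGE_FETCH. It is not the NtQueryDirectoryFile approach itself. The class name NativeEnumerator, the method name EnumerateOldFiles and the choice of last write time are assumptions made for the example; it does not handle paths longer than MAX_PATH and it follows junctions, so treat it as a starting point rather than finished code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class NativeEnumerator
{
    const int FindExInfoBasic = 1;            // do not fetch the 8.3 short name
    const int FindExSearchNameMatch = 0;
    const int FIND_FIRST_EX_LARGE_FETCH = 2;  // request a larger directory buffer
    const uint FILE_ATTRIBUTE_DIRECTORY = 0x10;
    static readonly IntPtr InvalidHandle = new IntPtr(-1);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct WIN32_FIND_DATA
    {
        public uint dwFileAttributes;
        public uint ftCreationTimeLow, ftCreationTimeHigh;
        public uint ftLastAccessTimeLow, ftLastAccessTimeHigh;
        public uint ftLastWriteTimeLow, ftLastWriteTimeHigh;
        public uint nFileSizeHigh, nFileSizeLow;
        public uint dwReserved0, dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)] public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr FindFirstFileExW(string lpFileName, int fInfoLevelId,
        out WIN32_FIND_DATA lpFindFileData, int fSearchOp,
        IntPtr lpSearchFilter, int dwAdditionalFlags);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool FindNextFileW(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    // Yields full paths of files under `dir` last written before `cutoffUtc`.
    public static IEnumerable<string> EnumerateOldFiles(string dir, DateTime cutoffUtc)
    {
        IntPtr handle = FindFirstFileExW(Path.Combine(dir, "*"), FindExInfoBasic,
            out WIN32_FIND_DATA data, FindExSearchNameMatch,
            IntPtr.Zero, FIND_FIRST_EX_LARGE_FETCH);
        if (handle == InvalidHandle) yield break; // unreadable or missing directory

        try
        {
            do
            {
                if (data.cFileName == "." || data.cFileName == "..") continue;
                string path = Path.Combine(dir, data.cFileName);

                if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0)
                {
                    // Recurse into the subdirectory.
                    foreach (string p in EnumerateOldFiles(path, cutoffUtc)) yield return p;
                }
                else
                {
                    // The timestamp comes with the directory entry; no extra call per file.
                    long fileTime = ((long)data.ftLastWriteTimeHigh << 32) | data.ftLastWriteTimeLow;
                    if (DateTime.FromFileTimeUtc(fileTime) < cutoffUtc) yield return path;
                }
            } while (FindNextFileW(handle, out data));
        }
        finally
        {
            FindClose(handle);
        }
    }
}
```

A call might look like NativeEnumerator.EnumerateOldFiles(myBaseDirectory, DateTime.UtcNow.AddDays(-60)). How much this gains over EnumerateFiles would need measuring on the actual data.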
I realize this is very late to the party, but if someone else is looking for this: you can speed things up by orders of magnitude by directly parsing the MFT or FAT of the file system. This requires admin privileges, as I think it will return all files regardless of security, but it can probably take your 30 mins down to 30 seconds, for the enumeration stage at least.
A library for NTFS is at https://github.com/LordMike/NtfsLib and there is also https://discutils.codeplex.com/ which I haven't personally used.
I would only use these methods for the initial discovery of files over x days old and then verify them individually before deleting. It might be overkill, but I'm cautious like that.
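The verify-then-delete step could be sketched roughly as follows. The names SafeCleanup and DeleteIfStillOld are made up for the example, and checking LastWriteTime is an assumption; swap in whichever timestamp the discovery step used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SafeCleanup
{
    // Hypothetical helper: re-checks each candidate through the normal file
    // system API and deletes it only if it is still older than `cutoff`.
    // Returns the number of files actually deleted.
    public static int DeleteIfStillOld(IEnumerable<string> candidates, DateTime cutoff)
    {
        int deleted = 0;
        foreach (string path in candidates)
        {
            var info = new FileInfo(path);
            if (!info.Exists) continue;                 // gone since discovery
            if (info.LastWriteTime >= cutoff) continue; // modified since discovery

            try
            {
                info.Delete();
                deleted++;
            }
            catch (IOException) { }                 // in use; leave it alone
            catch (UnauthorizedAccessException) { } // no permission; leave it alone
        }
        return deleted;
    }
}
```

The per-file check costs one extra metadata lookup for each candidate, which should be small next to the enumeration itself because it only touches files already flagged as old.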
When using SearchOption.AllDirectories, EnumerateFiles took ages to return the first item. After reading several good answers here, I have for now ended up with the function below. By having it work on only one directory at a time and calling it recursively, it now returns the first item almost immediately.
But I must admit that I'm not totally sure about the correct way to use .AsParallel(), so don't use this blindly.
Instead of working with arrays, I would strongly suggest working with enumerations. Some mention that disk speed is the limiting factor and that threads won't help. In terms of total time that is very likely true as long as nothing is cached by the OS, but by using multiple threads you can get the cached data returned first; otherwise the cache might be pruned to make space for the new results.
Recursive calls add to the stack depth, but most file systems limit how many directory levels there can be, so this should not become a real issue.
private static IEnumerable<FileInfo> EnumerateFilesParallel(DirectoryInfo dir)
{
    return dir.EnumerateDirectories()
              .AsParallel()
              .SelectMany(EnumerateFilesParallel)
              .Concat(dir.EnumerateFiles("*", SearchOption.TopDirectoryOnly).AsParallel());
}
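If the recursion depth or an unreadable folder turns out to be a concern, a sequential variant with an explicit stack is one option. This is a sketch only: it gives up the parallelism of the function above, and the names LazyWalker and EnumerateFilesLazily are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LazyWalker
{
    // Hypothetical helper: walks the tree one directory at a time using an
    // explicit stack instead of recursion, so the first files are yielded
    // as soon as the first directory has been read.
    public static IEnumerable<FileInfo> EnumerateFilesLazily(DirectoryInfo root)
    {
        var pending = new Stack<DirectoryInfo>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            DirectoryInfo current = pending.Pop();
            FileInfo[] files;
            DirectoryInfo[] subdirs;
            try
            {
                files = current.GetFiles();
                subdirs = current.GetDirectories();
            }
            catch (UnauthorizedAccessException) { continue; } // skip unreadable folders
            catch (DirectoryNotFoundException) { continue; }  // removed while walking

            foreach (FileInfo file in files) yield return file;
            foreach (DirectoryInfo sub in subdirs) pending.Push(sub);
        }
    }
}
```

Because each directory is read inside its own try block, one inaccessible folder does not abort the whole enumeration, which is what happens when SearchOption.AllDirectories hits an access error.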