I'm writing a program that needs to search a directory and all its subdirectories for files that have a certain extension. This is going to be used both on a local, and a
Cool question.
I played around a little and, by leveraging iterator blocks and LINQ, I appear to have improved your revised implementation by about 40%.
I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.
Here is the meat of it:
private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);
    // Note: GetDirectories returns only the subdirectories, so files sitting
    // directly in rootFolderPath are not searched by this version.
    var dirList = rootDir.GetDirectories("*", SearchOption.AllDirectories);
    // Flatten the per-directory arrays into a single lazy sequence.
    return ReturnFiles(dirList, searchPattern).SelectMany(files => files);
}

private static IEnumerable<FileInfo[]> ReturnFiles(DirectoryInfo[] dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        // Each directory is only queried when the caller reaches it.
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}
I'd be inclined to return an IEnumerable<> in this case -- depending on how you are consuming the results, it could be an improvement, plus you reduce your parameter footprint by 1/3 and avoid passing around that List incessantly.
private IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    // Files directly in this directory.
    var fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    foreach (FileInfo fi in fiArr)
    {
        yield return fi;
    }

    // Recurse into each subdirectory and pass its results through.
    var diArr = di.GetDirectories();
    foreach (DirectoryInfo subDir in diArr)
    {
        var nextRound = GetFileList(fileSearchPattern, subDir.FullName);
        foreach (FileInfo fi in nextRound)
        {
            yield return fi;
        }
    }
}
Another idea would be to spin off BackgroundWorker objects to troll through directories. You wouldn't want a new thread for every directory, but you might create them on the top level (first pass through GetFileList()), so if you execute on your C:\ drive, with 12 directories, each of those directories will be searched by a different thread, which will then recurse through subdirectories. You'll have one thread going through C:\Windows while another goes through C:\Program Files. There are a lot of variables as to how this is going to affect performance -- you'd have to test it to see.
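A rough sketch of that top-level split follows. It swaps BackgroundWorker for Task.Run (so it assumes .NET 4.5 or later), and the class and method names are made up for illustration. It does not handle UnauthorizedAccessException, which a real C:\ scan would hit, and whether it is actually faster on a given disk or network share would need measuring.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class TopLevelParallelSearch
{
    // Hypothetical helper: one task per first-level subdirectory,
    // each task walks its own subtree on its own.
    public static List<string> FindFiles(string rootFolderPath, string searchPattern)
    {
        // Files directly in the root are collected on the calling thread.
        var results = new List<string>(Directory.GetFiles(rootFolderPath, searchPattern));

        // Start one task per top-level directory; each recurses through its subtree.
        Task<string[]>[] tasks = Directory.GetDirectories(rootFolderPath)
            .Select(dir => Task.Run(() =>
                Directory.GetFiles(dir, searchPattern, SearchOption.AllDirectories)))
            .ToArray();

        // Wait for every subtree, then merge on a single thread (no lock needed).
        Task.WaitAll(tasks);
        foreach (Task<string[]> task in tasks)
        {
            results.AddRange(task.Result);
        }
        return results;
    }
}
```

Because each task returns its own array and the merge happens after WaitAll, no shared list is touched from multiple threads.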
You can use Parallel.ForEach (.NET 4.0), or you can try the "Poor Man's Parallel.ForEach Iterator" for .NET 3.5. Either can speed up your search.
Try Parallel programming:
private string _fileSearchPattern;
private List<string> _files;
private readonly object lockThis = new object();

public List<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    _fileSearchPattern = fileSearchPattern;
    _files = new List<string>(); // must be created before AddFileList appends to it
    AddFileList(rootFolderPath);
    return _files;
}

private void AddFileList(string rootFolderPath)
{
    var files = Directory.GetFiles(rootFolderPath, _fileSearchPattern);
    // List<T> is not thread-safe, so guard the shared list.
    lock (lockThis)
    {
        _files.AddRange(files);
    }
    var directories = Directory.GetDirectories(rootFolderPath);
    Parallel.ForEach(directories, AddFileList); // same as Parallel.ForEach(directories, directory => AddFileList(directory));
}
This takes 30 seconds to get 2 million file names that meet the filter. It is this fast because I am only performing one enumeration; each additional enumeration hurts performance. The variable length is open to your interpretation and is not necessarily related to the enumeration example.
if (Directory.Exists(path))
{
    files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
        .Where(s => s.EndsWith(".xml") || s.EndsWith(".csv"))
        .Select(s => s.Remove(0, length)).ToList(); // Remove the Dir info.
}
I had the same problem. Here is my attempt, which is a lot faster than calling Directory.EnumerateFiles, Directory.EnumerateDirectories or Directory.EnumerateFileSystemEntries recursively:
public static IEnumerable<string> EnumerateDirectoriesRecursive(string directoryPath)
{
    return EnumerateFileSystemEntries(directoryPath).Where(e => e.isDirectory).Select(e => e.EntryPath);
}

public static IEnumerable<string> EnumerateFilesRecursive(string directoryPath)
{
    return EnumerateFileSystemEntries(directoryPath).Where(e => !e.isDirectory).Select(e => e.EntryPath);
}

public static IEnumerable<(string EntryPath, bool isDirectory)> EnumerateFileSystemEntries(string directoryPath)
{
    // Explicit stack instead of recursion: each directory is listed exactly once.
    Stack<string> directoryStack = new Stack<string>(new[] { directoryPath });
    while (directoryStack.Any())
    {
        foreach (string fileSystemEntry in Directory.EnumerateFileSystemEntries(directoryStack.Pop()))
        {
            // Treat an entry as a directory only if it is not a reparse point (junction/symlink).
            bool isDirectory = (File.GetAttributes(fileSystemEntry) & (FileAttributes.Directory | FileAttributes.ReparsePoint)) == FileAttributes.Directory;
            yield return (fileSystemEntry, isDirectory);
            if (isDirectory)
                directoryStack.Push(fileSystemEntry);
        }
    }
}
You can modify the code to search for specific files or directories easily.
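For instance, one possible modification that keeps only files with given extensions could look like the sketch below. The class name, method name and the case-insensitive matching are my own assumptions, not part of the code above; it reuses the same stack-based walk and reparse-point check.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExtensionSearch
{
    // Hypothetical variant: yields only files whose extension (including the dot,
    // e.g. ".xml") is in the given list, compared case-insensitively.
    public static IEnumerable<string> EnumerateFilesWithExtensions(string directoryPath, params string[] extensions)
    {
        var wanted = new HashSet<string>(extensions, StringComparer.OrdinalIgnoreCase);
        var pending = new Stack<string>();
        pending.Push(directoryPath);

        while (pending.Count > 0)
        {
            foreach (string entry in Directory.EnumerateFileSystemEntries(pending.Pop()))
            {
                FileAttributes attrs = File.GetAttributes(entry);
                bool isDirectory = (attrs & FileAttributes.Directory) != 0;
                bool isReparsePoint = (attrs & FileAttributes.ReparsePoint) != 0;

                if (isDirectory)
                {
                    // Descend into real directories only; junctions and symlinks are skipped.
                    if (!isReparsePoint)
                        pending.Push(entry);
                }
                else if (wanted.Contains(Path.GetExtension(entry)))
                {
                    yield return entry;
                }
            }
        }
    }
}
```

Usage would be something like ExtensionSearch.EnumerateFilesWithExtensions(path, ".xml", ".csv"); results are produced lazily, so the caller can stop early.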
Regards