I have a console application that is going to take about 625 days to complete, unless there is a way to make it faster.
First off I am working in a directory that has a
You can move files in parallel, and using Directory.EnumerateFiles
gives you a lazily loaded list of files (of course, I have not tested it with 4,000,000 files):
var numberOfConcurrentMoves = 2;
var moves = new List<Task>();
var sourceDirectory = "source-directory";
var destinationDirectory = "destination-directory";
foreach (var filePath in Directory.EnumerateFiles(sourceDirectory))
{
    var move = new Task(() =>
    {
        File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));
        // UPDATE DB
    }, TaskCreationOptions.PreferFairness);
    move.Start();
    moves.Add(move);
    if (moves.Count >= numberOfConcurrentMoves)
    {
        Task.WaitAll(moves.ToArray());
        moves.Clear();
    }
}
Task.WaitAll(moves.ToArray());
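The loop above waits for a whole batch to finish before starting the next one, so one slow move holds up the rest of the batch. As an alternative sketch (not tested against millions of files either), Parallel.ForEach with MaxDegreeOfParallelism keeps a fixed number of moves in flight over the same lazy enumeration. The class name ParallelMover and the method MoveAll are made up for illustration, and the database update is left as a comment:

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
static class ParallelMover
{
    // Moves every file from sourceDirectory to destinationDirectory,
    // running at most maxConcurrentMoves moves at a time.
    // Returns the number of files moved.
    public static int MoveAll(string sourceDirectory, string destinationDirectory, int maxConcurrentMoves)
    {
        Directory.CreateDirectory(destinationDirectory);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxConcurrentMoves };
        var moved = 0;

        // EnumerateFiles is lazy, so the full file list is never held in memory.
        Parallel.ForEach(Directory.EnumerateFiles(sourceDirectory), options, filePath =>
        {
            File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));
            // UPDATE DB would go here.
            Interlocked.Increment(ref moved);
        });

        return moved;
    }
}
```

Calling ParallelMover.MoveAll("source-directory", "destination-directory", 2) would correspond to numberOfConcurrentMoves = 2 above. Whether more parallelism helps depends on the disk; on a single spindle it may not.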
It turns out switching from File.Move to setting up a FileInfo and using .MoveTo increased the speed significantly.
It will run in about 35 days now as opposed to 625 days.
FileInfo fileinfo = new FileInfo(Path.Combine(location, fileName));
fileinfo.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
18 seconds isn't really unusual. NTFS does not perform well when you have a lot of files in a single directory. When you ask for a file, it has to search the directory's index, and the cost of that lookup grows with the number of entries. With 1,000 files, that doesn't take too long. With 10,000 files, you notice it. With 4 million files... yeah, it takes a while.
You can probably do this even faster if you pre-load all of the directory entries into a dictionary in memory. Then, rather than calling the FileInfo
constructor for each file, you just look the file up in that dictionary.
Something like:
var dirInfo = new DirectoryInfo(path);
// get list of all files
var files = dirInfo.GetFileSystemInfos();
var cache = new Dictionary<string, FileSystemInfo>();
foreach (var f in files)
{
    cache.Add(f.FullName, f);
}
Now when you get a name from the database, you can just look it up in the dictionary. That might very well be faster than trying to get it from the disk each time.
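A rough sketch of how the lookup and the move could fit together, assuming the database hands back bare file names. CachedMover, BuildCache, and TryMove are hypothetical names, and this version keys the dictionary by file name rather than FullName so the lookup does not depend on how the source path was spelled:

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, for illustration only.
static class CachedMover
{
    // Enumerates the directory once and indexes every file by its name.
    public static Dictionary<string, FileInfo> BuildCache(string sourceDir)
    {
        var cache = new Dictionary<string, FileInfo>();
        foreach (var file in new DirectoryInfo(sourceDir).EnumerateFiles())
        {
            cache[file.Name] = file;
        }
        return cache;
    }

    // Moves the named file into destinationDir using the cached FileInfo.
    // Returns false when the name is not in the cache.
    public static bool TryMove(Dictionary<string, FileInfo> cache, string fileName, string destinationDir)
    {
        if (!cache.TryGetValue(fileName, out var info))
        {
            return false;
        }
        Directory.CreateDirectory(destinationDir);
        info.MoveTo(Path.Combine(destinationDir, fileName));
        cache.Remove(fileName); // the file no longer lives in the source directory
        return true;
    }
}
```

With 4 million entries the dictionary will take a noticeable amount of memory, so it is worth measuring before committing to it.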