I need to copy files from one directory to another, depending on whether the file name exists in a table of a SQL database.
For this I use the following code:
Allow me to make a guess - Mmmmm... No. There is no way to do it faster.
How come I am so confident? Because file copying requires talking to the disk, and that is a horribly slow operation. What's more, if you go for multi-threading, the copy is likely to get slower instead of faster, because the 'mechanical' movement of the head over the disk is no longer sequential, which it may have been earlier by chance.
See answers to this question I asked earlier.
So yeah, try going to SSDs if you aren't yet using them, otherwise you are getting the best already.
Here is something to put into perspective what "slow" means for disk access compared to caches. If a cache access took 10 minutes, a read from disk would take 2 years. All the access times are shown in the image below. Clearly, when your code executes, the bottleneck will be the disk writes. The best you can do is to let the disk writes stay sequential.
File.Copy is as fast as it gets. Keep in mind that you depend on the file transfer speed dictated by your hardware, and at 20000 files the latency of each data access also comes into play. If you are doing this on an HDD, you could see a big improvement after switching to an SSD or some other fast medium.
For this case alone, most likely the hardware is your bottleneck.
EDIT: I consider keeping the connection to the database open for such a long time bad practice. I suggest you fetch all the needed data into some in-memory cache (array, list, whatever) and then iterate through that as you copy the files. A db connection is a precious resource, and in applications that must handle high concurrency (though not only there), releasing the connection quickly is a must.
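A minimal sketch of that suggestion, not the asker's actual code: load the names into a list, let the connection close, and only then start copying. The class and method names are made up for illustration, the table and column names are borrowed from the query in another answer in this thread, and `System.Data.SqlClient` is assumed to be available (built into .NET Framework, a NuGet package on newer runtimes).

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;

static class CopyFromList // hypothetical helper class
{
    // Reads every file name into memory. The connection is disposed
    // as soon as this method returns, before any file is touched.
    // Assumption: table DocPicFiles with a string column namePicFile.
    public static List<string> LoadFileNames(string connectionString)
    {
        var names = new List<string>();
        using (var connection = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("select namePicFile from DocPicFiles", connection))
        {
            connection.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Skip NULL names instead of letting GetString throw.
                    if (!reader.IsDBNull(0)) names.Add(reader.GetString(0));
                }
            }
        }
        return names;
    }

    // Copies the listed files that exist in sourceDir, one after another,
    // and returns how many were copied. Existing targets are overwritten.
    public static int CopyListedFiles(IEnumerable<string> names, string sourceDir, string destDir)
    {
        Directory.CreateDirectory(destDir);
        int copied = 0;
        foreach (string name in names)
        {
            string from = Path.Combine(sourceDir, name);
            if (!File.Exists(from)) continue; // listed in the table but not on disk
            File.Copy(from, Path.Combine(destDir, name), true);
            copied++;
        }
        return copied;
    }
}
```

Usage would be along the lines of `CopyFromList.CopyListedFiles(CopyFromList.LoadFileNames(connectionString), sourceDir, destDir)`. The copy loop is deliberately sequential here, which keeps the disk access pattern as simple as possible.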
Since your I/O subsystem is almost certainly the bottleneck here, using the Task Parallel Library is probably about as good as it gets:
    using System ;
    using System.Collections.Generic ;
    using System.Data ;
    using System.Data.SqlClient ;
    using System.IO ;
    using System.Linq ;
    using System.Threading.Tasks ;

    static class Program
    {
        static void Main(string[] args)
        {
            DirectoryInfo source      = new DirectoryInfo( args[0] ) ;
            DirectoryInfo destination = new DirectoryInfo( args[1] ) ;

            // The HashSet constructor drains the reader, so the connection is closed before copying starts.
            HashSet<string> filesToBeCopied = new HashSet<string>( ReadFileNamesFromDatabase() , StringComparer.OrdinalIgnoreCase ) ;

            // you'll probably have to play with MaxDegreeOfParallelism so as to avoid swamping the i/o system
            ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = 4 } ;

            Parallel.ForEach( filesToBeCopied.SelectMany( fn => source.EnumerateFiles( fn ) ) , options , fi => {
                string destinationPath = Path.Combine( destination.FullName , Path.ChangeExtension( fi.Name , ".jpg") ) ;
                fi.CopyTo( destinationPath , false ) ; // false: throw rather than overwrite an existing file
            }) ;
        }

        public static IEnumerable<string> ReadFileNamesFromDatabase()
        {
            using ( SqlConnection connection = new SqlConnection( "connection-string" ) )
            using ( SqlCommand    cmd        = connection.CreateCommand() )
            {
                cmd.CommandType = CommandType.Text ;
                cmd.CommandText = @"
                    select idPic ,
                           namePicFile
                    from DocPicFiles
                    " ;
                connection.Open() ;
                using ( SqlDataReader reader = cmd.ExecuteReader() )
                {
                    while ( reader.Read() )
                    {
                        yield return reader.GetString(1) ; // column 1 is namePicFile
                    }
                }
                // no explicit Close() needed: the using block disposes the connection
            }
        }
    }
I addressed this problem by creating a single archive (.zip) using the option to just store the files (no compression). Creating the single .zip file, moving that single file, then expanding it at the destination proved to be 2x faster when dealing with thousands of files.
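The answer does not say which tool was used, so the following is only a sketch of the same idea using the .NET `System.IO.Compression` API, which is my assumption. `ZipTransfer` and `MoveViaArchive` are made-up names. It archives a whole directory and does not filter by the database table. `tempZipPath` must lie outside `sourceDir`, and on .NET Framework the project needs a reference to `System.IO.Compression.FileSystem`.

```csharp
using System.IO;
using System.IO.Compression;

static class ZipTransfer // hypothetical helper class
{
    // Packs sourceDir into one store-only archive, moves that single file
    // to destDir, unpacks it there, and removes the archive again.
    public static void MoveViaArchive(string sourceDir, string destDir, string tempZipPath)
    {
        // NoCompression stores the files as-is, so no CPU time goes into deflating them.
        // The last argument (false) leaves the source folder name out of the entry paths.
        ZipFile.CreateFromDirectory(sourceDir, tempZipPath, CompressionLevel.NoCompression, false);

        Directory.CreateDirectory(destDir);
        string movedZip = Path.Combine(destDir, Path.GetFileName(tempZipPath));

        // One large sequential transfer instead of thousands of small ones.
        File.Move(tempZipPath, movedZip);

        // Throws if a file with the same name already exists in destDir.
        ZipFile.ExtractToDirectory(movedZip, destDir);
        File.Delete(movedZip);
    }
}
```

Whether this beats plain copying will depend on the hardware and on how the files are laid out, so the 2x figure above should be measured rather than assumed. Note that the source files stay in place, so this sketch behaves as a copy.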