Question
I have a requirement to fetch blob files from Azure Storage, read through them, extract the data, process it, and store it in a database. The amount of data per blob is high, around 40K records per file, and there are 70 such files in one folder.
This is how I designed it:
- I use Parallel.ForEach on the list of blob files with a max degree of parallelism of 4.
- In each iteration, I open a stream on a blob (the OpenRead method), read through it, and fill a DataTable. When the DataTable reaches 10000 rows, I call SqlBulkCopy and insert the data into the database.
Parallel.ForEach(blobFiles)
{
    // Stream blob file
    // Create a datatable
    foreach (item in file)
    {
        AddToDatatable(item)
        if (datatable.Rows.Count >= 10000)
        {
            // BulkCopy to DB
            // Clear datatable
        }
    }
    // Dispose datatable
}
One observation I made is that when I increase the parallel count, the time taken to process one file increases. Is it because I'm opening multiple blob streams in parallel? Higher parallelism also keeps more data in memory at a time.
I would like to know two things:
- I would like to try a different design where I keep a single datatable and fill it from the parallel foreach. Then when it reaches 10K records, I would store it in the DB and clear it. I don't know how to implement that.
- Is there a better approach in terms of processing the files faster?
Answer 1:
Your current approach is quite logical. It is not optimal though, because each parallel workflow is composed of heterogeneous jobs that are not coordinated with the other workflows. For example, it is entirely possible that at a given moment all four parallel workflows are fetching data from Azure, at another moment all four are constructing datatables from raw data, and at yet another moment all four are waiting for a response from the database.
All these heterogeneous jobs have different characteristics. For example, the interaction with the database may not be parallelizable, and sending 4 concurrent SqlBulkCopy commands to the database may actually be slower than sending them one after the other. On the other hand, creating datatables in memory is probably highly parallelizable, while fetching data from Azure may benefit from parallelism only slightly (because the bottleneck could be the speed of your internet connection, not the speed of the Azure servers). It is quite certain though that you could achieve a 2x-3x performance boost just by making sure that at any given moment all the heterogeneous jobs are in progress. This is called task-parallelism, in contrast to the simpler data-parallelism (your current setup).
To achieve task-parallelism you need to create a pipeline, where the data flow from one processing block to the next until they reach the final block. In your case you probably need 3 blocks:
- Download files from Azure and split them to raw records.
- Parse the records and push the parsed data to datatables.
- Send the datatables to the database for storage.
Sending single records from the first block to the second may not be optimal, because parallelism has overhead, and the more granular the workload, the more overhead it creates. So ideally you would chunkify the workload, batching the records into arrays before sending them to the next block. All this can be implemented with a tool that is designed for exactly this kind of job: the TPL Dataflow library. It has blocks for transforming, batching, unbatching and more, and it is very flexible and feature-rich in the options it offers. But since it has a learning curve, I have something more familiar to suggest as the infrastructure for the pipeline: the PLINQ library.
Any time you add the AsParallel operator to a query, a new processing block is started. To force the data to flow to the next block as fast as possible, the WithMergeOptions(ParallelMergeOptions.NotBuffered) operator is needed. For controlling the degree of parallelism there is WithDegreeOfParallelism, and for keeping the items in their original order there is AsOrdered. Let's combine all of these in a single extension method for convenience, to avoid repeating them over and over:
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public static ParallelQuery<TSource> BeginPipelineBlock<TSource>(
    this IEnumerable<TSource> source, int degreeOfParallelism)
{
    return Partitioner
        .Create(source, EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .AsOrdered()
        .WithDegreeOfParallelism(degreeOfParallelism)
        .WithMergeOptions(ParallelMergeOptions.NotBuffered);
}
The Partitioner configured with NoBuffering ensures that PLINQ enumerates the source in its natural order, one item at a time. Without it, PLINQ employs fancier partitioning strategies that are not suitable for this usage.
Now your pipeline can be constructed fluently like this:
files
    .BeginPipelineBlock(degreeOfParallelism: 2)
    .SelectMany(file => DownloadFileRecords(file))
    .Buffer(1000)
    .BeginPipelineBlock(degreeOfParallelism: 3)
    .Select(batch => CreateDataTable(batch))
    .BeginPipelineBlock(degreeOfParallelism: 1)
    .ForAll(dataTable => SaveDataTable(dataTable));
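For context, here is a minimal sketch of what the three stage methods might look like. These are hypothetical helpers, not code from the original answer: the single-column schema, the connectionString field and the dbo.Records table name are placeholders you would adapt to your data.

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Azure.Storage.Blobs;

static readonly string connectionString = "<your-connection-string>"; // placeholder

// Stream the blob and yield one raw record (here: one line) at a time.
static IEnumerable<string> DownloadFileRecords(BlobClient file)
{
    using var stream = file.OpenRead();
    using var reader = new StreamReader(stream);
    string line;
    while ((line = reader.ReadLine()) != null) yield return line;
}

// Parse a batch of raw records into a DataTable (single-column example).
static DataTable CreateDataTable(IList<string> batch)
{
    var table = new DataTable();
    table.Columns.Add("Value", typeof(string)); // adapt to your real schema
    foreach (var record in batch) table.Rows.Add(record);
    return table;
}

// Bulk-insert one DataTable and release its memory.
static void SaveDataTable(DataTable table)
{
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "dbo.Records"; // placeholder
        bulkCopy.WriteToServer(table);
    }
    table.Dispose();
}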
The Buffer operator exists in the System.Interactive package, and combines single records into batches:
public static IEnumerable<IList<TSource>> Buffer<TSource>(
    this IEnumerable<TSource> source, int count);
An operator named Batch with the same functionality exists in the MoreLinq package. If you don't want the dependency, you can grab the source code and embed it directly into your project.
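If you would rather avoid either dependency, the batching logic is small enough to sketch by hand. This is a minimal version of the general idea, not the actual System.Interactive or MoreLinq source:

using System.Collections.Generic;

public static IEnumerable<IList<TSource>> Buffer<TSource>(
    this IEnumerable<TSource> source, int count)
{
    // Accumulate items into a list, emitting it every time it fills up.
    var batch = new List<TSource>(count);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == count)
        {
            yield return batch;
            batch = new List<TSource>(count);
        }
    }
    // Emit the final partial batch, if any.
    if (batch.Count > 0) yield return batch;
}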
Important: If you use the above technique to build the pipeline, you should avoid configuring two consecutive blocks with degreeOfParallelism: 1. This is because of how PLINQ works. This library does not depend only on background threads; it also uses the current thread as a worker thread. So if two (or more) consecutive pipeline blocks are configured with degreeOfParallelism: 1, they will all attempt to execute their workload on the current thread, blocking each other and defeating the whole purpose of task-parallelism.
This shows that the library is not really intended to be used as a pipeline infrastructure, and using it as such imposes some limitations. So if it makes sense for your pipeline to have consecutive blocks with degreeOfParallelism: 1, PLINQ stops being a viable option and you should look for alternatives, like the aforementioned TPL Dataflow library.
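For comparison, here is a rough sketch of the same three-block pipeline built with TPL Dataflow (the System.Threading.Tasks.Dataflow package). It reuses the hypothetical stage methods sketched earlier, with the same degrees of parallelism and batch size as the PLINQ version:

using System.Threading.Tasks.Dataflow;

var downloadBlock = new TransformManyBlock<BlobClient, string>(
    file => DownloadFileRecords(file),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

var batchBlock = new BatchBlock<string>(1000); // emits string[] batches

var parseBlock = new TransformBlock<string[], DataTable>(
    batch => CreateDataTable(batch),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 3 });

var saveBlock = new ActionBlock<DataTable>(
    table => SaveDataTable(table)); // MaxDegreeOfParallelism defaults to 1

// Link the blocks so that completion flows downstream automatically.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
downloadBlock.LinkTo(batchBlock, linkOptions);
batchBlock.LinkTo(parseBlock, linkOptions);
parseBlock.LinkTo(saveBlock, linkOptions);

foreach (var file in files) downloadBlock.Post(file);
downloadBlock.Complete();
saveBlock.Completion.Wait();

With PropagateCompletion enabled, completing the first block flushes the whole pipeline, so waiting on the last block's Completion is all the coordination needed. You can also set BoundedCapacity on the block options to cap how much data is buffered in memory at any moment.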
Update: It is actually possible to link consecutive blocks having degreeOfParallelism: 1 without squeezing them onto a single thread, by offloading the enumeration of the source to another thread. This way each block runs on a different thread. Below is an implementation of an OffloadEnumeration method, based on a Channel<T>:
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

private static IEnumerable<T> OffloadEnumeration<T>(
    IEnumerable<T> source, int boundedCapacity)
{
    var channel = Channel.CreateBounded<T>(boundedCapacity);
    var cts = new CancellationTokenSource();
    // A background task enumerates the source and writes the items to the channel.
    var task = Task.Run(async () =>
    {
        try
        {
            foreach (var item in source)
                while (!channel.Writer.TryWrite(item))
                    if (!await channel.Writer.WaitToWriteAsync(cts.Token))
                        throw new ChannelClosedException(); // Should never happen
            channel.Writer.Complete();
        }
        catch (Exception ex) { channel.Writer.Complete(ex); }
    });
    // The current thread reads the items back from the channel and yields them.
    try
    {
        while (channel.Reader.WaitToReadAsync().AsTask().GetAwaiter().GetResult())
            while (channel.Reader.TryRead(out var item))
                yield return item;
    }
    finally { cts.Cancel(); } // Stop the producer if the consumer abandons the enumeration
}
This method should be invoked at the beginning of each block:
public static ParallelQuery<TSource> BeginPipelineBlock<TSource>(
    this IEnumerable<TSource> source, int degreeOfParallelism)
{
    source = OffloadEnumeration(source, degreeOfParallelism * 10);
    return Partitioner
        .Create(source, EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .AsOrdered()
        .WithDegreeOfParallelism(degreeOfParallelism)
        .WithMergeOptions(ParallelMergeOptions.NotBuffered);
}
This is really only useful when the previous block has degreeOfParallelism: 1, but calling it always shouldn't add much overhead (assuming that the workload of each block is fairly chunky).
Source: https://stackoverflow.com/questions/62035864/design-help-for-parallel-processing-azure-blob-and-bulk-copy-to-sql-database-c