Question
I have 2 nested loops and I'm trying to do a simple parallelisation.
pseudocode:
for item1 in data1 (~100 million rows)
    for item2 in data2 (~100 rows)
        result = process(item1, item2) // a couple of if conditions
        hashset.add(result) // in case of a duplicate, I also decide which one to retain
To be precise, process(item1, item2) has 4 if conditions based on values in item1 and item2 (time taken is less than 50 ms).
data1: size is N x 17
data2: size is N x 17
result: size is 1 x 17 (the result is joined into a string before it is added to the hashset)
Max output size: unknown, but I would like to be ready for at least 500 million, which means the hashset would be holding 500 million items. (How to handle so much data in a hashset is probably another question.)
Should I just use a concurrent hashset to make it thread safe and go with Parallel.ForEach, or should I go with the Task concept? Please provide some code samples based on your opinion.
Answer 1:
The answer depends a lot on the cost of process(data1, data2). If this is a CPU-intensive operation, then you can surely benefit from Parallel.ForEach. Of course, you should use a concurrent dictionary, or lock around your hash table. You should benchmark to see what works best for you. If process has too little impact on performance, then you will probably gain nothing from the parallelization - the locking on the hashtable will kill it all.
You should also try to see whether enumerating data2 on the outer loop is faster. It might give you another benefit - you can have a separate hashtable for each instance of data2 and then merge the results into one hashtable at the end. This avoids locks.
Again, you need to do your tests, there is no universal answer here.
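As a starting point for those tests, here is a minimal sketch of the Parallel.ForEach approach with per-worker hash sets that are merged once at the end, so no lock is taken inside the hot loop. The integer data and the Process body are placeholders, not the asker's real 17-column rows:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    // Placeholder for the real process(item1, item2) with its 4 if conditions.
    static int Process(int item1, int item2) => unchecked(item1 * item2) % 1000;

    static void Main()
    {
        int[] data1 = Enumerable.Range(1, 10_000).ToArray();
        int[] data2 = Enumerable.Range(1, 100).ToArray();

        var results = new HashSet<int>();
        object gate = new object();

        Parallel.ForEach(
            data1,
            () => new HashSet<int>(),      // localInit: one private set per worker
            (item1, state, local) =>
            {
                foreach (int item2 in data2)
                    local.Add(Process(item1, item2)); // no locking in the inner loop
                return local;
            },
            local =>
            {
                lock (gate) results.UnionWith(local); // localFinally: merge once per worker
            });

        Console.WriteLine(results.Count);
    }
}
```

The localInit/localFinally overload of Parallel.ForEach is what makes the "separate hashtable per worker, merge at the end" idea concrete: the lock is taken only once per worker thread instead of once per result.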
Answer 2:
My suggestion is to separate the processing of the data from the saving of the results to the HashSet, because the first is parallelizable but the second is not. You could achieve this separation with the producer-consumer pattern, using a BlockingCollection and threads (or tasks). But I'll show a solution using a more specialized tool, the TPL Dataflow library. I'll assume that the data are two arrays of integers, and that the processing function can produce up to 500,000,000 different results:
using System.Linq;
using System.Threading.Tasks.Dataflow; // the TPL Dataflow namespace

var data1 = Enumerable.Range(1, 100_000_000).ToArray();
var data2 = Enumerable.Range(1, 100).ToArray();
static int Process(int item1, int item2)
{
return unchecked(item1 * item2) % 500_000_000;
}
The dataflow pipeline will have two blocks. The first block is a TransformBlock that accepts an item from the data1 array, processes it with all items of the data2 array, and returns a batch of the results (as an int array).
var processBlock = new TransformBlock<int, int[]>(item1 =>
{
int[] batch = new int[data2.Length];
for (int j = 0; j < data2.Length; j++)
{
batch[j] = Process(item1, data2[j]);
}
return batch;
}, new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 100,
MaxDegreeOfParallelism = 3 // Configurable
});
The second block is an ActionBlock that receives the processed batches from the first block, and adds the individual results to the HashSet.
var results = new HashSet<int>();
var saveBlock = new ActionBlock<int[]>(batch =>
{
for (int i = 0; i < batch.Length; i++)
{
results.Add(batch[i]);
}
}, new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 100,
MaxDegreeOfParallelism = 1 // Mandatory
});
The line below links the two blocks together, so that the data will flow automatically from the first block to the second:
processBlock.LinkTo(saveBlock,
new DataflowLinkOptions() { PropagateCompletion = true });
The last step is to feed the first block with the items of the data1 array, and wait for the completion of the whole operation.
for (int i = 0; i < data1.Length; i++)
{
    // Blocks while the block is at its BoundedCapacity (backpressure).
    processBlock.SendAsync(data1[i]).Wait();
}
processBlock.Complete();
saveBlock.Completion.Wait();
The HashSet now contains the results.
A note about the BoundedCapacity option: this option controls the flow of the data, so that a fast block upstream will not flood a slow block downstream with data. Configuring this option properly increases the memory and CPU efficiency of the pipeline.
The TPL Dataflow library is built into .NET Core, and is available for .NET Framework as the System.Threading.Tasks.Dataflow NuGet package.
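For completeness, the BlockingCollection-based producer-consumer alternative mentioned above could be sketched like this, with the bounded capacity playing the same backpressure role as BoundedCapacity does in the Dataflow version (the integer data and Process are the same stand-ins as before):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    static int Process(int item1, int item2) => unchecked(item1 * item2) % 500_000_000;

    static void Main()
    {
        int[] data1 = Enumerable.Range(1, 100_000).ToArray();
        int[] data2 = Enumerable.Range(1, 100).ToArray();

        // Bounded queue of batches: producers block when it is full (backpressure).
        using var queue = new BlockingCollection<int[]>(boundedCapacity: 100);
        var results = new HashSet<int>();

        // Single consumer: the only thread touching the HashSet, so no locking is needed.
        var consumer = Task.Run(() =>
        {
            foreach (int[] batch in queue.GetConsumingEnumerable())
                foreach (int r in batch)
                    results.Add(r);
        });

        // Parallel producers: one batch of results per data1 item.
        Parallel.ForEach(
            data1,
            new ParallelOptions { MaxDegreeOfParallelism = 3 },
            item1 =>
            {
                int[] batch = new int[data2.Length];
                for (int j = 0; j < data2.Length; j++)
                    batch[j] = Process(item1, data2[j]);
                queue.Add(batch); // blocks while the queue is full
            });

        queue.CompleteAdding();   // lets GetConsumingEnumerable finish
        consumer.Wait();
        Console.WriteLine(results.Count);
    }
}
```

It needs a bit more manual wiring than the Dataflow pipeline (completing the collection, waiting for the consumer), but it has no extra dependency on .NET Framework.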
Source: https://stackoverflow.com/questions/61456516/simple-parallelisation-for-hashset