Dataflow: splitting work into small jobs and then grouping the results again

终归单人心 2020-12-31 09:36

I need to do this kind of work:

  1. Get Page objects from the database
  2. For each page, get all of its images and process them (IO-bound work, for example uploading to a CDN)
3 answers
  • 2020-12-31 09:49

    The .NET platform has a nice interface that can represent parent-child relationships: the IGrouping<TKey, TElement> interface. It is simply an IEnumerable that also has a Key property. The key can be anything, and in this case it could be the Page that needs to be processed. The contents of the grouping could be the Images that belong to each page and need to be uploaded. This leads to the idea of a dataflow block that processes IGrouping<TKey, TInput> objects by processing each TInput independently, then aggregating the results per grouping, and finally emitting them as IGrouping<TKey, TOutput> objects. Below is an implementation of this idea:

    public static TransformBlock<IGrouping<TKey, TInput>, IGrouping<TKey, TOutput>>
        CreateTransformGroupingBlock<TKey, TInput, TOutput>(
            Func<TKey, TInput, Task<TOutput>> transform,
            ExecutionDataflowBlockOptions dataflowBlockOptions = null)
    {
        if (transform == null) throw new ArgumentNullException(nameof(transform));
        dataflowBlockOptions ??= new ExecutionDataflowBlockOptions();
    
        var actionBlock = new ActionBlock<Task<Task<TOutput>>>(taskTask =>
        {
            // An exception thrown by the following line would cause buggy behavior.
            // According to the documentation it should never fail.
            taskTask.RunSynchronously();
            return taskTask.Unwrap();
        }, dataflowBlockOptions);
    
        var completionCTS = new CancellationTokenSource();
        _ = actionBlock.Completion
            .ContinueWith(_ => completionCTS.Cancel(), TaskScheduler.Default);
    
        var transformBlock = new TransformBlock<IGrouping<TKey, TInput>,
            IGrouping<TKey, TOutput>>(async grouping =>
        {
            if (grouping == null) throw new InvalidOperationException("Null grouping.");
            var tasks = new List<Task<TOutput>>();
            foreach (var item in grouping)
            {
                // Create a cold task that will be either executed by the actionBlock,
                // or will be canceled by the completionCTS. This should eliminate
                // any possibility that an awaited task will remain cold forever.
                var taskTask = new Task<Task<TOutput>>(() => transform(grouping.Key, item),
                    completionCTS.Token);
                var accepted = await actionBlock.SendAsync(taskTask);
                if (!accepted)
                {
                    // The actionBlock has failed.
                    // Skip the rest of the items. Pending tasks should still be awaited.
                    tasks.Add(Task.FromCanceled<TOutput>(new CancellationToken(true)));
                    break;
                }
                tasks.Add(taskTask.Unwrap());
            }
            TOutput[] results = await Task.WhenAll(tasks);
            return results.GroupBy(_ => grouping.Key).Single(); // Convert to IGrouping
        }, dataflowBlockOptions);
    
        // Cleanup
        _ = transformBlock.Completion
            .ContinueWith(_ => actionBlock.Complete(), TaskScheduler.Default);
        _ = Task.WhenAll(actionBlock.Completion, transformBlock.Completion)
            .ContinueWith(_ => completionCTS.Dispose(), TaskScheduler.Default);
    
        return transformBlock;
    }
    
    // Overload with synchronous lambda
    public static TransformBlock<IGrouping<TKey, TInput>, IGrouping<TKey, TOutput>>
        CreateTransformGroupingBlock<TKey, TInput, TOutput>(
            Func<TKey, TInput, TOutput> transform,
            ExecutionDataflowBlockOptions dataflowBlockOptions = null)
    {
        if (transform == null) throw new ArgumentNullException(nameof(transform));
        return CreateTransformGroupingBlock<TKey, TInput, TOutput>(
            (key, item) => Task.FromResult(transform(key, item)), dataflowBlockOptions);
    }
    

    This implementation consists of two blocks, a TransformBlock that processes the groupings and an internal ActionBlock that processes the individual items. Both are configured with the same user-supplied options. The TransformBlock sends to the ActionBlock the items to be processed one by one, then waits for the results, and finally constructs the output IGrouping<TKey, TOutput> with the following tricky line:

    return results.GroupBy(_ => grouping.Key).Single(); // Convert to IGrouping
    

    This compensates for the fact that the .NET platform currently has no publicly available class that implements the IGrouping interface. The GroupBy+Single combo does the trick, but it has the limitation that it cannot create empty IGroupings (Single throws on an empty sequence). If this is an issue, creating a class that implements the interface is always an option, and doing so is quite straightforward.
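
    A minimal sketch of such a class might look like the following. The name Grouping and its constructor shape are assumptions made for illustration; nothing with this name ships with .NET.

    ```csharp
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical helper, not part of the .NET base class library.
    public sealed class Grouping<TKey, TElement> : IGrouping<TKey, TElement>
    {
        private readonly IReadOnlyList<TElement> _elements;

        public Grouping(TKey key, IEnumerable<TElement> elements)
        {
            if (elements == null) throw new ArgumentNullException(nameof(elements));
            Key = key;
            _elements = elements.ToArray(); // Snapshot of the elements; may be empty.
        }

        public TKey Key { get; }

        public IEnumerator<TElement> GetEnumerator() => _elements.GetEnumerator();

        IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    }
    ```

    With a class like this, the conversion line could become return new Grouping<TKey, TOutput>(grouping.Key, results); and a page without images would produce an empty grouping instead of an exception.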

    Usage example of the CreateTransformGroupingBlock method:

    var processPages = new TransformBlock<Page, IGrouping<Page, Image>>(page =>
    {
        Image[] images = GetImagesFromDB(page);
        return images.GroupBy(_ => page).Single(); // Convert to IGrouping
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });
    
    var uploadImages = CreateTransformGroupingBlock<Page, Image, Image>(async (page, image) =>
    {
        await UploadImage(image);
        return image;
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });
    
    var savePages = new ActionBlock<IGrouping<Page, Image>>(grouping =>
    {
        var page = grouping.Key;
        foreach (var image in grouping) SaveImageToDB(image, page);
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
    
    processPages.LinkTo(uploadImages);
    uploadImages.LinkTo(savePages);
    

    The type of the uploadImages variable is TransformBlock<IGrouping<Page, Image>, IGrouping<Page, Image>>. In this example the types TInput and TOutput are the same, because the images do not need to be transformed.
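
    The links in the usage example do not propagate completion, so the snippet does not show how the pipeline is fed and shut down. Below is a rough, self-contained sketch of one way to do that with DataflowLinkOptions.PropagateCompletion. The PipelineDemo class and the stand-in types are assumptions for illustration only: a "page" is a string and its "images" are the characters of that string.

    ```csharp
    using System;
    using System.Linq;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    public static class PipelineDemo
    {
        // Stand-in pipeline: counts how many "images" (characters) were saved.
        public static async Task<int> RunAsync(string[] pages)
        {
            var saved = 0;
            // Completing the first block completes the linked blocks in turn.
            var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

            // GroupBy+Single requires at least one element, so pages must be non-empty strings.
            var loadImages = new TransformBlock<string, IGrouping<string, char>>(
                page => page.GroupBy(_ => page).Single());

            // Default MaxDegreeOfParallelism is 1, so updating the counter is safe here.
            var savePages = new ActionBlock<IGrouping<string, char>>(grouping =>
            {
                saved += grouping.Count();
            });

            loadImages.LinkTo(savePages, linkOptions);

            foreach (var page in pages) await loadImages.SendAsync(page);
            loadImages.Complete();       // No more input.
            await savePages.Completion;  // Wait until everything has flowed through.
            return saved;
        }
    }
    ```

    Applied to the example above, the same options object would be passed to both LinkTo calls, followed by processPages.Complete() and await savePages.Completion.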

  • 2020-12-31 09:56

    You can group the images together by recording each image that arrives for a given page and then sending the page on once all of its images have arrived. To figure that out, the page needs to know how many images it contains, but I assume you know that.

    In code, it could look something like this:

    public static IPropagatorBlock<TSplit, TMerged>
        CreaterMergerBlock<TSplit, TMerged>(
        Func<TSplit, TMerged> getMergedFunc, Func<TMerged, int> getSplitCount)
    {
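        // Not thread-safe. This is fine only because the block below uses
        // the default MaxDegreeOfParallelism of 1.
        var dictionary = new Dictionary<TMerged, int>();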
    
        return new TransformManyBlock<TSplit, TMerged>(
            split =>
            {
                var merged = getMergedFunc(split);
                int count;
                dictionary.TryGetValue(merged, out count);
                count++;
                if (getSplitCount(merged) == count)
                {
                    dictionary.Remove(merged);
                    return new[] { merged };
                }
    
                dictionary[merged] = count;
                return new TMerged[0];
            });
    }
    

    Usage:

    var dataPipe = new BufferBlock<Page>();
    
    var splitter = new TransformManyBlock<Page, ImageWithPage>(
        page => page.LoadImages(),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });
    
    var processImage = new TransformBlock<ImageWithPage, ImageWithPage>(
        image =>
        {
            // process the image here
            return image;
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });
    
    var merger = CreaterMergerBlock(
        (ImageWithPage image) => image.Page, page => page.ImageCount);
    
    var savePage = new ActionBlock<Page>(
        page => { /* save the page here */ },
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
    
    dataPipe.LinkTo(splitter);
    splitter.LinkTo(processImage);
    processImage.LinkTo(merger);
    merger.LinkTo(savePage);
    
  • 2020-12-31 10:09

    Consider merging "Load images" and "Process images" into a single TransformBlock. That way you have no trouble keeping the images of a single page together.

    In order to achieve your concurrency limit goal, use a SemaphoreSlim:

    SemaphoreSlim processImageDopLimiter = new SemaphoreSlim(8);
    
    //...
    
    var page = ...; //TransformBlock<Page, MyPageAndImageDTO> block input
    var images = GetImages(page);
    ImageWithPage[] processedImages =
     images
     .AsParallel()
     .Select(i => {
        // SemaphoreSlim exposes Wait/Release (not WaitOne/ReleaseOne).
        processImageDopLimiter.Wait();
        try { return ProcessImage(i); }
        finally { processImageDopLimiter.Release(); } // Release even if processing throws.
     })
     .ToArray();
    return new { page, processedImages };
    

    This will leave quite a few threads blocked while they wait on the semaphore. You can use an asynchronous version of this processing if you like, but that is immaterial to the question.
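
    For reference, an asynchronous variant could be sketched as follows, using SemaphoreSlim.WaitAsync so that no thread is blocked while waiting. The ThrottledProcessing class and the ProcessAllAsync helper are hypothetical names, and the sketch assumes the per-image work is available as a Task-returning delegate.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    public static class ThrottledProcessing
    {
        // Runs processAsync for every item. The shared limiter caps how many
        // operations are in flight at once, across all callers that share it.
        public static async Task<TResult[]> ProcessAllAsync<TItem, TResult>(
            IEnumerable<TItem> items,
            Func<TItem, Task<TResult>> processAsync,
            SemaphoreSlim limiter)
        {
            var tasks = items.Select(async item =>
            {
                await limiter.WaitAsync(); // Waits without blocking a thread.
                try { return await processAsync(item); }
                finally { limiter.Release(); }
            }).ToArray(); // Materialize the query so that every task is started.

            // Results come back in the same order as the input items.
            return await Task.WhenAll(tasks);
        }
    }
    ```

    Inside an async TransformBlock delegate this might be called as await ThrottledProcessing.ProcessAllAsync(images, ProcessImageAsync, processImageDopLimiter), where ProcessImageAsync is an assumed asynchronous counterpart of ProcessImage.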
