Parallel Framework and avoiding false sharing

微笑、不失礼 提交于 2019-12-03 16:57:39

问题


Recently, I had answered a question about optimizing a likely parallelizable method for generation every permutation of arbitrary base numbers. I posted an answer similar to the Parallelized, poor implementation code block list, and someone nearly immediately pointed this out:

This is pretty much guaranteed to give you false sharing and will probably be many times slower. (credit to gjvdkamp)

and they were right, it was death slow. That said, I researched the topic, and found some interesting material and suggestions (archived MSDN magazine only, .NET Matters: False Sharing) for combating it. If I understand it correctly, when threads access contiguous memory (in say, the array that's likely backing that ConcurrentStack), false sharing likely occurs.


For code below the horizontal rule, a Bytes is:

struct Bytes {
  public byte A; public byte B; public byte C; public byte D;
  public byte E; public byte F; public byte G; public byte H;
}

For my own testing, I wanted to get a parallel version of this running and be genuinely faster, so I created a simple example based on the original code. 6 as limits[0] was a lazy choice on my part - my computer has 6 cores.

Single thread block Average run time: 10s0059ms

  var data = new List<Bytes>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  for (byte a = 0; a < limits[0]; a++)
  for (byte b = 0; b < limits[1]; b++)
  for (byte c = 0; c < limits[2]; c++)
  for (byte d = 0; d < limits[3]; d++)
  for (byte e = 0; e < limits[4]; e++)
  for (byte f = 0; f < limits[5]; f++)
  for (byte g = 0; g < limits[6]; g++)
  for (byte h = 0; h < limits[7]; h++)
    data.Add(new Bytes {
      A = a, B = b, C = c, D = d, 
      E = e, F = f, G = g, H = h
    });

Parallelized, poor implementation Run time avg: 81s729ms, ~ 8700 contentions

  var data = new ConcurrentStack<Bytes>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  Parallel.For(0, limits[0], (a) => {
    for (byte b = 0; b < limits[1]; b++)
    for (byte c = 0; c < limits[2]; c++)
    for (byte d = 0; d < limits[3]; d++)
    for (byte e = 0; e < limits[4]; e++)
    for (byte f = 0; f < limits[5]; f++)
    for (byte g = 0; g < limits[6]; g++)
    for (byte h = 0; h < limits[7]; h++)
      data.Push(new Bytes {
        A = (byte)a,B = b,C = c,D = d,
        E = e,F = f,G = g,H = h
      });
  }); 

Parallelized, ?? implementation Run time avg: 5s833ms, 92 contentions

  var data = new ConcurrentStack<List<Bytes>>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  Parallel.For (0, limits[0], () => new List<Bytes>(), 
    (a, loop, localList) => { 
      for (byte b = 0; b < limits[1]; b++)
      for (byte c = 0; c < limits[2]; c++)
      for (byte d = 0; d < limits[3]; d++)
      for (byte e = 0; e < limits[4]; e++)
      for (byte f = 0; f < limits[5]; f++)
      for (byte g = 0; g < limits[6]; g++)
      for (byte h = 0; h < limits[7]; h++)
        localList.Add(new Bytes {
          A = (byte)a, B = b, C = c, D = d,
          E = e, F = f, G = g, H = h
        });
      return localList;
  }, x => {
    data.Push(x);
  });

I'm glad that I had got an implementation that is faster than the single threaded version. I expected a result closer to around 10s / 6, or around 1.6 seconds, but that's probably a naive expectation.

My question is for the parallelized implementation that is actually faster than the single-threaded version, are there further optimizations that could be a applied to the operation? I'm wondering about optimizations related to parallelization, not improvements to the algorithm used to compute the values. Specifically:

  • I know about the optimization to store and populate as a struct instead of byte[], but it's not related to parallelization (or is it?)
  • I know that a desired value could be lazy evaluated with a ripple-carry adder, but same as the struct optimization.

回答1:


First off, my initial assumption regarding Parallel.For() and Parallel.ForEach() was wrong.

The poor parallel implementation very likely has 6 threads all attempting to write to a single CouncurrentStack() at once. The good implementation usuing thread locals (explained more below) only accesses the shared variable once per task, nearly eliminating any contention.

When using Parallel.For() and Parallel.ForEach(), you cannot simply in-line replace a for or foreach loop with them. That's not to say it couldn't be a blind improvement, but without examining the problem and instrumenting it, using them is throwing multithreading at a problem because it might make it faster.

**Parallel.For() and Parallel.ForEach() has overloads that allow you to create a local state for the Task they ultimately create, and run an expression before and after each iteration's execution.

If you have an operation you parallelize with Parallel.For() or Parallel.ForEach(), it's likely a good idea to use this overload:

public static ParallelLoopResult For<TLocal>(
    int fromInclusive,
    int toExclusive,
    Func<TLocal> localInit,
    Func<int, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally
)

For example, calling For() to sum all integers from 1 to 100,

var total = 0;

Parallel.For(0, 101, () => 0,  // <-- localInit
(i, state, localTotal) => { // <-- body
  localTotal += i;
  return localTotal;
}, localTotal => { <-- localFinally
  Interlocked.Add(ref total, localTotal);
});

Console.WriteLine(total);

localInit should be an lambda that initializes the local state type, which is passed to the body and localFinally lambdas. Please note I am not recommending implementing summing 1 to 100 using parallelization, but just have a simple example to make the example short.



来源:https://stackoverflow.com/questions/29814108/parallel-framework-and-avoiding-false-sharing

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!