Math optimization in C#

悲&欢浪女 2020-12-07 10:25

I've been profiling an application all day long and, having optimized a couple of bits of code, I'm left with this on my todo list. It's the activation function for a neural network, a sigmoid of the form 1.0 / (1.0 + Math.Exp(-value)), which is called a very large number of times. How would you optimize it?

25 answers
  • 2020-12-07 11:04

    FWIW, here are my C# benchmarks for the answers already posted. (Empty is a function that just returns 0, to measure the function-call overhead.)

    Empty Function:       79ms   0
    Original:             1576ms 0.7202294
    Simplified: (soprano) 681ms  0.7202294
    Approximate: (Neil)   441ms  0.7198783
    Bit Manip: (martinus) 836ms  0.72318
    Taylor: (Rex Logan)   261ms  0.7202305
    Lookup: (Henrik)      182ms  0.7204863
    
    // Original, Simplified, ExpApproximation, BitBashing, TaylorExpansion and LUT
    // refer to the implementations from the other answers.
    public static object[] Time(Func<double, float> f) {
        var testvalue = 0.9456;
        var sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 1e7; i++)
            f(testvalue);
        return new object[] { sw.ElapsedMilliseconds, f(testvalue) };
    }
    public static void Main(string[] args) {
        Console.WriteLine("Empty:       {0,10}ms {1}", Time(Empty));
        Console.WriteLine("Original:    {0,10}ms {1}", Time(Original));
        Console.WriteLine("Simplified:  {0,10}ms {1}", Time(Simplified));
        Console.WriteLine("Approximate: {0,10}ms {1}", Time(ExpApproximation));
        Console.WriteLine("Bit Manip:   {0,10}ms {1}", Time(BitBashing));
        Console.WriteLine("Taylor:      {0,10}ms {1}", Time(TaylorExpansion));
        Console.WriteLine("Lookup:      {0,10}ms {1}", Time(LUT));
    }
    
  • 2020-12-07 11:05

    Soprano had some nice optimizations to your call:

    public static float Sigmoid(double value) 
    {
        float k = (float) Math.Exp(value); // Math.Exp returns double, so the cast is needed
        return k / (1.0f + k);
    }
    

    If you try a lookup table and find it uses too much memory, you could always look at the value of your parameter on each successive call and employ some caching technique.

    For example, try caching the last value and result. If the next call has the same value as the previous one, you don't need to calculate it, since you'd have cached the last result. If the current call matched the previous call even 1 time out of 100, you could potentially save yourself 1 million calculations.

    Or, you may find that within 10 successive calls the value parameter is, on average, the same 2 times, so you could then try caching the last 10 value/result pairs.
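
    A minimal sketch of the single-entry version of that cache, using the Sigmoid above (class and field names are just placeholders):

    using System;

    public static class CachedSigmoid
    {
        private static double _lastValue = double.NaN; // NaN never compares equal, so the first call always computes
        private static float _lastResult;

        public static float Sigmoid(double value)
        {
            // Reuse the previous result when the argument repeats.
            if (value == _lastValue)
                return _lastResult;

            float k = (float) Math.Exp(value);
            _lastValue = value;
            _lastResult = k / (1.0f + k);
            return _lastResult;
        }
    }

    Extending this to the last-10-values idea is just a small ring buffer or a Dictionary<double, float> in place of the two fields.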

  • 2020-12-07 11:05

    Off the top of my head, this paper explains a way of approximating the exponential by abusing floating point (click the link in the top right for the PDF), but I don't know if it'll be of much use to you in .NET.
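
    For what it's worth, here is a rough C# sketch of that kind of floating-point trick (a Schraudolph-style approximation that writes a linear function of x into the exponent bits of a double). The constants are the usual published ones; expect a few percent error, so verify it before using it:

    using System;

    static class FastExp
    {
        // Approximates e^x by constructing the bit pattern of an IEEE 754 double directly.
        public static double Exp(double x)
        {
            // 1512775 ≈ 2^20 / ln(2); 1072632447 = 1023 * 2^20 minus a small correction term.
            long bits = (long)(1512775.0 * x + 1072632447.0) << 32;
            return BitConverter.Int64BitsToDouble(bits);
        }

        public static float Sigmoid(double value)
        {
            return (float)(1.0 / (1.0 + Exp(-value)));
        }
    }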

    Another point: for the purpose of training large networks quickly, the logistic sigmoid you're using is pretty terrible. See section 4.4 of Efficient BackProp by LeCun et al. and use something zero-centered (actually, read the whole paper; it's immensely useful).
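
    If memory serves, the zero-centered activation that paper recommends is a scaled tanh, roughly as sketched below; double-check the constants against section 4.4 before relying on them:

    // f(x) = 1.7159 * tanh(2x / 3): zero-centered, with f(±1) ≈ ±1.
    public static float ScaledTanh(double x)
    {
        return (float)(1.7159 * Math.Tanh(2.0 * x / 3.0));
    }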

  • 2020-12-07 11:08

    There are much faster functions that do very similar things:

    x / (1 + abs(x)) – fast replacement for TANH

    And similarly:

    x / (2 + 2 * abs(x)) + 0.5 - fast replacement for SIGMOID

    Compare the plots with the actual sigmoid.
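
    In C#, both approximations are one-liners (a sketch; they follow the general shape of tanh/sigmoid but are not numerically equal to them):

    using System;

    static class FastActivations
    {
        // x / (1 + |x|): cheap, tanh-shaped squashing into (-1, 1).
        public static float FastTanh(float x) => x / (1f + Math.Abs(x));

        // x / (2 + 2|x|) + 0.5: the same curve rescaled into (0, 1), sigmoid-shaped.
        public static float FastSigmoid(float x) => x / (2f + 2f * Math.Abs(x)) + 0.5f;
    }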

  • 2020-12-07 11:09

    Note: This is a follow-up to this post.

    Edit: Updated to calculate the same thing as this and this, taking some inspiration from this.

    Now look what you made me do! You made me install Mono!

    $ gmcs -optimize test.cs && mono test.exe
    Max deviation is 0.001663983
    10^7 iterations using Sigmoid1() took 1646.613 ms
    10^7 iterations using Sigmoid2() took 237.352 ms
    

    C is hardly worth the effort anymore; the world is moving forward :)

    So, just over a factor of 6 faster. Someone with a Windows box gets to investigate the memory usage and performance using the MS stuff :)

    Using LUTs for activation functions is not uncommon, especially in hardware implementations. There are many well-proven variants of the concept out there if you are willing to use those kinds of tables. However, as has already been pointed out, aliasing might turn out to be a problem, but there are ways around that too. Some further reading:

    • NEURObjects by Giorgio Valentini (there's also a paper on this)
    • Neural networks with digital LUT activation functions
    • Boosting neural network feature extraction by reduced accuracy activation functions
    • A New Learning Algorithm for Neural Networks with Integer Weights and Quantized Non-linear Activation Functions
    • The effects of quantization on high order function neural networks

    Some gotchas with this:

    • The error goes up just outside the table (around x ≈ ±7.0) but converges to 0 at the extremes. This is due to the chosen scaling factor: larger values of SCALE give higher errors in the middle range, but smaller ones at the edges.
    • This is generally a very stupid test, and I don't know C#; it's just a plain conversion of my C code :)
    • Rinat Abdullin is very much correct that aliasing and precision loss might cause problems, but since I have not seen the variables involved I can only advise you to try this. In fact, I agree with everything he says except for the issue of lookup tables.

    Pardon the copy-paste coding...

    using System;
    using System.Diagnostics;
    
    class LUTTest {
        private const float SCALE = 320.0f;
        private const int RESOLUTION = 2047;
        private const float MIN = -RESOLUTION / SCALE;
        private const float MAX = RESOLUTION / SCALE;
    
        private static readonly float[] lut = InitLUT();
    
        private static float[] InitLUT() {
          var lut = new float[RESOLUTION + 1];
    
          for (int i = 0; i < RESOLUTION + 1; i++) {
            lut[i] = (float)(1.0 / (1.0 + Math.Exp(-i / SCALE)));
          }
          return lut;
        }
    
        public static float Sigmoid1(double value) {
            return (float) (1.0 / (1.0 + Math.Exp(-value)));
        }
    
        public static float Sigmoid2(float value) {
          // Exploits sigmoid(-x) = 1 - sigmoid(x), so the table only needs to cover x >= 0.
          if (value <= MIN) return 0.0f;
          if (value >= MAX) return 1.0f;
          if (value >= 0) return lut[(int)(value * SCALE + 0.5f)];
          return 1.0f - lut[(int)(-value * SCALE + 0.5f)];
        }
    
        public static float error(float v0, float v1) {
          return Math.Abs(v1 - v0);
        }
    
        public static float TestError() {
            float emax = 0.0f;
            for (float x = -10.0f; x < 10.0f; x+= 0.00001f) {
              float v0 = Sigmoid1(x);
              float v1 = Sigmoid2(x);
              float e = error(v0, v1);
              if (e > emax) emax = e;
            }
            return emax;
        }
    
        public static double TestPerformancePlain() {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            for (int i = 0; i < 10; i++) {
                for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                    Sigmoid1(x);
                }
            }
            sw.Stop();
            return sw.Elapsed.TotalMilliseconds;
        }    
    
        public static double TestPerformanceLUT() {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            for (int i = 0; i < 10; i++) {
                for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                    Sigmoid2(x);
                }
            }
            sw.Stop();
            return sw.Elapsed.TotalMilliseconds;
        }    
    
        static void Main() {
            Console.WriteLine("Max deviation is {0}", TestError());
            Console.WriteLine("10^7 iterations using Sigmoid1() took {0} ms", TestPerformancePlain());
            Console.WriteLine("10^7 iterations using Sigmoid2() took {0} ms", TestPerformanceLUT());
        }
    }
    
  • 2020-12-07 11:10

    If you're able to interop with C++, you could consider storing all the values in an array and looping over them using SSE, like this:

    #include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_set_ps1, _mm_add_ps, ...
    #include <cstddef>      // size_t

    void sigmoid_sse(float *a_Values, float *a_Output, size_t a_Size){
        __m128* l_Output = (__m128*)a_Output;
        __m128* l_Start  = (__m128*)a_Values;
        __m128* l_End    = (__m128*)(a_Values + a_Size);
    
        const __m128 l_One        = _mm_set_ps1(1.f);
        const __m128 l_Half       = _mm_set_ps1(1.f / 2.f);
        const __m128 l_OneOver6   = _mm_set_ps1(1.f / 6.f);
        const __m128 l_OneOver24  = _mm_set_ps1(1.f / 24.f);
        const __m128 l_OneOver120 = _mm_set_ps1(1.f / 120.f);
        const __m128 l_OneOver720 = _mm_set_ps1(1.f / 720.f);
        const __m128 l_MinOne     = _mm_set_ps1(-1.f);
    
        for(__m128 *i = l_Start; i < l_End; i++){
            // 1.0 / (1.0 + Math.Pow(Math.E, -value))
            // 1.0 / (1.0 + Math.Exp(-value))
    
            // value = *i so we need -value
            __m128 value = _mm_mul_ps(l_MinOne, *i);
    
        // exp expressed as an infinite series: 1 + x + (x ^ 2 / 2!) + (x ^ 3 / 3!) ...
            __m128 x = value;
    
            // result in l_Exp
            __m128 l_Exp = l_One; // = 1
    
            l_Exp = _mm_add_ps(l_Exp, x); // += x
    
            x = _mm_mul_ps(x, x); // = x ^ 2
            l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_Half, x)); // += (x ^ 2 * (1 / 2))
    
            x = _mm_mul_ps(value, x); // = x ^ 3
            l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver6, x)); // += (x ^ 3 * (1 / 6))
    
            x = _mm_mul_ps(value, x); // = x ^ 4
            l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver24, x)); // += (x ^ 4 * (1 / 24))
    
    #ifdef MORE_ACCURATE
    
            x = _mm_mul_ps(value, x); // = x ^ 5
            l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver120, x)); // += (x ^ 5 * (1 / 120))
    
            x = _mm_mul_ps(value, x); // = x ^ 6
            l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver720, x)); // += (x ^ 6 * (1 / 720))
    
    #endif
    
            // we've calculated exp of -i
            // now we only need to do the '1.0 / (1.0 + ...' part
            *l_Output++ = _mm_rcp_ps(_mm_add_ps(l_One,  l_Exp));
        }
    }
    

    However, remember that the arrays you'll be using should be allocated with _aligned_malloc(some_size * sizeof(float), 16), because SSE requires memory aligned to a 16-byte boundary.

    Using SSE, I can calculate the result for all 100 million elements in around half a second. However, allocating that much memory at once will cost you nearly two-thirds of a gigabyte, so I'd suggest processing more, smaller arrays at a time. You might even want to consider using a double-buffering approach with 100K elements or more.

    Also, if the number of elements starts to grow considerably, you might want to process these things on the GPU (just create a 1D float4 texture and run a very trivial fragment shader).
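
    If interop isn't an option, roughly the same idea can be sketched in managed code with System.Numerics.Vector<float> (just my rough equivalent using the same truncated Taylor series; it needs the System.Numerics.Vectors package on older frameworks):

    using System;
    using System.Numerics;

    static class VectorizedSigmoid
    {
        public static void Sigmoid(float[] values, float[] output)
        {
            int width = Vector<float>.Count;
            var one = Vector<float>.One;
            int i = 0;

            for (; i <= values.Length - width; i += width)
            {
                var x = -new Vector<float>(values, i);   // exponent argument: -value
                var exp = one + x                        // truncated Taylor series for e^(-value)
                        + x * x * 0.5f
                        + x * x * x * (1f / 6f)
                        + x * x * x * x * (1f / 24f);
                (one / (one + exp)).CopyTo(output, i);   // 1 / (1 + e^(-value))
            }

            for (; i < values.Length; i++)               // scalar tail for the leftover elements
                output[i] = (float)(1.0 / (1.0 + Math.Exp(-values[i])));
        }
    }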
