I've been profiling an application all day long and, having optimized a couple of bits of code, I'm left with this on my todo list. It's the activation function for a neural network.
FWIW, here are my C# benchmarks for the answers already posted. (Empty is a function that just returns 0, to measure the function-call overhead.)
Empty Function:           79ms   0
Original:               1576ms   0.7202294
Simplified (soprano):    681ms   0.7202294
Approximate (Neil):      441ms   0.7198783
Bit Manip (martinus):    836ms   0.72318
Taylor (Rex Logan):      261ms   0.7202305
Lookup (Henrik):         182ms   0.7204863
public static object[] Time(Func<double, float> f) {
    var testvalue = 0.9456;
    var sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < 1e7; i++)
        f(testvalue);
    return new object[] { sw.ElapsedMilliseconds, f(testvalue) };
}

public static void Main(string[] args) {
    Console.WriteLine("Empty: {0,10}ms {1}", Time(Empty));
    Console.WriteLine("Original: {0,10}ms {1}", Time(Original));
    Console.WriteLine("Simplified: {0,10}ms {1}", Time(Simplified));
    Console.WriteLine("Approximate: {0,10}ms {1}", Time(ExpApproximation));
    Console.WriteLine("Bit Manip: {0,10}ms {1}", Time(BitBashing));
    Console.WriteLine("Taylor: {0,10}ms {1}", Time(TaylorExpansion));
    Console.WriteLine("Lookup: {0,10}ms {1}", Time(LUT));
}
Soprano had some nice optimizations to your call:
public static float Sigmoid(double value)
{
    float k = (float)Math.Exp(value);
    return k / (1.0f + k);
}
If you try a lookup table and find it uses too much memory, you could always look at the value of your parameter on each successive call and employ some caching technique.
For example, try caching the last value and result. If the next call has the same value as the previous one, you don't need to calculate it, since you'd have cached the last result. If the current call matched the previous call even 1 time out of 100, you could potentially save yourself 1 million calculations.
Or, you may find that within 10 successive calls, the value parameter is on average the same 2 times, so you could then try caching the last 10 values/answers.
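A minimal sketch of the single-entry version might look like this (the class and field names are just illustrative, and it isn't thread-safe):

using System;

public static class CachedSigmoid
{
    private static double _lastInput = double.NaN; // NaN never compares equal, so the first call always computes
    private static float _lastOutput;

    public static float Sigmoid(double value)
    {
        // Reuse the previous result if the input hasn't changed since the last call.
        if (value == _lastInput)
            return _lastOutput;

        _lastInput = value;
        _lastOutput = (float)(1.0 / (1.0 + Math.Exp(-value)));
        return _lastOutput;
    }
}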
Off the top of my head, this paper explains a way of approximating the exponential by abusing floating point (click the link in the top right for the PDF), but I don't know if it'll be of much use to you in .NET.
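If it's the usual trick of writing the exponent bits of an IEEE-754 double directly, a C# sketch would look roughly like this (the constants are the commonly published ones for that approximation, so treat this as an assumption rather than the paper's exact code; accuracy is only within a few percent):

using System;

public static class FastExp
{
    // Approximates Math.Exp(x) by constructing the bit pattern of a double
    // whose exponent field encodes x / ln(2).
    public static double Exp(double x)
    {
        // 1512775 ~= 2^20 / ln(2); 1072632447 = 1023 * 2^20 minus a small tuning term
        long bits = (long)(1512775.0 * x + 1072632447.0) << 32;
        return BitConverter.Int64BitsToDouble(bits);
    }

    public static float Sigmoid(double value)
    {
        return (float)(1.0 / (1.0 + Exp(-value)));
    }
}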
Also, another point: for the purpose of training large networks quickly, the logistic sigmoid you're using is pretty terrible. See section 4.4 of Efficient Backprop by LeCun et al and use something zero-centered (actually, read that whole paper, it's immensely useful).
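For reference, the activation recommended there is a scaled tanh, which is a one-liner to drop in (sketch; the 1.7159 and 2/3 constants are the ones from section 4.4):

public static float ScaledTanh(double value)
{
    // f(x) = 1.7159 * tanh(2x / 3): roughly zero-centered, with f(1) close to 1
    return (float)(1.7159 * Math.Tanh(2.0 * value / 3.0));
}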
There are much faster functions that do very similar things:
x / (1 + abs(x))
– a fast replacement for tanh
And similarly:
x / (2 + 2 * abs(x)) + 0.5
– a fast replacement for sigmoid
Compare plots with actual sigmoid
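In C# those work out to something like this (sketch; the method names are mine):

public static float FastTanh(float x)
{
    // Approximates tanh(x): same sign and saturation behaviour, no transcendental call
    return x / (1.0f + Math.Abs(x));
}

public static float FastSigmoid(float x)
{
    // Approximates the logistic sigmoid: output in (0, 1), passes through 0.5 at x = 0
    return x / (2.0f + 2.0f * Math.Abs(x)) + 0.5f;
}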
Note: This is a follow-up to this post.
Edit: Update to calculate the same thing as this and this, taking some inspiration from this.
Now look what you made me do! You made me install Mono!
$ gmcs -optimize test.cs && mono test.exe
Max deviation is 0.001663983
10^7 iterations using Sigmoid1() took 1646.613 ms
10^7 iterations using Sigmoid2() took 237.352 ms
C is hardly worth the effort anymore, the world is moving forward :)
So, just over a factor of 6 faster. Someone with a Windows box gets to investigate the memory usage and performance using the MS stuff :)
Using LUTs for activation functions is not so uncommon, especially when implemented in hardware. There are many well-proven variants of the concept out there if you are willing to include those types of tables. However, as has already been pointed out, aliasing might turn out to be a problem, but there are ways around that too. Some further reading:
Some gotchas with this:
Pardon the copy-paste coding...
using System;
using System.Diagnostics;

class LUTTest {

    private const float SCALE = 320.0f;
    private const int RESOLUTION = 2047;
    private const float MIN = -RESOLUTION / SCALE;
    private const float MAX = RESOLUTION / SCALE;

    private static readonly float[] lut = InitLUT();

    private static float[] InitLUT() {
        var lut = new float[RESOLUTION + 1];
        // Tabulate the sigmoid for x in [0, RESOLUTION / SCALE]; negative x uses symmetry.
        for (int i = 0; i < RESOLUTION + 1; i++) {
            lut[i] = (float)(1.0 / (1.0 + Math.Exp(-i / SCALE)));
        }
        return lut;
    }

    public static float Sigmoid1(double value) {
        return (float)(1.0 / (1.0 + Math.Exp(-value)));
    }

    public static float Sigmoid2(float value) {
        if (value <= MIN) return 0.0f;
        if (value >= MAX) return 1.0f;
        if (value >= 0) return lut[(int)(value * SCALE + 0.5f)];
        // sigmoid(-x) = 1 - sigmoid(x)
        return 1.0f - lut[(int)(-value * SCALE + 0.5f)];
    }

    public static float error(float v0, float v1) {
        return Math.Abs(v1 - v0);
    }

    public static float TestError() {
        float emax = 0.0f;
        for (float x = -10.0f; x < 10.0f; x += 0.00001f) {
            float v0 = Sigmoid1(x);
            float v1 = Sigmoid2(x);
            float e = error(v0, v1);
            if (e > emax) emax = e;
        }
        return emax;
    }

    public static double TestPerformancePlain() {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x += 0.00001f) {
                Sigmoid1(x);
            }
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static double TestPerformanceLUT() {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x += 0.00001f) {
                Sigmoid2(x);
            }
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main() {
        Console.WriteLine("Max deviation is {0}", TestError());
        Console.WriteLine("10^7 iterations using Sigmoid1() took {0} ms", TestPerformancePlain());
        Console.WriteLine("10^7 iterations using Sigmoid2() took {0} ms", TestPerformanceLUT());
    }
}
If you're able to interop with C++, you could consider storing all the values in an array and looping over them using SSE, like this:
#include <cstddef>      // size_t
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*)

void sigmoid_sse(float *a_Values, float *a_Output, size_t a_Size) {
    __m128* l_Output = (__m128*)a_Output;
    __m128* l_Start = (__m128*)a_Values;
    __m128* l_End = (__m128*)(a_Values + a_Size);

    const __m128 l_One = _mm_set_ps1(1.f);
    const __m128 l_Half = _mm_set_ps1(1.f / 2.f);
    const __m128 l_OneOver6 = _mm_set_ps1(1.f / 6.f);
    const __m128 l_OneOver24 = _mm_set_ps1(1.f / 24.f);
    const __m128 l_OneOver120 = _mm_set_ps1(1.f / 120.f);
    const __m128 l_OneOver720 = _mm_set_ps1(1.f / 720.f);
    const __m128 l_MinOne = _mm_set_ps1(-1.f);

    for (__m128 *i = l_Start; i < l_End; i++) {
        // 1.0 / (1.0 + Math.Pow(Math.E, -value))
        // 1.0 / (1.0 + Math.Exp(-value))

        // value = *i so we need -value
        __m128 value = _mm_mul_ps(l_MinOne, *i);

        // exp expressed as the infinite series 1 + x + (x ^ 2 / 2!) + (x ^ 3 / 3!) ...
        __m128 x = value;

        // result in l_Exp
        __m128 l_Exp = l_One;                                    // = 1
        l_Exp = _mm_add_ps(l_Exp, x);                            // += x

        x = _mm_mul_ps(x, x);                                    // = x ^ 2
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_Half, x));        // += (x ^ 2 * (1 / 2))

        x = _mm_mul_ps(value, x);                                // = x ^ 3
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver6, x));    // += (x ^ 3 * (1 / 6))

        x = _mm_mul_ps(value, x);                                // = x ^ 4
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver24, x));   // += (x ^ 4 * (1 / 24))

#ifdef MORE_ACCURATE
        x = _mm_mul_ps(value, x);                                // = x ^ 5
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver120, x));  // += (x ^ 5 * (1 / 120))

        x = _mm_mul_ps(value, x);                                // = x ^ 6
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver720, x));  // += (x ^ 6 * (1 / 720))
#endif

        // we've calculated exp of -i
        // now we only need to do the '1.0 / (1.0 + ...' part
        *l_Output++ = _mm_rcp_ps(_mm_add_ps(l_One, l_Exp));
    }
}
However, remember that the arrays you'll be using should be allocated with _aligned_malloc(some_size * sizeof(float), 16), because SSE requires memory aligned to a 16-byte boundary.
Using SSE, I can calculate the result for all 100 million elements in around half a second. However, allocating that much memory at once will cost you nearly two-thirds of a gigabyte, so I'd suggest processing more, smaller arrays at a time. You might even want to consider using a double-buffering approach with 100K elements or more.
Also, if the number of elements starts to grow considerably you might want to choose to process these things on the GPU (just create a 1D float4 texture and run a very trivial fragment shader).