I've been profiling an application all day long and, having optimized a couple of bits of code, I'm left with this on my todo list. It's the activation function for a neural network.
If it's for an activation function, does it matter terribly much if the calculation of e^x is completely accurate?
For example, if you use the approximation (1 + x/256)^256, in my testing in Java on my Pentium (I'm assuming C# essentially compiles to the same processor instructions), this is about 7-8 times faster than e^x (Math.exp()), is accurate to 2 decimal places up to about x = +/-1.5, and is within the correct order of magnitude across the range you stated. (Obviously, to raise to the 256th power, you actually square the number 8 times -- don't use Math.Pow for this!) In Java:
double eapprox = (1d + x / 256d);
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
Keep doubling or halving 256 (and adding/removing a multiplication) depending on how accurate you want the approximation to be. Even with n = 4 (i.e. (1 + x/16)^16, four squarings), it still gives about 1.5 decimal places of accuracy for values of x between -0.5 and 0.5 (and appears a good 15 times faster than Math.exp()).
P.S. I forgot to mention -- you should obviously not really divide by 256: multiply by a constant 1/256 instead. Java's JIT compiler makes this optimisation automatically (at least, HotSpot does), and I was assuming that C#'s JIT must do so too.
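For what it's worth, here is a minimal C# sketch of the same idea (the method name and the parameter n are mine; n is the number of squarings, so n = 8 corresponds to (1 + x/256)^256):

public static double ExpApprox(double x, int n)
{
    // (1 + x/2^n)^(2^n): multiply by the constant 1/2^n, then square n times.
    double e = 1.0 + x * (1.0 / (1 << n));
    for (int i = 0; i < n; i++)
        e *= e;
    return e;
}

// e.g. ExpApprox(x, 8); smaller n is faster but less accurate, as noted above.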
This is slightly off topic, but just out of curiosity, I did the same implementation in Java as the C, C# and F# ones. I'll just leave this here in case someone else is curious.
Result:
$ javac LUTTest.java && java LUTTest
Max deviation is 0.001664
10^7 iterations using sigmoid1() took 1398 ms
10^7 iterations using sigmoid2() took 177 ms
I suppose the improvement over C# in my case is due to Java being better optimized than Mono for OS X. On Microsoft's .NET implementation (vs. Java 6, if someone wants to post comparative numbers) I suppose the results would be different.
Code:
public class LUTTest {
    private static final float SCALE = 320.0f;
    private static final int RESOLUTION = 2047;
    private static final float MIN = -RESOLUTION / SCALE;
    private static final float MAX = RESOLUTION / SCALE;
    private static final float[] lut = initLUT();

    private static float[] initLUT() {
        float[] lut = new float[RESOLUTION + 1];
        for (int i = 0; i < RESOLUTION + 1; i++) {
            lut[i] = (float) (1.0 / (1.0 + Math.exp(-i / SCALE)));
        }
        return lut;
    }

    public static float sigmoid1(double value) {
        return (float) (1.0 / (1.0 + Math.exp(-value)));
    }

    // LUT lookup with rounding; uses the symmetry sigmoid(-x) = 1 - sigmoid(x).
    public static float sigmoid2(float value) {
        if (value <= MIN) return 0.0f;
        if (value >= MAX) return 1.0f;
        if (value >= 0) return lut[(int) (value * SCALE + 0.5f)];
        return 1.0f - lut[(int) (-value * SCALE + 0.5f)];
    }

    public static float error(float v0, float v1) {
        return Math.abs(v1 - v0);
    }

    public static float testError() {
        float emax = 0.0f;
        for (float x = -10.0f; x < 10.0f; x += 0.00001f) {
            float v0 = sigmoid1(x);
            float v1 = sigmoid2(x);
            float e = error(v0, v1);
            if (e > emax) emax = e;
        }
        return emax;
    }

    public static long sigmoid1Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x += 0.00001f) {
                y = sigmoid1(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("", y); // reference y so the JIT can't discard the loop
        return t1 - t0;
    }

    public static long sigmoid2Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x += 0.00001f) {
                y = sigmoid2(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("", y); // reference y so the JIT can't discard the loop
        return t1 - t0;
    }

    public static void main(String[] args) {
        System.out.printf("Max deviation is %f\n", testError());
        System.out.printf("10^7 iterations using sigmoid1() took %d ms\n", sigmoid1Perf());
        System.out.printf("10^7 iterations using sigmoid2() took %d ms\n", sigmoid2Perf());
    }
}
You might also consider experimenting with alternative activation functions which are cheaper to evaluate. For example:
f(x) = (3x - x^3)/2
(which could be factored as
f(x) = x*(3 - x*x)/2
for one less multiplication). This function has odd symmetry, and its derivative is trivial. Using it for a neural network requires normalizing the sum-of-inputs by dividing by the total number of inputs (limiting the domain to [-1..1], which is also the range).
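A minimal C# sketch of that alternative (the names are mine; it assumes the summed input has already been normalized into [-1, 1] as described):

public static float CubicActivation(float x)
{
    // f(x) = (3x - x^3)/2, factored as x*(3 - x*x)/2 to save a multiplication.
    return x * (3.0f - x * x) * 0.5f;
}

public static float CubicActivationDerivative(float x)
{
    // f'(x) = 3(1 - x^2)/2 -- the trivial derivative mentioned above.
    return 1.5f * (1.0f - x * x);
}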
1) Do you call this from only one place? If so, you may gain a small amount of performance by moving the code out of that function and just putting it right where you would normally have called the Sigmoid function. I don't like this idea in terms of code readability and organization, but when you need to get every last performance gain, this might help, because I think function calls require a push/pop of registers on the stack, which could be avoided if your code were all inlined.
2) I have no idea if this might help but try making your function parameter a ref parameter. See if it's faster. I would have suggested making it const (which would have been an optimization if this were in C++) but C# doesn't support const parameters.
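If you want to try the ref idea, here is a quick sketch of what that variant might look like (untested; it may well make no measurable difference, since a double argument is already cheap to pass):

public static float Sigmoid(ref double value)
{
    return (float)(1.0 / (1.0 + Math.Exp(-value)));
}

// Call site: double x = 0.5; float y = Sigmoid(ref x);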
At 100 million calls, I'd start to wonder if profiler overhead isn't skewing your results. Replace the calculation with a no-op and see if it is still reported to consume 60% of the execution time...
Or better yet, create some test data and use a stopwatch timer to profile a million or so calls.
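For example, something along these lines (a rough sketch using System.Diagnostics.Stopwatch; the call count and input range are arbitrary, and the Sigmoid shown is just a stand-in for the function under test):

using System;
using System.Diagnostics;

class SigmoidBench
{
    // Stand-in for the function being profiled -- substitute your own Sigmoid here.
    static float Sigmoid(double value)
    {
        return (float)(1.0 / (1.0 + Math.Exp(-value)));
    }

    static void Main()
    {
        float sink = 0.0f; // accumulate results so the calls can't be optimized away
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000000; i++)
        {
            sink += Sigmoid(i * 0.00001 - 5.0); // inputs spread over [-5, 5)
        }
        sw.Stop();
        Console.WriteLine("1e6 calls took {0} ms (sink={1})", sw.ElapsedMilliseconds, sink);
    }
}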
A mild variation on Soprano's theme:
public static float Sigmoid(double value) {
    float v = (float) value;        // narrow to single precision first
    float k = (float) Math.Exp(v);  // Math.Exp works in double, so cast the result back to float
    return k / (1.0f + k);
}
Since you're only after a single-precision result, why make the Math.Exp function calculate a double? Any exponent calculator that uses an iterative summation (see the expansion of e^x) will take longer for more precision, each and every time. And double is twice the work of single! This way, you convert to single first, then do your exponential.
But the expf function should be faster still. I don't see the need for Soprano's (float) cast in passing to expf, though, unless C# doesn't do implicit float-to-double conversion.
Otherwise, just use a real language, like FORTRAN...