I'm developing optimizations for my 3D calculations and I now have a "plain" version using the standard C language libraries, plus SSE and AVX optimized versions. Is it possible to select which version to use at runtime?

One way is to implement three libraries conforming to the same interface. With dynamic libraries, you can just swap the library file and the executable will use whatever it finds. For example, on Windows you could compile three DLLs, one per implementation.
Then make the executable link against Impl.dll. Now just put one of the three specific DLLs into the same directory as the .exe, rename it to Impl.dll, and it will use that version. The same principle should basically be applicable on a UNIX-like OS.
The next step would be to load the libraries programmatically, which is probably the most flexible, but it is OS-specific and requires some more work (opening the library, obtaining function pointers, etc.).
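For example, on a POSIX system the programmatic route could look roughly like this (a minimal sketch; the library file name libimpl_sse.so and the exported function do_my_calculation are just placeholders for whatever interface you define, and you would link with -ldl):

#include <dlfcn.h>
#include <cstddef>
#include <cstdio>

// Signature shared by all three library builds (placeholder).
// If the libraries are built as C++, the function must be exported with C linkage.
typedef void (*calc_fn)(const float *input, float *output, size_t len);

int main()
{
    // Pick the file based on a config setting, CPU feature detection, etc.
    void *lib = dlopen("./libimpl_sse.so", RTLD_NOW);
    if (!lib)
    {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    calc_fn calc = (calc_fn)dlsym(lib, "do_my_calculation");
    if (!calc)
    {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return 1;
    }

    float in[4] = {1.0f, 2.0f, 3.0f, 4.0f}, out[4];
    calc(in, out, 4); // runs whichever implementation this library provides

    dlclose(lib);
    return 0;
}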
Edit: But of course, you could just implement the function three times and select one at runtime, depending on some parameter or config file setting, as outlined in the other answers.
Of course it's possible.
The best way to do it is to have functions that each do the complete job, and select among them at runtime. The following would work, but is not optimal:
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
    calc_type_invalid = 0,
    calc_type_plain,
    calc_type_sse,
    calc_type_avx,
    calc_type_max // not a valid value
} calc_type;

void do_my_calculation(float const *input, float *output, size_t len, calc_type ct)
{
    float f;
    size_t i;
    for (i = 0; i < len; ++i)
    {
        switch (ct)
        {
            case calc_type_plain:
                // plain calculation here
                break;
            case calc_type_sse:
                // SSE calculation here
                break;
            case calc_type_avx:
                // AVX calculation here
                break;
            default:
                fprintf(stderr, "internal error, unexpected calc_type %d\n", ct);
                exit(1);
                break;
        }
    }
}
On each pass through the loop, the code is executing a switch
statement, which is just overhead. A really clever compiler could theoretically fix it for you, but better to fix it yourself.
Instead, write three separate functions, one for plain, one for SSE, and one for AVX. Then decide at runtime which one to run.
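For example (a sketch reusing the calc_type enum from above; the three calculate_* functions stand in for your real implementations, and the selection could just as well come from CPU feature detection or a config file):

#include <cstddef>

// Each function does the complete job for one instruction set (placeholders).
void calculate_plain(const float *input, float *output, size_t len);
void calculate_sse(const float *input, float *output, size_t len);
void calculate_avx(const float *input, float *output, size_t len);

typedef void (*calc_fn)(const float *, float *, size_t);

// Decide once, up front, instead of on every element.
calc_fn select_calc(calc_type ct)
{
    switch (ct)
    {
        case calc_type_sse: return calculate_sse;
        case calc_type_avx: return calculate_avx;
        default:            return calculate_plain;
    }
}

// Usage:
//   calc_fn calc = select_calc(ct);
//   calc(input, output, len);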
For bonus points, in a "debug" build, do the calculation with both the SSE version and the plain version, and assert that the results are close enough to give you confidence. Write the plain version not for speed but for correctness; then use its results to verify that your clever optimized versions get the correct answer.
The legendary John Carmack recommends the latter approach; he calls it "parallel implementations". Read his essay about it.
So I recommend you write the plain version first. Then, go back and start re-writing parts of your application using SSE or AVX acceleration, and make sure that the accelerated versions give the correct answers. (And sometimes, the plain version might have a bug that the accelerated version doesn't. Having two versions and comparing them helps make bugs come to light in either version.)
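A minimal sketch of that debug-build check (the tolerance and the calculate_plain/calculate_sse names are illustrative, not fixed values):

#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// The real implementations live elsewhere; the names are placeholders.
void calculate_plain(const float *input, float *output, size_t len);
void calculate_sse(const float *input, float *output, size_t len);

void check_sse_against_plain(const float *input, size_t len)
{
    std::vector<float> expected(len), actual(len);
    calculate_plain(input, expected.data(), len); // reference ("known good") result
    calculate_sse(input, actual.data(), len);     // optimized result

    for (size_t i = 0; i < len; ++i)
    {
        // "Close enough": SIMD code may round differently, so allow a small tolerance.
        assert(std::fabs(expected[i] - actual[i]) <= 1e-5f * std::fabs(expected[i]) + 1e-6f);
    }
}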
There are several solutions for this.
One is based on C++, where you'd create multiple classes - typically, you implement an interface class and use a factory function to give you an object of the correct class.
e.g.
#include <iostream>

class Matrix
{
public:
    virtual ~Matrix() {}
    virtual void Multiply(Matrix &result, Matrix &a, Matrix &b) = 0;
    // ... other operations ...
};

class MatrixPlain : public Matrix
{
public:
    void Multiply(Matrix &result, Matrix &a, Matrix &b);
};

void MatrixPlain::Multiply(Matrix &result, Matrix &a, Matrix &b)
{
    // ... implementation goes here ...
}

class MatrixSSE : public Matrix
{
public:
    void Multiply(Matrix &result, Matrix &a, Matrix &b);
};

void MatrixSSE::Multiply(Matrix &result, Matrix &a, Matrix &b)
{
    // ... implementation goes here ...
}

// ... same thing for AVX ...

Matrix *factory()
{
    switch (type_of_math) // e.g. read from a config file or from CPU detection
    {
        case PlainMath:
            return new MatrixPlain;
        case SSEMath:
            return new MatrixSSE;
        case AVXMath:
            return new MatrixAVX;
        default:
            std::cerr << "Error, unknown type of math..." << std::endl;
            return NULL;
    }
}
Or, as suggested above, you can use shared libraries that have a common interface, and dynamically load the right library.
Of course, if you implement the Matrix base class as your "plain" version, you can do stepwise refinement, implementing optimized variants only for the parts you actually find to be beneficial, and rely on the base class to implement the functions where performance isn't highly critical.
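In sketch form (assuming you make the base-class methods ordinary virtuals with plain implementations instead of pure virtuals; the Add function here is just a made-up example of a less critical operation):

class Matrix
{
public:
    virtual ~Matrix() {}
    // The base class carries the plain, known-correct implementations.
    virtual void Multiply(Matrix &result, Matrix &a, Matrix &b);
    virtual void Add(Matrix &result, Matrix &a, Matrix &b);
};

class MatrixSSE : public Matrix
{
public:
    // Only the hot path is overridden; Add still uses the plain base version.
    virtual void Multiply(Matrix &result, Matrix &a, Matrix &b);
};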
Edit: You talk about inlining, and I think you are looking at the wrong level of function if that is the case. You want fairly large functions that do something to quite a bit of data. Otherwise, all your effort will be spent on getting the data into the right format, doing a few calculation instructions, and putting the data back into memory.
I would also consider how you store your data. Are you storing an array of {X, Y, Z, W} elements, or lots of X, lots of Y, lots of Z and lots of W in separate arrays (assuming we're doing 3D calculations)? Depending on how your calculation works, you may find that one layout or the other works best.
I've done a fair bit of SSE and 3DNow! optimisation some years back, and the "trick" is often more about how you store the data so you can easily grab a "bundle" of the right kind of data in one go. If you have the data stored the wrong way, you will waste a lot of time "swizzling" the data (moving it from one way of storing to another).
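For illustration, the two layouts look like this (often called "array of structures" versus "structure of arrays"; the Vertex/Vertices names are just for the example):

// Array of structures: X, Y, Z, W interleaved per element.
struct Vertex { float x, y, z, w; };
// Vertex vertices[1024];

// Structure of arrays: each component stored contiguously, so four
// consecutive X values can be loaded straight into one SSE register.
struct Vertices
{
    float *x;
    float *y;
    float *z;
    float *w;
};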