I\'m writing a performance-critical, number-crunching C++ project where 70% of the time is used by the 200 line core module.
I\'d like to optimize the core using inl
I really like assembly, so I'm not going to be a nay-sayer here. It appears that you've profiled your code and found the 'hotspot', which is the correct way to start. I also assume that the 200 lines in question don't use a lot of high-level constructs like vector
.
I do have to give one bit of warning: if the number-crunching involves floating-point math, you are in for a world of pain, specifically a whole set of specialized instructions, and a college term's worth of algorithmic study.
All that said: if I were you, I'd step through the code in question in the VS debugger, using the Disassembly view. If you feel comfortable reading the code as you go along, that's a good sign. After that, do a Release compile (Debug turns off optimization) and generate an ASM listing for that module. Then if you think you see room for improvement...you have a place to start. Other people's answers have linked to the MSDN documentation, which is really pretty skimpy but still a reasonable start.
The microsoft compiler is very poor at optimisations when inline assembly gets involved. It has to back up registers because if you use eax then it won't move eax to another free register it will continue using eax. The GCC assembler is far more advanced on this front.
To get round this microsoft started offering intrinsics. These are a far better way to do your optimisation as it allows the compiler to work with you. As Chris mentioned inline assembly doesn't work under x64 with the MS compiler as well so on that platform you REALLY are better off just using the intrinsics.
They are easy to use and give good performance. I will admit I am often able to squeeze a few more cycles out of it by using an external assembler but they're bloody good for the productivity improvement they provide
I prefer writing entire functions in assembly rather than using inline
assembly. This allows you to swap out the high level language function with the assembly one during the build process. Also, you don't have to worry about compiler optimizations getting in the way.
Before you write a single line of assembly, print out the assembly language listing for your function. This gives you a foundation to build upon or modify. Another helpful tool is the interweaving of assembly with source code. This will tell you how the compiler is coding specific statements.
If you need to insert inline assembly for a large function, make a new function for the code that you need to inline. Again replace with C++ or assembly during build time.
These are my suggestions, Your Mileage May Vary (YMMV).
Nothing is in the registers. as the _asm block is executed. You need to move stuff into the registers. If there is a variable: 'a', then you would need to
__asm {
mov eax, [a]
}
It is worth pointing out that VS2010 comes with Microsofts assembler. Right click on a project, go to build rules and turn on the assembler build rules and the IDE will then process .asm files.
this is a somewhat better solution as VS2010 supports 32bit AND 64bit projects and the __asm keyword does NOT work in 64bit builds. You MUST use external assembler for 64bit code :/
You can access variables by their name and copy them to registers. Here's an example from MSDN:
int power2( int num, int power )
{
__asm
{
mov eax, num ; Get first argument
mov ecx, power ; Get second argument
shl eax, cl ; EAX = EAX * ( 2 to the power of CL )
}
// Return with result in EAX
}
Using C or C++ in ASM blocks might be also interesting for you.
Go for the low hanging fruit first...
As other have said, the Microsoft compiler is pretty poor at optimisation. You may be able to save yourself a lot of effort just by investing in a decent compiler, such as Intel's ICC, and re-compiling the code "as is". You can get a 30 day free evaluation license from Intel and try it out.
Also, if you have the option to build a 64-bit executable, then running in 64-bit mode can yield a 30% performance improvement, due to the x2 increase in number of available registers.