I know that MSVC compiler in x64 mode does not support inline assembly snippets of code, and in order to use assembly code you have to define your function in some external
No, there is no way to do what you want.
Microsoft's compiler doesn't support inline assembly for x86-64 targets, as you said. This forces you to define your assembly functions in an external code module (*.asm), assemble them with MASM, and link the result together with your separately-compiled C/C++ code.
The required separation of steps means that the C/C++ compiler cannot inline your assembly functions because they are not visible to it at the time of compilation.
Even with link-time code generation (LTCG) enabled, your assembly module(s) will not get inlined because the linker simply doesn't support this.
There is absolutely no way to get assembly functions written in a separate module inlined directly into C or C++ code.
There is no way that the inline or __forceinline keywords could do anything. In fact, there's no way that you could use them without a compiler error (or at least a warning). These annotations have to go on the function's definition (which, for an inline function, is the same as its declaration), but you can't put them on the function's definition, since that lives in a separate *.asm file. These aren't MASM keywords, so trying to add them to the definition would necessarily result in an error. And putting them on the forward declaration of the assembly function in the C header is going to be similarly unsuccessful, since there's no code there to inline, just a prototype.
This is why Microsoft recommends using intrinsics. You can use these directly in your C or C++ code, and the compiler will emit the corresponding assembly code automatically. Not only does this accomplish the desired inlining, but intrinsics even allow the optimizer to function, further improving the results. No, intrinsics do not lead to perfect code, and there aren't intrinsics for everything, but it's the best you can do with Microsoft's compiler.
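To make the intrinsics route concrete, here is a minimal sketch. The helper name add4 is made up for illustration; the SSE2 intrinsics themselves come from <emmintrin.h>, which MSVC, GCC, and Clang all ship for x86-64 targets. Which instructions you actually get depends on the compiler and optimization settings, so treat the comment about PADDD as the typical outcome rather than a guarantee.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (baseline on x86-64)
#include <cstdint>

// Hypothetical helper: adds four 32-bit integers lane by lane.
// The body is ordinary C++ built from intrinsics, so the compiler sees it
// at compile time and is free to inline it and optimize across the call.
inline void add4(const std::int32_t* a, const std::int32_t* b, std::int32_t* out)
{
    // Unaligned loads: the input pointers need no special alignment.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));

    // Packed 32-bit add; typically compiles to a single PADDD.
    __m128i sum = _mm_add_epi32(va, vb);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}
```

Unlike a function in a separate *.asm module, this can live in a header, and the inline keyword is legal here because the definition is visible to the C++ compiler.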
Your only other alternative is to sit down and play with various permutations of C/C++ code until you get the compiler to generate the desired object code. This can be very powerful in cases where intrinsics are not available for the instructions that you wish to be generated, but it does take a lot of time spent fidgeting, and you'll have to revisit it to make sure it continues to do what you want when you upgrade compiler versions.
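As one illustration of that approach, the sketch below uses the well-known shift-and-or rotate idiom. The function name rotl32 is hypothetical. Many optimizing compilers recognize this pattern and emit a single rotate instruction, but whether a particular version of Microsoft's compiler does so is an assumption you would have to confirm by inspecting the generated assembly.

```cpp
#include <cstdint>

// Hypothetical helper: rotate a 32-bit value left by n bits, written in
// plain C++ in the hope that the optimizer turns it into one ROL.
inline std::uint32_t rotl32(std::uint32_t x, unsigned n)
{
    n &= 31;  // keep the shift count in the range 0..31

    // Masking the right-shift count avoids an undefined 32-bit shift
    // when n is 0 (the expression then reduces to x | x).
    return (x << n) | (x >> ((32 - n) & 31));
}
```

The result is correct regardless of what the compiler emits; only the quality of the object code varies, which is exactly the part you need to re-check after a compiler upgrade.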