I'm currently coding an application in C# which could benefit a great deal from using SSE, as a relatively small piece of code accounts for 90-95% of the execution time. The code i
The open-source Yeppp! library (of which I am the author) provides SIMD-optimized data processing functions and is usable from .NET languages via official bindings. It supports not only SSE, but also later SIMD extensions up to AVX2 from the upcoming Intel Haswell processors. The library automatically chooses the optimal version for the processor it runs on.
C# supports quite a few SIMD/SSE operations natively through System.Numerics, which is cross-platform. Dot product is one of the supported operations.
The HPCsharp NuGet package on nuget.org, which I've been actively developing for the last two years, uses this capability to accelerate many algorithms. Let me know if certain useful algorithms could use acceleration through SIMD/SSE and multi-core.
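To illustrate the System.Numerics route, here is a minimal sketch of a dot product over two float arrays using `Vector<float>`. The class and method names (`SimdDot`, `Dot`) are made up for this example and are not part of any library; whether the JIT actually emits SSE/AVX instructions depends on the runtime and hardware (`Vector.IsHardwareAccelerated` reports this).

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
public static class SimdDot
{
    // Dot product of two equal-length float arrays using Vector<float>.
    public static float Dot(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Arrays must have the same length.");

        int width = Vector<float>.Count;   // lanes per vector, hardware dependent
        Vector<float> acc = Vector<float>.Zero;
        int i = 0;

        // Process 'width' elements per iteration.
        for (; i <= a.Length - width; i += width)
        {
            acc += new Vector<float>(a, i) * new Vector<float>(b, i);
        }

        // Sum the lanes of the accumulator (dot with a vector of ones).
        float sum = Vector.Dot(acc, Vector<float>.One);

        // Scalar tail for the elements that did not fill a whole vector.
        for (; i < a.Length; i++)
        {
            sum += a[i] * b[i];
        }

        return sum;
    }
}
```

Calling `SimdDot.Dot(x, y)` on two arrays of the same length returns the same value a plain scalar loop would, up to floating-point rounding differences caused by the changed summation order.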
As of April 2013, Steam Survey reports that only 64% of PCs have support for SSE4.1. In other words, if you assume SSE4.1 support, you'll crash on about a third of all consumer PCs.
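Rather than assuming SSE4.1 is present, the code can check at run time and fall back to a scalar path. The following is only a sketch, and it assumes .NET Core 3.0 or later, where the `System.Runtime.Intrinsics.X86` classes are available (they did not exist when this answer was written). The `Dot4` class name is hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
public static class Dot4
{
    // 4-component dot product: uses the SSE4.1 DPPS instruction when the CPU
    // supports it, otherwise falls back to plain scalar arithmetic.
    public static float Compute(
        float ax, float ay, float az, float aw,
        float bx, float by, float bz, float bw)
    {
        if (Sse41.IsSupported)
        {
            Vector128<float> a = Vector128.Create(ax, ay, az, aw);
            Vector128<float> b = Vector128.Create(bx, by, bz, bw);

            // Control byte 0xF1: multiply all four lanes (high nibble 0xF)
            // and write the sum into the lowest lane only (low nibble 0x1).
            return Sse41.DotProduct(a, b, 0xF1).ToScalar();
        }

        // Scalar fallback for CPUs (or runtimes) without SSE4.1.
        return ax * bx + ay * by + az * bz + aw * bw;
    }
}
```

The JIT treats `Sse41.IsSupported` as a constant, so the branch that is not taken should be eliminated and the check itself costs nothing in the hot path.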
I am not familiar with Mono.Simd, but a good alternative on Windows is DirectXMath, if you can be bothered to write a suitable C++/CLI wrapper. Neither will take advantage of all the latest instructions, but you can supplement them with intrinsics relatively easily as the need arises. I'm not sure you'll be able to do significantly better than Mono.Simd with it, though.
There is no such thing as "inline assembly" in C#; if you want to use C++ or assembly code from C#, you'll have to call it via P/Invoke or a C++/CLI wrapper. Out of the two, C++/CLI has less overhead.
That said, if you need to optimize the hell out of a small piece of code, the best option might be to rewrite that piece of code entirely in native C++.