lddqu used a different strategy than movdqu on P4, but runs identically on all other CPUs that support it. There's no particular downside (since SSE3 instructions don't take any extra bytes of machine code, and are fairly widely supported even by AMD at this point), but no upside at all unless you care about P4.
Dark Shikari (one of the x264 video encoder lead developers, responsible for a lot of SSE speedups) went into detail about it in a blog post in 2008. This is an archive.org link since the original is offline, but there's a lot of good stuff in his blog.
The most interesting point he makes is that Core2 still has slow unaligned loads, where manually doing two aligned loads and a palignr can be faster, but palignr is only available with an immediate shift count. Since Core2 runs lddqu the same as movdqu, it doesn't help.
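As a rough sketch of the two-aligned-loads-plus-palignr idea, here is one way it could look with intrinsics. It assumes GCC or Clang on x86; the helper name `load16_via_palignr` and the macro are made up for illustration and are not taken from x264 or the blog post. Because the shift count must be an immediate, a run-time offset needs a separate palignr for each possible value, which is what the switch spells out.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_storeu_si128 */
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (palignr) */

/* palignr encodes its byte shift as an immediate, so each possible
   run-time offset needs its own instruction instance. */
#define ALIGNR_CASE(n) case n: v = _mm_alignr_epi8(hi, lo, n); break;

/* Hypothetical helper: fetch 16 bytes from any address using only
   aligned loads. The two aligned blocks it reads are exactly the ones
   the requested 16 bytes overlap, so it never touches a page that the
   plain unaligned access would not touch. */
__attribute__((target("ssse3")))
void load16_via_palignr(const uint8_t *p, uint8_t out[16])
{
    uintptr_t addr = (uintptr_t)p;
    const __m128i *base = (const __m128i *)(addr & ~(uintptr_t)15);
    unsigned off = (unsigned)(addr & 15);

    __m128i lo = _mm_load_si128(base);          /* block containing p */
    __m128i v = lo;                             /* already right if off == 0 */

    if (off != 0) {
        __m128i hi = _mm_load_si128(base + 1);  /* block containing p + 15 */
        switch (off) {
            ALIGNR_CASE(1)  ALIGNR_CASE(2)  ALIGNR_CASE(3)
            ALIGNR_CASE(4)  ALIGNR_CASE(5)  ALIGNR_CASE(6)
            ALIGNR_CASE(7)  ALIGNR_CASE(8)  ALIGNR_CASE(9)
            ALIGNR_CASE(10) ALIGNR_CASE(11) ALIGNR_CASE(12)
            ALIGNR_CASE(13) ALIGNR_CASE(14) ALIGNR_CASE(15)
        }
    }
    _mm_storeu_si128((__m128i *)out, v);        /* hand the bytes back */
}
```

In real code the offset is usually known per loop, so you would pick one specialised loop body per offset rather than branch on every load. The branch here only shows why the immediate-only shift count is awkward.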
Apparently Core1 does implement lddqu specially, so it's not just P4 after all.
This Intel blog post about the history of lddqu/movdqu (which I found in 2 seconds with google for lddqu vs movdqu, /scold @Zboson) explains:
(on P4 only):
The instruction works by loading a 32-byte block aligned on a 16-byte boundary, extracting the 16 bytes corresponding to the unaligned access.
Because the instruction loads more bytes than requested, some usage restrictions apply. Lddqu should be avoided on Uncached (UC) and Write-Combining (USWC) memory regions. Also, by its implementation, lddqu should be avoided in situations where store-load forwarding is expected.
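To make the over-read in that description concrete, here is a small sketch of which bytes the quoted strategy would touch. It models only the address arithmetic as the quote states it (a 32-byte block starting at the enclosing 16-byte boundary). The struct and function names are made up for illustration, and it makes no claim about what real P4 hardware does when the address is already aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of the bytes touched by one 16-byte lddqu
   under the "load an aligned 32-byte block" strategy from the quote. */
typedef struct {
    uintptr_t first;    /* first byte address read */
    uintptr_t end;      /* one past the last byte address read */
    unsigned  before;   /* extra bytes read below the requested 16 */
    unsigned  after;    /* extra bytes read above the requested 16 */
} lddqu_footprint;

lddqu_footprint lddqu_p4_footprint(uintptr_t addr)
{
    lddqu_footprint f;

    assert(addr <= UINTPTR_MAX - 32);          /* keep the arithmetic in range */

    f.first  = addr & ~(uintptr_t)15;          /* round down to 16-byte boundary */
    f.end    = f.first + 32;                   /* the block is 32 bytes long */
    f.before = (unsigned)(addr - f.first);     /* bytes in front of the request */
    f.after  = (unsigned)(f.end - (addr + 16)); /* bytes behind the request */
    return f;
}
```

For an address 5 bytes past a 16-byte boundary this reports 5 extra bytes before and 11 after the requested 16. Reading those extra bytes is what matters on UC and USWC memory. The store-forwarding caveat presumably comes from the same place: the bytes actually loaded don't line up with an earlier 16-byte store to the same address.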
So I guess this explains why they didn't just use that strategy to implement movdqu all the time.
I guess the decoders don't have the memory-type information available, and that's when the decision has to be made on which uops to decode the instruction to. So trying to be "smart" about using the better strategy opportunistically on WB memory probably wasn't possible, even if it was desirable. (Which it isn't because of store-forwarding).
The summary from that blog post:
starting from Intel Core 2 brand (Core microarchitecture, from mid 2006, Merom CPU and higher) up to the future: lddqu does the same thing as movdqu
In the other words:
* if CPU supports Supplemental Streaming SIMD Extensions 3 (SSSE3) -> lddqu does the same thing as movdqu,
* If CPU doesn’t support SSSE3 but supports SSE3 -> go for lddqu (and note that story about memory types)
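The rule in that summary is simple enough to write down as a function. This is only a sketch of the decision as quoted; the enum and function names are invented for the example. The two flags are assumed to come from wherever you already detect CPU features, for instance CPUID or, on GCC and Clang, `__builtin_cpu_supports("sse3")` and `__builtin_cpu_supports("ssse3")`.

```c
#include <stdbool.h>

/* Hypothetical names, used only for this example. */
typedef enum {
    USE_MOVDQU,   /* plain unaligned load */
    USE_LDDQU     /* SSE3 lddqu, worth it only on pre-SSSE3 CPUs */
} unaligned_load_choice;

unaligned_load_choice choose_unaligned_load(bool has_sse3, bool has_ssse3)
{
    if (has_ssse3)
        return USE_MOVDQU;   /* lddqu behaves like movdqu here, so no gain */
    if (has_sse3)
        return USE_LDDQU;    /* SSE3 without SSSE3: lddqu may be faster */
    return USE_MOVDQU;       /* no SSE3 at all: lddqu is not available */
}
```

Even when this returns `USE_LDDQU`, the memory-type and store-forwarding restrictions quoted earlier still apply, so it is a per-CPU default rather than something to use on every load.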