SIMD versions of SHLD/SHRD instructions

你离开我真会死。 提交于 2019-11-30 20:44:18

In this answer I'm only going to talk about x64.
x86 has been outdated for 15 years now if you're coding in 2016 it hardly makes sense to be stuck in 2000.
All times are according to Agner Fog's instruction tables.

Intel Skylake example timings*
The shld/shrd instructions are rather slow on x64.
Even on Intel skylake they have a latency of 4 cycles and uses 4 uops meaning it uses up a lot of execution units, on older processors they're even slower.
I'm going to assume you want to shift by a variable amount, which means a

SHLD RAX,RDX,cl        4 uops, 4 cycle latency.  -> 1/16 per bit

Using 2 shifts + add you can do this faster slower.

@Init:
MOV R15,-1
SHR R15,cl    //mask for later use.    
@Work:
SHL RAX,cl        3 uops, 2 cycle latency
ROL RDX,cl        3 uops, 2 cycle latency
AND RDX,R15       1 uops, 0.25 latency
OR RAX,RDX        1 uops, 0.25 latency    
//Still needs unrolling to achieve least amount of slowness.

Note that this only shifts 64 bits because RDX is not affected.
So you're trying to beat 4 cycles per 64 bits.

//4*64 bits parallel shift.  
//Shifts in zeros.
VPSLLVQ YMM2, YMM2, YMM3    1uop, 0.5 cycle latency.  

However if you want it to do exactly what SHLD does you'll need to use an extra VPSLRVQ and an OR to combine the two results.

VPSLLVQ YMM1, YMM2, YMM3    1uop, 0.5 cycle latency.  
VPSRLVQ YMM5, YMM2, YMM4    1uop, 0.5 cycle latency.   
VPOR    YMM1, YMM1, YMM5    1uop, 0.33 cycle latency.   

You'll need to interleave 4 sets of these costing you (3*4)+2=14 YMM registers.
Doing so I doubt you'll profit from the low .33 latency of VPADDQ so I'll assume a 0.5 latency instead.
That makes 3uops, 1.5 cycle latency for 256 bits = 1/171 per bit = 0.37 cycle per QWord = 10x faster, not bad.
If you are able to get 1.33 cycle per 256 bits = 1/192 per bit = 0.33 cycle per QWord = 12x faster.

'It’s the Memory, Stupid!'
Obviously I've not added in loop overhead and load/stores to/from memory.
The loop overhead is tiny given proper alignment of jump targets, but the memory
access will easily be the biggest slowdown.
A single cache miss to main memory on Skylake can cost you more than 250 cycles1.
It is in clever management of memory that the major gains will be made.
The 12 times possible speed-up using AVX256 is small potatoes in comparison.

I'm not counting the set up of the shift counter in CL/(YMM3/YMM4) because I'm assuming you'll reuse that value over many iterations.

You're not going to beat that with AVX512 instructions, because consumer grade CPU's with AVX512 instructions are not yet available.
The only current processor that supports currently is Knights Landing.

*) All these timings are best case values, and should be taken as indications, not as hard values.
1) Cost of cache miss in Skylake: 42 cycles + 52ns = 42 + (52*4.6Ghz) = 281 cycles.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!