micro-optimization

MMX Register Speed vs Stack for Unsigned Integer Storage

Submitted by 时光怂恿深爱的人放手 on 2021-02-05 08:56:49
Question: I am contemplating an implementation of SHA3 in pure assembly. SHA3 has an internal state of 17 64-bit unsigned integers, but because of the transformations it uses, the best case would be achieved if I had 44 such integers available in registers, plus possibly one scratch register. In that case I would be able to do the entire transform in registers. But this is unrealistic, and optimisation is possible all the way down to even just a few registers. Still, more is potentially…
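The storage trade-off the question describes can be sketched in C: state held in scalar locals is a register-allocation candidate, while indexed array state typically lives in memory (the stack). This is an illustrative pattern only, not the real Keccak round function; the names and the mixing steps are hypothetical.

```c
#include <stdint.h>

/* Hypothetical sketch: scalar locals let the compiler keep hot state in
   registers, while an indexed array is usually loaded from the stack each
   round. The "rotate + xor" steps only illustrate the storage pattern. */
uint64_t mix_scalars(uint64_t a, uint64_t b, uint64_t c) {
    a ^= (b << 1) | (b >> 63);   /* rotate b left by 1, xor into a */
    c ^= a & ~b;                 /* chi-like step, illustrative only */
    return a ^ b ^ c;
}

uint64_t mix_array(const uint64_t s[3]) {
    /* same computation, but the state starts out indexed in memory */
    uint64_t a = s[0], b = s[1], c = s[2];
    return mix_scalars(a, b, c);
}
```

Both forms compute the same result; what differs is how much of the state the compiler can pin in registers across rounds.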

When to use a certain calling convention

Submitted by 二次信任 on 2021-02-05 06:44:05
Question: Are there any guidelines in x86-64 for when a function should abide by the System V calling convention and when it doesn't matter? This is in response to an answer here which mentions using other calling conventions to simplify an internal/local function.

    # gcc 32-bit regparm calling convention
    is_even:            # input in RAX, bool return value in AL
        not %eax        # 2 bytes
        and $1, %al     # 2 bytes
        ret

    # custom calling convention:
    is_even:            # input in RDI
                        # returns in ZF. ZF=1 means even
        test $1, %dil   # 4 bytes…
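What the two asm versions above compute can be checked in portable C: the regparm version returns `(~n) & 1` in AL, and the custom version leaves ZF set exactly when the low bit is clear. A small sketch (function names are mine, not from the question):

```c
#include <stdbool.h>

/* regparm version: not %eax ; and $1, %al  ->  AL = (~n) & 1 */
static bool is_even_regparm(unsigned n) { return (~n) & 1; }

/* custom version: test $1, %dil  ->  ZF = ((n & 1) == 0) */
static bool is_even_zf(unsigned n) { return (n & 1) == 0; }
```

Both agree for every input, which is why the custom convention can drop the `not`/`and` pair entirely and let the caller branch on ZF.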

Instructions to copy the low byte from an int to a char: Simpler to just do a byte load?

Submitted by ≯℡__Kan透↙ on 2021-02-05 06:39:28
Question: I was reading a textbook, and it has an exercise to write x86-64 assembly code based on this C code:

    // Assume that the values of sp and dp are stored in registers %rdi and %rsi
    int *sp;
    char *dp;
    *dp = (char) *sp;

and the answer is:

    // first approach
    movl (%rdi), %eax    // Read 4 bytes
    movb %al, (%rsi)     // Store low-order byte

I can understand it, but I'm just wondering: can't we do something simpler in the first place, such as:

    // second approach
    movb (%rdi), %al     // Read one byte only rather than all four
    movb…
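The two approaches correspond to two ways of writing the copy in C. On little-endian x86, the low-order byte of the int sits at the lowest address, so a plain byte load reads the same value the truncating cast produces. A sketch (function names are mine):

```c
/* first approach: load the whole int, store its low byte
   (movl (%rdi), %eax ; movb %al, (%rsi)) */
void copy_via_word(const int *sp, char *dp) {
    *dp = (char)*sp;
}

/* second approach: byte load straight from the int's first byte
   (movb (%rdi), %al ; movb %al, (%rsi))
   NOTE: equivalent to the cast only on a little-endian machine. */
void copy_via_byte(const int *sp, char *dp) {
    *dp = *(const char *)sp;
}
```

On a big-endian target the byte version would fetch the high-order byte instead, which is why the cast form is the portable one even though both compile to the same thing on x86.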

Are these the smallest possible x86 macros for these stack operations?

Submitted by 廉价感情. on 2021-01-29 17:03:41
Question: I'm making a stack-based language as a fun personal project. I have some signed/unsigned 32-bit values on the stack, and my goal is to write some assembly macros that operate on this stack. Ideally these will be small, since they'll be used a lot. Since I'm new to x86 assembly, I was wondering if you had any tips or improvements you could think of. I'd greatly appreciate your time, thanks! Note: an optimizer is run after the macros are expanded to avoid cases like pop eax; push eax, so…

Trying to understand clang/gcc __builtin_memset on constant size / aligned pointers

Submitted by 喜你入骨 on 2021-01-28 22:06:33
Question: Basically I am trying to understand why both gcc and clang use an xmm register for their __builtin_memset even when the memory destination and size are both divisible by sizeof ymm (or zmm, for that matter) and the CPU supports AVX2 / AVX512, and why GCC implements __builtin_memset on medium-sized values without any SIMD (again assuming the CPU supports SIMD). For example:

    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);

will compile to:

    vpcmpeqd %xmm0, %xmm0, %xmm0
    vmovdqa  %xmm0, (%rdi)…
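The construct under discussion compiles with both gcc and clang (the builtins are compiler extensions, so this assumes one of those compilers). A minimal, runnable version of the question's example:

```c
#include <stdint.h>

/* Fill an aligned 64-byte block with 0xFF bytes.
   __builtin_assume_aligned tells the compiler the pointer is 64-byte
   aligned, so it may use aligned vector stores (vmovdqa); memset(-1)
   stores the byte 0xFF throughout. */
void fill_block(void *ptr) {
    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
}
```

Inspecting the generated code (e.g. `gcc -O2 -mavx2 -S`) shows which vector width the compiler actually chose, which is the crux of the question.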

Shorter x86 call instruction

Submitted by 半腔热情 on 2021-01-28 09:26:52
Question: For context, I am x86 golfing.

    00000005 <start>:
       5: e8 25 00 00 00    call   2f <cube>
       a: 50                push   %eax

    Multiple calls later...

    0000002f <cube>:
      2f: 89 c8             mov    %ecx,%eax
      31: f7 e9             imul   %ecx
      33: f7 e9             imul   %ecx
      35: c3                ret

call took 5 bytes even though the offset would fit into a single byte! Is there any way to write call cube, assemble with the GNU assembler, and get a smaller offset? I understand 16-bit offsets could be used, but ideally I'd have a 2-byte instruction like call reg.

Answer 1: There is no call…
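The size constraint can be shown with the raw encodings (bytes as data, not executed): x86 defines `call rel32` (5 bytes) and `call rel16` but no `call rel8`, while an indirect `call *%reg` is only 2 bytes — at the cost of a `mov` to load the register first.

```c
#include <stdint.h>

/* Byte encodings of the call forms discussed. There is no call rel8,
   so the direct near call is always 5 bytes in 32-bit mode. */
static const uint8_t call_rel32[] = { 0xE8, 0x25, 0x00, 0x00, 0x00 }; /* call 2f <cube> */
static const uint8_t call_eax[]   = { 0xFF, 0xD0 };                   /* call *%eax     */
static const uint8_t mov_eax[]    = { 0xB8, 0x2F, 0x00, 0x00, 0x00 }; /* mov $0x2f,%eax */
```

The register-indirect form only pays off when the same target is called several times after a single `mov`, which matches the "multiple calls later" pattern in the question.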

Which is generally faster to test for zero in x86 ASM: “TEST EAX, EAX” versus “TEST AL, AL”?

Submitted by 喜你入骨 on 2021-01-27 06:28:43
Question: Which is generally faster for testing the byte in AL for zero / non-zero?

    TEST EAX, EAX
    TEST AL, AL

Assume a previous MOVZX EAX, BYTE PTR [ESP+4] instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about. So AL = EAX, and there are no partial-register penalties for reading EAX. Intuitively, just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access…
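The premise — that after a zero-extending load the two tests set ZF identically — can be stated in C: once EAX holds a zero-extended byte, `x == 0` and `(x & 0xFF) == 0` are the same predicate. A sketch (names are mine):

```c
#include <stdint.h>

/* movzx eax, byte ptr [...] : zero-extend a byte into a 32-bit register */
static uint32_t movzx_b(uint8_t b) { return b; }

/* test eax, eax : ZF = (x == 0) */
static int zf_full(uint32_t x) { return x == 0; }

/* test al, al : ZF = ((x & 0xFF) == 0) */
static int zf_low(uint32_t x) { return (x & 0xFF) == 0; }
```

Because the upper 24 bits are guaranteed zero after movzx, the question is purely about encoding size and microarchitectural penalties, not semantics.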

In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

Submitted by 拈花ヽ惹草 on 2021-01-27 03:59:22
Question: Is there a good way of optimising this code (x86-64)?

    mov dword ptr [rsp], 0
    mov dword ptr [rsp+4], 0

The immediate values could be any values, not necessarily zero, but in this instance they are always immediate constants. Is the original pair of stores even slow? Write-combining in the hardware and parallel execution of the μops might just make everything ridiculously fast anyway; I'm wondering if there is no problem to fix. I'm thinking of something like (don't know if the following…
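The usual source-level form of this optimisation is to fuse the two 32-bit constants into one 64-bit value and do a single 8-byte store; with constants known at compile time, compilers emit `movabs $imm64, %rax ; mov %rax, (%rdi)` (or a single sign-extended `mov $imm32` when the constant allows). A sketch (the function name is mine; the layout assumes little-endian x86):

```c
#include <string.h>
#include <stdint.h>

/* Replace two adjacent 32-bit stores with one 64-bit store.
   On little-endian x86, lo lands at p[0] and hi at p[1]. */
void store_pair(uint32_t *p, uint32_t lo, uint32_t hi) {
    uint64_t both = ((uint64_t)hi << 32) | lo;
    memcpy(p, &both, sizeof both);   /* single 8-byte store */
}
```

memcpy of a fixed 8-byte size compiles to a plain 64-bit mov, so this stays a two-instruction sequence while halving the number of stores.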

Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?

Submitted by 核能气质少年 on 2021-01-20 04:49:33
Question: Consider a simple instruction like

    mov RCX, RDI    # 48 89 f9

The 48 is the REX prefix for x86_64. It is not an LCP. But consider adding an LCP (for alignment purposes):

    .byte 0x67
    mov RCX, RDI    # 67 48 89 f9

67 is an address-size prefix, which in this case is for an instruction without addresses. This instruction also has no immediates, and it doesn't use the F7 opcode (false LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV and IDIV). Assume that it doesn't cross a 16-byte boundary either…
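For contrast, the classic length-changing case involves the 0x66 operand-size prefix on an instruction with an immediate: the prefix shrinks the immediate from 4 bytes to 2, so the predecoder's length guess is wrong and (on typical Intel cores) a stall occurs. The encodings in question, as data (bytes are not executed; the stall claims are the question's premise, not something this snippet measures):

```c
#include <stdint.h>

/* True LCP case: 0x66 changes mov's immediate from imm32 to imm16. */
static const uint8_t mov_eax_imm32[] = { 0xB8, 0x34, 0x12, 0x00, 0x00 }; /* mov $0x1234,%eax */
static const uint8_t mov_ax_imm16[]  = { 0x66, 0xB8, 0x34, 0x12 };       /* mov $0x1234,%ax  */

/* The question's case: 0x67 on a register-register mov with no
   immediate or displacement, so no length field changes. */
static const uint8_t mov_rcx_rdi[]     = { 0x48, 0x89, 0xF9 };           /* mov %rdi,%rcx */
static const uint8_t mov_rcx_rdi_67[]  = { 0x67, 0x48, 0x89, 0xF9 };
```

The 0x66 pair differ in total length because the immediate shrank; the 0x67 pair differ only by the prefix byte itself, which is why the question asks whether any stall applies at all.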