micro-optimization

MMX Register Speed vs Stack for Unsigned Integer Storage

Submitted by 时光怂恿深爱的人放手 on 2021-02-05 08:56:49
Question: I am contemplating an implementation of SHA3 in pure assembly. SHA3 has an internal state of 17 64-bit unsigned integers, but because of the transformations it uses, the best case would be achieved if I had 44 such integers available in registers, plus possibly one scratch register. In that case I would be able to do the entire transform in registers. But this is unrealistic, and optimisation is possible all the way down to even just a few registers. Still, more is potentially…
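The storage trade-off the question describes can be sketched in C: state held in scalar locals is a register-allocation candidate, while indexed array state typically lives in memory (the stack). This is an illustrative pattern only, not the real Keccak round function; the names and the mixing steps are hypothetical.

```c
#include <stdint.h>

/* Hypothetical sketch: scalar locals let the compiler keep hot state in
   registers, while an indexed array is usually loaded from the stack each
   round. The "rotate + xor" steps only illustrate the storage pattern. */
uint64_t mix_scalars(uint64_t a, uint64_t b, uint64_t c) {
    a ^= (b << 1) | (b >> 63);   /* rotate b left by 1, xor into a */
    c ^= a & ~b;                 /* chi-like step, illustrative only */
    return a ^ b ^ c;
}

uint64_t mix_array(const uint64_t s[3]) {
    /* same computation, but the state starts out indexed in memory */
    uint64_t a = s[0], b = s[1], c = s[2];
    return mix_scalars(a, b, c);
}
```

Both forms compute the same result; what differs is how much of the state the compiler can pin in registers across rounds.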

When to use a certain calling convention

Submitted by 二次信任 on 2021-02-05 06:44:05
Question: Are there any guidelines in x86-64 for when a function should abide by the System V calling convention and when it doesn't matter? This is in response to an answer here which mentions using other calling conventions to simplify an internal/local function.

    # gcc 32-bit regparm calling convention
    is_even:            # input in RAX, bool return value in AL
        not %eax        # 2 bytes
        and $1, %al     # 2 bytes
        ret

    # custom calling convention:
    is_even:            # input in RDI
                        # returns in ZF. ZF=1 means even
        test $1, %dil   # 4 bytes…
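What the two asm versions above compute can be checked in portable C: the regparm version returns `(~n) & 1` in AL, and the custom version leaves ZF set exactly when the low bit is clear. A small sketch (function names are mine, not from the question):

```c
#include <stdbool.h>

/* regparm version: not %eax ; and $1, %al  ->  AL = (~n) & 1 */
static bool is_even_regparm(unsigned n) { return (~n) & 1; }

/* custom version: test $1, %dil  ->  ZF = ((n & 1) == 0) */
static bool is_even_zf(unsigned n) { return (n & 1) == 0; }
```

Both agree for every input, which is why the custom convention can drop the `not`/`and` pair entirely and let the caller branch on ZF.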

Instructions to copy the low byte from an int to a char: Simpler to just do a byte load?

Submitted by ≯℡__Kan透↙ on 2021-02-05 06:39:28
Question: I was reading a textbook, and it has an exercise to write x86-64 assembly code based on this C code:

    // Assume that the values of sp and dp are stored in registers %rdi and %rsi
    int *sp;
    char *dp;
    *dp = (char) *sp;

and the answer is:

    // first approach
    movl (%rdi), %eax    // Read 4 bytes
    movb %al, (%rsi)     // Store low-order byte

I can understand it, but I'm just wondering: can't we do something simpler in the first place, such as:

    // second approach
    movb (%rdi), %al     // Read one byte only rather than all four
    movb…
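The two approaches correspond to two ways of writing the copy in C. On little-endian x86, the low-order byte of the int sits at the lowest address, so a plain byte load reads the same value the truncating cast produces. A sketch (function names are mine):

```c
/* first approach: load the whole int, store its low byte
   (movl (%rdi), %eax ; movb %al, (%rsi)) */
void copy_via_word(const int *sp, char *dp) {
    *dp = (char)*sp;
}

/* second approach: byte load straight from the int's first byte
   (movb (%rdi), %al ; movb %al, (%rsi))
   NOTE: equivalent to the cast only on a little-endian machine. */
void copy_via_byte(const int *sp, char *dp) {
    *dp = *(const char *)sp;
}
```

On a big-endian target the byte version would fetch the high-order byte instead, which is why the cast form is the portable one even though both compile to the same thing on x86.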

Are these the smallest possible x86 macros for these stack operations?

Submitted by 廉价感情. on 2021-01-29 17:03:41
Question: I'm making a stack-based language as a fun personal project. I have some signed/unsigned 32-bit values on the stack, and my goal is to write some assembly macros that operate on this stack. Ideally these will be small, since they'll be used a lot. Since I'm new to x86 assembly, I was wondering if you had any tips or improvements you could think of. I'd greatly appreciate your time, thanks! Note: an optimizer is run after the macros are expanded to avoid cases like pop eax; push eax, so…

Trying to understand clang/gcc __builtin_memset on constant size / aligned pointers

Submitted by 喜你入骨 on 2021-01-28 22:06:33
Question: Basically I am trying to understand why both gcc and clang use an xmm register for their __builtin_memset even when the memory destination and size are both divisible by sizeof ymm (or zmm, for that matter) and the CPU supports AVX2 / AVX512, and why GCC implements __builtin_memset on medium-sized values without any SIMD (again assuming the CPU supports SIMD). For example:

    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);

will compile to:

    vpcmpeqd %xmm0, %xmm0, %xmm0
    vmovdqa  %xmm0, (%rdi)…
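The construct under discussion compiles with both gcc and clang (the builtins are compiler extensions, so this assumes one of those compilers). A minimal, runnable version of the question's example:

```c
#include <stdint.h>

/* Fill an aligned 64-byte block with 0xFF bytes.
   __builtin_assume_aligned tells the compiler the pointer is 64-byte
   aligned, so it may use aligned vector stores (vmovdqa); memset(-1)
   stores the byte 0xFF throughout. */
void fill_block(void *ptr) {
    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
}
```

Inspecting the generated code (e.g. `gcc -O2 -mavx2 -S`) shows which vector width the compiler actually chose, which is the crux of the question.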

Shorter x86 call instruction

Submitted by 半腔热情 on 2021-01-28 09:26:52
Question: For context, I am x86 golfing.

    00000005 <start>:
       5: e8 25 00 00 00    call   2f <cube>
       a: 50                push   %eax

    Multiple calls later...

    0000002f <cube>:
      2f: 89 c8             mov    %ecx,%eax
      31: f7 e9             imul   %ecx
      33: f7 e9             imul   %ecx
      35: c3                ret

call took 5 bytes even though the offset would fit into a single byte! Is there any way to write call cube, assemble with the GNU assembler, and get a smaller offset? I understand 16-bit offsets could be used, but ideally I'd have a 2-byte instruction like call reg.

Answer 1: There is no call…
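The size constraint can be shown with the raw encodings (bytes as data, not executed): x86 defines `call rel32` (5 bytes) and `call rel16` but no `call rel8`, while an indirect `call *%reg` is only 2 bytes — at the cost of a `mov` to load the register first.

```c
#include <stdint.h>

/* Byte encodings of the call forms discussed. There is no call rel8,
   so the direct near call is always 5 bytes in 32-bit mode. */
static const uint8_t call_rel32[] = { 0xE8, 0x25, 0x00, 0x00, 0x00 }; /* call 2f <cube> */
static const uint8_t call_eax[]   = { 0xFF, 0xD0 };                   /* call *%eax     */
static const uint8_t mov_eax[]    = { 0xB8, 0x2F, 0x00, 0x00, 0x00 }; /* mov $0x2f,%eax */
```

The register-indirect form only pays off when the same target is called several times after a single `mov`, which matches the "multiple calls later" pattern in the question.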

Which is generally faster to test for zero in x86 ASM: “TEST EAX, EAX” versus “TEST AL, AL”?

Submitted by 喜你入骨 on 2021-01-27 06:28:43
Question: Which is generally faster for testing the byte in AL for zero / non-zero?

    TEST EAX, EAX
    TEST AL, AL

Assume a previous MOVZX EAX, BYTE PTR [ESP+4] instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about. So AL = EAX, and there are no partial-register penalties for reading EAX. Intuitively, just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access…
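The premise — that after a zero-extending load the two tests set ZF identically — can be stated in C: once EAX holds a zero-extended byte, `x == 0` and `(x & 0xFF) == 0` are the same predicate. A sketch (names are mine):

```c
#include <stdint.h>

/* movzx eax, byte ptr [...] : zero-extend a byte into a 32-bit register */
static uint32_t movzx_b(uint8_t b) { return b; }

/* test eax, eax : ZF = (x == 0) */
static int zf_full(uint32_t x) { return x == 0; }

/* test al, al : ZF = ((x & 0xFF) == 0) */
static int zf_low(uint32_t x) { return (x & 0xFF) == 0; }
```

Because the upper 24 bits are guaranteed zero after movzx, the question is purely about encoding size and microarchitectural penalties, not semantics.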

In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

Submitted by 拈花ヽ惹草 on 2021-01-27 03:59:22
Question: Is there a good way of optimising this code (x86-64)?

    mov dword ptr [rsp], 0
    mov dword ptr [rsp+4], 0

The immediate values could be any values, not necessarily zero, but in this instance they are always immediate constants. Is the original pair of stores even slow? Write-combining in the hardware and parallel execution of the μops might just make everything ridiculously fast anyway; I'm wondering if there is no problem to fix. I'm thinking of something like (don't know if the following…
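The usual source-level form of this optimisation is to fuse the two 32-bit constants into one 64-bit value and do a single 8-byte store; with constants known at compile time, compilers emit `movabs $imm64, %rax ; mov %rax, (%rdi)` (or a single sign-extended `mov $imm32` when the constant allows). A sketch (the function name is mine; the layout assumes little-endian x86):

```c
#include <string.h>
#include <stdint.h>

/* Replace two adjacent 32-bit stores with one 64-bit store.
   On little-endian x86, lo lands at p[0] and hi at p[1]. */
void store_pair(uint32_t *p, uint32_t lo, uint32_t hi) {
    uint64_t both = ((uint64_t)hi << 32) | lo;
    memcpy(p, &both, sizeof both);   /* single 8-byte store */
}
```

memcpy of a fixed 8-byte size compiles to a plain 64-bit mov, so this stays a two-instruction sequence while halving the number of stores.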

Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?

Submitted by 核能气质少年 on 2021-01-20 04:49:33
Question: Consider a simple instruction like

    mov RCX, RDI    # 48 89 f9

The 48 is the REX prefix for x86_64. It is not an LCP. But consider adding an LCP (for alignment purposes):

    .byte 0x67
    mov RCX, RDI    # 67 48 89 f9

67 is an address-size prefix, which in this case is for an instruction without addresses. This instruction also has no immediates, and it doesn't use the F7 opcode (false LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV and IDIV). Assume that it doesn't cross a 16-byte boundary either…
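For contrast, the classic length-changing case involves the 0x66 operand-size prefix on an instruction with an immediate: the prefix shrinks the immediate from 4 bytes to 2, so the predecoder's length guess is wrong and (on typical Intel cores) a stall occurs. The encodings in question, as data (bytes are not executed; the stall claims are the question's premise, not something this snippet measures):

```c
#include <stdint.h>

/* True LCP case: 0x66 changes mov's immediate from imm32 to imm16. */
static const uint8_t mov_eax_imm32[] = { 0xB8, 0x34, 0x12, 0x00, 0x00 }; /* mov $0x1234,%eax */
static const uint8_t mov_ax_imm16[]  = { 0x66, 0xB8, 0x34, 0x12 };       /* mov $0x1234,%ax  */

/* The question's case: 0x67 on a register-register mov with no
   immediate or displacement, so no length field changes. */
static const uint8_t mov_rcx_rdi[]     = { 0x48, 0x89, 0xF9 };           /* mov %rdi,%rcx */
static const uint8_t mov_rcx_rdi_67[]  = { 0x67, 0x48, 0x89, 0xF9 };
```

The 0x66 pair differ in total length because the immediate shrank; the 0x67 pair differ only by the prefix byte itself, which is why the question asks whether any stall applies at all.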