micro-optimization

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

狂风中的少年 submitted on 2020-05-09 02:27:56
Question: Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max turbo indefinitely. (See "SIMD instructions lowering CPU frequency" and "Dynamically determining where a rogue AVX-512 instruction is executing".) Presumably Ice Lake is similar. (Workaround: this is not a problem for zmm16..31, according to @BeeOnRope's comments, which I quoted in "Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?" So this strlen could just use vpxord xmm16,xmm16,xmm16

Micro Optimization of a 4-bucket histogram of a large array or list

旧城冷巷雨未停 submitted on 2020-04-25 11:30:27
Question: I have a special question, which I will try to describe as accurately as possible. I am doing a very important "micro-optimization": a loop that runs for days at a time. If I can cut this loop's running time in half, 10 days would decrease to only 5 days, etc. The loop I have now is the function "testbenchmark1". I have 4 indexes that I need to increment in a loop like this, but I have noticed that accessing an index from a list takes some extra time. This is what I
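The excerpt above is cut off before the code for "testbenchmark1", so the following is only a minimal sketch in C of what a 4-bucket histogram loop can look like. The function name histogram4, the uint8_t input type, and the assumption that every input value is already a bucket index in 0..3 are illustrative and do not come from the original question. It shows one possible approach: keeping the four counters in local variables and adding the result of a comparison instead of indexing a counter array on every iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count how many elements fall into each of 4 buckets.
   Assumption: each input value is a bucket index in 0..3. The mask below
   maps any out-of-range value onto its low 2 bits instead of rejecting it. */
void histogram4(const uint8_t *data, size_t n, uint64_t counts[4])
{
    /* Local counters can stay in registers for the whole loop. */
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned b = data[i] & 3u;
        /* Each comparison evaluates to 0 or 1, so no branch is required. */
        c0 += (b == 0);
        c1 += (b == 1);
        c2 += (b == 2);
        c3 += (b == 3);
    }

    counts[0] = c0;
    counts[1] = c1;
    counts[2] = c2;
    counts[3] = c3;
}
```

Whether this is faster than a plain counts[data[i]]++ loop depends on the compiler, the data, and the CPU, so it would need to be measured on the real workload.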

Micro Optimization of a 4-bucket histogram of a large array or list

一笑奈何 submitted on 2020-04-25 11:28:36
Question: I have a special question, which I will try to describe as accurately as possible. I am doing a very important "micro-optimization": a loop that runs for days at a time. If I can cut this loop's running time in half, 10 days would decrease to only 5 days, etc. The loop I have now is the function "testbenchmark1". I have 4 indexes that I need to increment in a loop like this, but I have noticed that accessing an index from a list takes some extra time. This is what I

Micro Optimization of a 4-bucket histogram of a large array or list

被刻印的时光 ゝ submitted on 2020-04-25 11:27:04
Question: I have a special question, which I will try to describe as accurately as possible. I am doing a very important "micro-optimization": a loop that runs for days at a time. If I can cut this loop's running time in half, 10 days would decrease to only 5 days, etc. The loop I have now is the function "testbenchmark1". I have 4 indexes that I need to increment in a loop like this, but I have noticed that accessing an index from a list takes some extra time. This is what I

Understanding the difference between ++i and i++ at the Assembly Level

▼魔方 西西 submitted on 2020-03-01 14:37:27
Question: I know that variations of this question have been asked here multiple times, but I'm not asking what the difference between the two is. I would just like some help understanding the assembly behind both forms. I think my question is more about the whys than the what of the difference. I'm reading Prata's C Primer Plus, and in the part dealing with the increment operator ++ and the difference between using i++ and ++i, the author says that if the operator is used by itself, such as ego++; it
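As a small illustration of the two forms discussed above, here is a sketch in C. The function names are made up for this example and are not from the question or the book. When the value of the expression is discarded, ++ego and ego++ have the same effect on the variable, and an optimizing compiler would typically be expected to emit the same code for both; the forms differ only when the value of the expression is actually used.

```c
/* Standalone use: the value of the increment expression is discarded,
   so both forms simply leave ego one larger. */
int bump_pre(int ego)  { ++ego; return ego; }
int bump_post(int ego) { ego++; return ego; }

/* Value used: pre-increment yields the new value,
   post-increment yields the old value. */
int use_pre(int i)  { int r = ++i; return r; }  /* returns i + 1 */
int use_post(int i) { int r = i++; return r; }  /* returns the original i */
```

Compiling these functions with something like gcc -O2 -S and comparing the output for bump_pre and bump_post is one way to check the claim on a particular compiler.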

Understanding the difference between ++i and i++ at the Assembly Level

最后都变了- submitted on 2020-03-01 14:35:07
Question: I know that variations of this question have been asked here multiple times, but I'm not asking what the difference between the two is. I would just like some help understanding the assembly behind both forms. I think my question is more about the whys than the what of the difference. I'm reading Prata's C Primer Plus, and in the part dealing with the increment operator ++ and the difference between using i++ and ++i, the author says that if the operator is used by itself, such as ego++; it

Finding an efficient shift/add/LEA instruction sequence to multiply by a given constant (avoiding MUL/IMUL)

假装没事ソ submitted on 2020-02-24 09:59:45
Question: I'm trying to write a C program mult.c whose main function receives 1 int argument (parsed with atoi(argv[1])), which is some constant k we want to multiply by. This program will generate an assembly file mult.s that implements int mult(int x) { return x * k; } for that constant k. (This is a follow-up to "Efficient Assembly multiplication".) For example, if main() in mult.c gets 14 as its argument, it may generate (though it is not minimal, as emphasized later): .section .text .globl mult
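The generated assembly in the excerpt is cut off, so the following is not the question's output. It is only a sketch in C of the underlying idea for the k = 14 case mentioned above, using the identity 14 = 16 - 2, so that x * 14 = (x << 4) - (x << 1). The function name mult14 is hypothetical. The shifts are done on unsigned values because left-shifting a negative signed int is undefined behavior in C; the final conversion back to int assumes the usual two's-complement wraparound.

```c
/* Hypothetical example: multiply by the constant 14 using two shifts and
   one subtraction, based on 14 = 16 - 2. */
int mult14(int x)
{
    unsigned ux = (unsigned)x;
    /* (x * 16) - (x * 2), computed modulo 2^N on unsigned values. */
    return (int)((ux << 4) - (ux << 1));
}
```

A generator for arbitrary k would have to search over such decompositions (sums and differences of shifted values, plus LEA-style multiplies by 3, 5, and 9 on x86) to find a short sequence; this sketch covers only the single constant.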

Verifying compiler optimizations in gcc/g++ by analyzing assembly listings

纵饮孤独 submitted on 2020-01-22 07:27:05
Question: I just asked a question related to how the compiler optimizes certain C++ code, and I was looking around SO for any questions about how to verify that the compiler has performed certain optimizations. I was trying to look at the assembly listing generated with g++ (g++ -c -g -O2 -Wa,-ahl=file.s file.c) to possibly see what is going on under the hood, but the output is too cryptic for me. What techniques do people use to tackle this problem, and are there any good references on how to

Why jnz requires 2 cycles to complete in an inner loop

佐手、 submitted on 2020-01-20 08:07:45
Question: I'm on an IvyBridge. I found the performance behavior of jnz to be inconsistent between an inner loop and an outer loop. The following simple program has an inner loop with a fixed size of 16:

    global _start
    _start:
        mov rcx, 100000000
    .loop_outer:
        mov rax, 16
    .loop_inner:
        dec rax
        jnz .loop_inner
        dec rcx
        jnz .loop_outer
        xor edi, edi
        mov eax, 60
        syscall

The perf tool shows the outer loop runs at 32c/iter. This suggests that jnz requires 2 cycles to complete. I then searched Agner's instruction table; conditional jump has 1-2

Address-size override prefix in 64-bit or using 64-bit registers

时光总嘲笑我的痴心妄想 submitted on 2020-01-16 08:39:10
Question: In 64-bit assembly addressing, which way is better: mov cl, BYTE [ebx + .DATA] or mov cl, BYTE [rbx + .DATA]? The encoding for the first is 67 8a 4b .. and for the second is 8a 4b .., so if we use a 32-bit register, we need a 0x67 prefix (the address-size override prefix). That looks like extra work, but I heard something about the cache and that it's better to use 32-bit instead of 64-bit. So which way is better, and why?

Answer 1: TL:DR: you basically never want