micro-optimization

An implementation of std::atomic_thread_fence(std::memory_order_seq_cst) on x86 without extra performance penalties

强颜欢笑 submitted on 2020-06-16 19:07:14
Question: A follow-up question to Why does this `std::atomic_thread_fence` work. Since a dummy interlocked operation is better than _mm_mfence, and there are quite a few ways to implement it, which interlocked operation, and on what data, should be used? Assume inline assembly that is not aware of the surrounding context, but can tell the compiler which registers it clobbers. Answer 1: Short answer for now, without going into too much detail about why. See specifically the discussion in comments on that
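A minimal sketch of the idea under discussion (my own illustration, not the answer's exact code): on x86 a dummy locked read-modify-write of a stack location acts as a full barrier, so it can stand in for mfence while usually being cheaper; the "memory" clobber also stops the compiler from reordering across it. Which memory location to poke is exactly the detail the question is asking about.

```cpp
#include <atomic>

// Sketch, assuming GNU-style inline asm on x86/x86-64. The "+m" constraint
// forces the dummy local into memory; a locked OR of 0 changes no data but
// drains the store buffer like a full fence.
static inline void seq_cst_fence() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned dummy = 0;
    __asm__ __volatile__("lock orl $0, %0" : "+m"(dummy) : : "memory");
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);  // portable fallback
#endif
}
```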

Are two store buffer entries needed for split line/page stores on recent Intel?

心不动则不痛 submitted on 2020-06-08 16:57:10
Question: It is generally understood that one store buffer entry is allocated per store, and that this store buffer entry holds the store data and the physical address¹. In the case that a store crosses a 4096-byte page boundary, two different translations may be needed, one for each page, and hence two different physical addresses may need to be stored. Does this mean that page-crossing stores take 2 store buffer entries? If so, does it also apply to line-crossing stores? ¹ ... and perhaps some/all of the
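For reference, a hedged sketch of how one might provoke such split stores to measure this (my own example; the buffer layout, loop, and counter name are assumptions, not from the question):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Two pages, with an 8-byte store placed so that 4 bytes land at the end of
// the first page and 4 bytes at the start of the second (a page-split store).
alignas(4096) static char buf[2 * 4096];

int main() {
    char* split = buf + 4096 - 4;
    std::uint64_t v = 0;
    for (long i = 0; i < 100000000; ++i) {
        std::memcpy(split, &v, sizeof v);  // typically one unaligned 8-byte store at -O2
        ++v;
    }
    // Compare throughput against the same loop with `split` fully inside one
    // page, e.g. while watching a counter such as MEM_INST_RETIRED.SPLIT_STORES.
    std::printf("%llu\n", static_cast<unsigned long long>(v));
    return 0;
}
```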

repz ret: why all the hassle?

拈花ヽ惹草 submitted on 2020-05-23 09:44:12
Question: The issue of repz ret has been covered here [1] as well as in other sources [2, 3] quite satisfactorily. However, in neither of these sources did I find answers to the following: What is the actual penalty in a quantitative comparison with ret or nop; ret? Especially in the latter case – is decoding one extra instruction (and an empty one at that!) really relevant, when most functions either have 100+ of those or get inlined? Why did this never get fixed in AMD K8, and even made its
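A hedged microbenchmark sketch for the quantitative part (my own code, assuming GNU C++ on x86-64; the function names are made up): three leaf functions that differ only in the return form, so the call overhead of each can be timed separately with perf or rdtsc.

```cpp
// Three leaf functions defined in file-scope asm, identical except for the
// return sequence being compared.
asm(R"(
    .text
    .globl ret_plain
    .globl ret_repz
    .globl ret_nop
ret_plain:
    ret
ret_repz:
    rep ret             # the 2-byte form recommended by AMD for branch targets
ret_nop:
    nop
    ret
)");

extern "C" void ret_plain();
extern "C" void ret_repz();
extern "C" void ret_nop();

int main() {
    // Swap the callee (ret_plain / ret_repz / ret_nop) and time each build,
    // e.g. with `perf stat`, to get a quantitative comparison.
    for (long i = 0; i < 100000000; ++i)
        ret_plain();
    return 0;
}
```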

Is it faster to prepend to a string with substr?

浪子不回头ぞ submitted on 2020-05-14 18:42:28
Question: I just found code that prepends with substr( $str, 0, 0, $prepend ): my $foo = " world!"; substr( $foo, 0, 0, "Hello " ); Is this any faster than my $foo = " world!"; $foo = "Hello $foo"; ? Answer 1: Optrees. If we compare the two optrees, the first has b <@> substr[t2] vK/4 ->c - <0> ex-pushmark s ->7 7 <0> padsv[$foo:2,3] sM ->8 8 <$> const[IV 0] s ->9 9 <$> const[IV 0] s ->a a <$> const[PV "Hello "] s ->b while the second has 8 <+> multiconcat(" world!",-1,7)[$foo:2,3] sK/TARGMY,STRINGIFY ->9 - <0> ex

AVX 512 vs AVX2 performance for simple array processing loops [closed]

假装没事ソ submitted on 2020-05-13 14:49:05
Question: (Closed: this question needs debugging details and is not currently accepting answers.) I'm currently working on some optimizations and comparing vectorization possibilities for DSP applications, which seem ideal for AVX512, since these are just simple uncorrelated array-processing loops. But on a new i9 I didn't measure any reasonable improvements when using AVX512 compared to AVX2. Any
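A hedged sketch of the kind of loop presumably being compared (my own example, since the closed question gives no code): the same array scaling written with AVX2 and AVX-512 intrinsics. Loops like this are often memory-bandwidth-bound, and on many client chips AVX-512 also triggers frequency reduction; either effect can erase the expected 2x gain from wider vectors.

```cpp
// Requires -mavx2 / -mavx512f when compiled with GCC or Clang.
#include <immintrin.h>
#include <cstddef>

void scale_avx2(float* x, float a, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                     // 8 floats per iteration
        __m256 v = _mm256_loadu_ps(x + i);
        _mm256_storeu_ps(x + i, _mm256_mul_ps(v, _mm256_set1_ps(a)));
    }
    for (; i < n; ++i) x[i] *= a;                    // scalar tail
}

void scale_avx512(float* x, float a, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {                   // 16 floats per iteration
        __m512 v = _mm512_loadu_ps(x + i);
        _mm512_storeu_ps(x + i, _mm512_mul_ps(v, _mm512_set1_ps(a)));
    }
    for (; i < n; ++i) x[i] *= a;                    // scalar tail
}
```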

How can I resolve data dependency in pointer arrays?

旧巷老猫 submitted on 2020-05-12 03:17:00
Question: If we have an array of integer pointers that all point to the same int, and we loop over it doing a ++ operation, it'll be 100% slower than when the pointers point to two different ints. Here is a concrete example: int* data[2]; int a, b; a = b = 0; for (auto i = 0ul; i < 2; ++i) { // Case 3: 2.5 sec data[i] = &a; // Case 2: 1.25 sec // if (i & 1) // data[i] = &a; // else // data[i] = &b; } for (auto i = 0ul; i < 1000000000; ++i) { // Case 1: 0.5sec // asm volatile("" : "+g"(i)); // deoptimize
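A minimal self-contained reconstruction of the scenario (my own sketch, not the asker's full program): when both pointers alias the same int, every increment's load must wait for the previous increment's store, forming one long store-forwarding dependency chain; with two distinct ints there are two independent chains the CPU can overlap, giving roughly twice the throughput.

```cpp
#include <cstdio>

int main() {
    int a = 0, b = 0;
    int* same[2] = { &a, &a };   // both entries alias one int (slow case)
    int* diff[2] = { &a, &b };   // two independent ints (faster case)

    int** data = same;           // switch to `diff` to compare timings
    for (unsigned long i = 0; i < 1000000000ul; ++i) {
        ++*data[i & 1];          // serial store-to-load chain when both alias `a`
    }
    std::printf("%d %d\n", a, b);   // keep the work from being optimized away
    return 0;
}
```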

80286: Which is the fastest way to multiply by 10?

浪子不回头ぞ submitted on 2020-05-09 06:31:07
Question: To multiply a number by any power of 2, I shift it that many times. Is there any similar technique to multiply a number by 10 in fewer cycles? Answer 1: The 80286 did not have a barrel shifter; that was introduced with the 80386. According to the timing tables in the Microsoft Macro Assembler 5.0 documentation (1987), SHL reg, immed8 takes 5+n cycles, whereas SHL reg, 1 takes 2 cycles. ADD reg, reg takes 2 cycles, as does MOV reg, reg. IMUL reg16, immed takes 21 cycles. Therefore, the
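A sketch of the shift-and-add decomposition the answer is building toward (my own illustration; the register choice in the comments is hypothetical): x * 10 = ((x * 4) + x) * 2, which with the cited timings costs about 10 cycles against 21 for IMUL.

```cpp
#include <cstdint>
#include <cassert>

static inline std::uint16_t mul10(std::uint16_t x) {
    unsigned t = x;   // MOV  bx, ax   ; 2 cycles
    t <<= 1;          // SHL  bx, 1    ; 2 cycles  (x*2)
    t <<= 1;          // SHL  bx, 1    ; 2 cycles  (x*4)
    t += x;           // ADD  bx, ax   ; 2 cycles  (x*5)
    t <<= 1;          // SHL  bx, 1    ; 2 cycles  (x*10)
    return static_cast<std::uint16_t>(t);   // ~10 cycles total vs. 21 for IMUL reg16, immed
}

int main() {
    for (std::uint16_t x = 0; x < 1000; ++x)
        assert(mul10(x) == static_cast<std::uint16_t>(x * 10));
    return 0;
}
```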
