x86

Building backward compatible binaries with newer CPU instructions support

那年仲夏, submitted on 2021-01-27 12:41:41
Question: What is the best way to implement multiple versions of the same function, so that it uses specific CPU instructions if available (tested at run time) and falls back to a slower implementation if not? For example, x86 BMI2 provides a very useful PDEP instruction. How would I write C code that tests BMI2 availability on the executing CPU at startup and uses one of the two implementations -- one that uses the _pdep_u64 call (available with -mbmi2), and another that does bit manipulation "by
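One way to approach the question above, sketched under assumptions: the compiler is GCC or Clang targeting x86-64, so `__builtin_cpu_supports` and the `target("bmi2")` function attribute are available. The names `pdep_soft`, `pdep_bmi2`, `select_pdep` and the macro `HAVE_X86_DISPATCH` are hypothetical and do not come from the original question. On other compilers or architectures the sketch simply falls back to the portable loop.

```c
#include <stdint.h>

/* Assumption: GCC or Clang on x86-64; otherwise only the fallback is built. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Portable fallback: deposit the low bits of src, in order, at the
 * positions of the set bits of mask (the behaviour of PDEP). */
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                     /* clear that bit */
    }
    return result;
}

#if HAVE_X86_DISPATCH
/* Only this function is compiled with BMI2 enabled, so the rest of the
 * program stays runnable on CPUs without BMI2. */
__attribute__((target("bmi2")))
static uint64_t pdep_bmi2(uint64_t src, uint64_t mask) {
    return _pdep_u64(src, mask);
}
#endif

typedef uint64_t (*pdep_fn)(uint64_t, uint64_t);

/* Run-time selection: call once at startup and keep the returned pointer. */
static pdep_fn select_pdep(void) {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        return pdep_bmi2;
#endif
    return pdep_soft;
}
```

A caller would typically store the result of `select_pdep()` in a static function pointer during initialization and call through it afterwards. The per-function `target` attribute avoids building the whole translation unit with -mbmi2, which could otherwise let the compiler emit BMI2 instructions outside the guarded path.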

What is “Code” in Linux Kernel crash messages?

别来无恙, submitted on 2021-01-27 11:50:55
Question: I have the following stack trace and crash information after the Linux kernel failed to load:
[ 3.684670] ------------[ cut here ]------------
[ 3.695507] Bad FPU state detected at fpu__clear+0x91/0xc2, reinitializing FPU registers.
[ 3.695508] traps: No user code available.
[ 3.704745] invalid opcode: 0000 [#1] PREEMPT
[ 3.715304] CPU: 0 PID: 1 Comm: swapper Not tainted 4.19.50-android-x86-geeb7e76-dirty #1
[ 3.724594] Hardware name: AAEON UP-APL01/UP-APL01, BIOS UPA1AM21 09/01/2017
[ 3

How does “+&r” differ from “+r”?

风流意气都作罢, submitted on 2021-01-27 07:02:58
Question: GCC's inline assembler recognizes the declarators =r and =&r. These make sense to me: the =r lets the assembler reuse an input register for output. However, GCC's inline assembler also recognizes the declarators +r and +&r. These make less sense to me. After all, isn't the distinction between +r and +&r a distinction without a difference? Does the +r alone not suffice to tell the compiler to reserve a register for the sole use of a single variable? For example, what is wrong with the
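A small sketch of a case where the distinction can matter, assuming GCC or Clang on x86 with AT&T syntax; the function name `add_twice` is hypothetical and the example is not taken from the original question. The read-write operand is modified by the first instruction while the separate input is still needed by the second. With plain "+r", the compiler may place both operands in the same register when it can see they hold the same value; the "&" (earlyclobber) modifier rules that out.

```c
#include <stdint.h>

/* Computes acc + 2*x using two dependent adds. */
static inline uint32_t add_twice(uint32_t acc, uint32_t x) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* %0 is written by the first add while %1 is still read by the
     * second one, so %0 must not share a register with %1.
     * "+&r" states that; "+r" alone would not. */
    __asm__("addl %1, %0\n\t"
            "addl %1, %0"
            : "+&r"(acc)
            : "r"(x)
            : "cc");
    return acc;
#else
    /* Assumption: on non-x86 targets fall back to plain C. */
    return acc + x + x;
#endif
}
```

The interesting call is one where both arguments are the same value, such as `add_twice(v, v)`: if the two operands were allowed to share a register, the second add would read the already-updated value.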

MFENCE/SFENCE/etc “serialize memory but not instruction execution”?

假装没事ソ, submitted on 2021-01-27 06:52:11
Question: Intel's System Programming Guide, section 8.3, states regarding MFENCE/SFENCE/LFENCE: "The following instructions are memory-ordering instructions, not serializing instructions. These drain the data memory subsystem. They do not serialize the instruction execution stream." I'm trying to figure out why this matters. In multi-threaded code, writes/reads to memory are exactly what need to happen in a well-defined order. Of course, the order which I/O happens in might matter, but I/O

Which is generally faster to test for zero in x86 ASM: “TEST EAX, EAX” versus “TEST AL, AL”?

喜你入骨, submitted on 2021-01-27 06:28:43
Question: Which is generally faster to test the byte in AL for zero / non-zero: TEST EAX, EAX or TEST AL, AL? Assume a previous "MOVZX EAX, BYTE PTR [ESP+4]" instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about. So AL=EAX and there are no partial-register penalties for reading EAX. Intuitively just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access

How can I determine what architectures gcc supports?

让人想犯罪 __, submitted on 2021-01-27 05:51:49
Question: GCC supports a -march switch that lets you specify the architecture you are targeting. It allows the compiler to tune instruction sequences for that platform and to use instructions that are available on the platform but not on the "default" or base version of the architecture. For example, -march=skylake will tell the compiler to target Skylake CPUs, including using instruction sets available on Skylake such as AVX2. How can I tell what values for -march the local
