x86-64

Error: use of overloaded operator '[]' is ambiguous while building for i386

a 夏天 posted on 2021-02-09 08:29:32
Question: Consider the following code:

    #include <stdio.h>
    #include <stdint.h>

    class test_class {
    public:
        test_class() {}
        ~test_class() {}
        const int32_t operator[](uint32_t index) const { return (int32_t)index; }
        operator const char *() const { return "Hello World"; }
    };

    int main(void) {
        test_class tmp;
        printf("%d\n", tmp[3]);
        return 0;
    }

When I use the command clang++ -arch i386 test.cc to build this code, it yields the following on clang++ (Apple LLVM version 9.1.0 (clang-902.0.39.1)): test.cc:24:21:
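A possible workaround, sketched here and not taken from the original post: on i386 the index 3 is an int, which matches the built-in pointer subscript (reached through operator const char *) exactly, while the member operator[] needs an int-to-uint32_t conversion, so neither candidate wins. Passing an index that is already uint32_t makes the member operator the better match on every argument. The helper name element_at is hypothetical.

```cpp
#include <cstdint>

class test_class {
public:
    // Same two members as in the question, minus the empty ctor/dtor.
    int32_t operator[](uint32_t index) const { return static_cast<int32_t>(index); }
    operator const char *() const { return "Hello World"; }
};

// Hypothetical helper: the parameter type already is uint32_t, so
// obj[index] needs no conversion for the member operator[] and the
// built-in subscript via const char * is no longer equally good.
inline int32_t element_at(const test_class &obj, uint32_t index) {
    return obj[index];
}
```

At the call site, writing tmp[3u] or element_at(tmp, 3) instead of tmp[3] should avoid the ambiguity on both i386 and x86-64.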

How can memory destination BTS be significantly slower than load / BTS reg,reg / store?

拜拜、爱过 posted on 2021-02-09 04:37:06
Question: In the general case, how can an instruction that can take memory or register operands ever be slower with memory operands than mov + mov -> instruction -> mov + mov? Based on the throughput and latency found in Agner Fog's instruction tables (looking at Skylake in my case, p238), I see the following numbers for the btr/bts instructions:

    instruction  operands  uops fused domain  uops unfused domain  latency  throughput
    mov          r,r       1                  1                    0-1      .25
    mov          m,r       1                  2                    2        1
    mov          r,m       1                  1                    2        .5
    ...
    bts/btr      r,r       1                  1
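One factor that may be relevant here (an assumption, since the excerpt is cut off before any answer): with a register bit index, the memory forms of bts/btr address a bit string, so a signed index can select a bit outside the dword at the given address, whereas the register form simply masks the index. The sketch below emulates both behaviours in C++; the function names are hypothetical and it models semantics only, not timing.

```cpp
#include <cstdint>

// Emulates `bts r32, r32`: the index is taken modulo 32.
// Returns the previous value of the bit (what the CPU puts in CF).
inline bool bts_reg_emulated(uint32_t &value, uint32_t bit_index) {
    const uint32_t mask = 1u << (bit_index & 31u);
    const bool old_bit = (value & mask) != 0;
    value |= mask;
    return old_bit;
}

// Emulates `bts dword ptr [base], r32`: the signed index selects the
// dword at base + floor(bit_index / 32), then bit (bit_index mod 32).
// The caller must make sure that dword is valid memory.
inline bool bts_mem_emulated(uint32_t *base, int32_t bit_index) {
    int32_t word = bit_index / 32;  // truncates toward zero
    int32_t bit = bit_index % 32;   // may be negative
    if (bit < 0) {                  // convert to floor division
        bit += 32;
        word -= 1;
    }
    uint32_t *target = base + word;
    const uint32_t mask = 1u << bit;
    const bool old_bit = (*target & mask) != 0;
    *target |= mask;
    return old_bit;
}
```

For example, with base pointing at the middle element of a three-dword array, an index of 37 touches the following dword and an index of -1 touches the preceding one, which a plain load / bts reg,reg / store sequence on a single dword would never do.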

Performance optimisations of x86-64 assembly - Alignment and branch prediction

故事扮演 posted on 2021-02-08 19:50:37
Question: I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE2 instructions. So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise further. For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance. And there’s absolutely

How to write into XMM Registers in LLDB

别等时光非礼了梦想. posted on 2021-02-08 19:38:04
Question: I am trying to read and write register values in Python using the LLDB API. For the general-purpose registers, I have been using frame.register['register name'].value to read and write register values, which works for me. However, when I got to the floating-point registers, I found that this no longer works, as some of the registers, such as the XMM registers, do not have a value attribute, e.g. frame.register['xmm0'].value returns None. I have looked into

pcmpestri character units and countdown - x86-64 asm

|▌冷眼眸甩不掉的悲伤 posted on 2021-02-08 11:12:14
Question: I’m trying to write a minimal loop around pcmpestri in x86-64 asm (actually inline asm embedded in Dlang using the GDC compiler). There are a couple of things that I don’t understand. If you are using pcmpestri with two pointers to strings, are the lengths of the strings in rax and rdx? If so, what are the units? Is the count always in bytes, or in chars, where 1 count = 2 bytes for uwords? Does pcmpestri check for short strings, i.e. len str1 or str2 < 16 bytes, or 8 uwords if uwords? Does

How do you make a 8 byte call in x64 assembly? [duplicate]

对着背影说爱祢 posted on 2021-02-08 10:57:20
Question: This question already has answers here: Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code (1 answer); How to execute a call instruction with a 64-bit absolute address? (1 answer); Call an absolute pointer in x86 machine code (2 answers). Closed 8 months ago. I am trying to hook a function in a 64-bit process; the relative jump is over 4 bytes, so I can't do it via normal methods. Is there any way to jump 8 bytes, relative or absolute? Cheers if any
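A minimal sketch of one commonly used encoding for an absolute 64-bit jump, offered as an illustration and not as the answer given in the linked posts: the 6-byte instruction FF 25 00 00 00 00 (jmp qword ptr [rip+0]) followed directly by the 8-byte target address, 14 bytes in total. The helper name make_abs_jmp is hypothetical, and actually patching a live function (page protection, instruction boundaries, thread safety) is outside the scope of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: builds the bytes of
//     jmp qword ptr [rip+0]   ; FF 25 00 00 00 00
//     dq  target              ; 8-byte absolute address, little-endian
// The CPU loads the jump target from the 8 bytes right after the
// instruction, so no register is clobbered.
inline std::array<uint8_t, 14> make_abs_jmp(uint64_t target) {
    std::array<uint8_t, 14> stub = {{0xFF, 0x25, 0x00, 0x00, 0x00, 0x00}};
    for (std::size_t i = 0; i < 8; ++i) {
        // Write the address byte by byte, least significant first.
        stub[6 + i] = static_cast<uint8_t>((target >> (8 * i)) & 0xFF);
    }
    return stub;
}
```

The resulting 14 bytes would then be copied over the start of the hooked function. An alternative with a similar footprint is mov rax, imm64 followed by jmp rax, at the cost of overwriting rax.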