x86-64 | 易学教程

Error: use of overloaded operator '[]' is ambiguous while building for i386

Question: Consider the following code:

    #include <stdio.h>
    #include <stdint.h>

    class test_class
    {
    public:
        test_class() {}
        ~test_class() {}

        const int32_t operator[](uint32_t index) const
        {
            return (int32_t)index;
        }

        operator const char *() const
        {
            return "Hello World";
        }
    };

    int main(void)
    {
        test_class tmp;
        printf("%d\n", tmp[3]);
        return 0;
    }

When I build this code with the command clang++ -arch i386 test.cc, clang++ (Apple LLVM version 9.1.0 (clang-902.0.39.1)) yields the following: test.cc:24:21:
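The excerpt stops before the diagnostic, so the following is a hedged reading rather than the original answer. On i386, `ptrdiff_t` is `int`, so the built-in subscript on `const char *` (reached through the conversion operator) matches the literal `3` exactly, while the member `operator[]` needs an `int` to `uint32_t` conversion. Each candidate is better on one operand and worse on the other, which is what makes the call ambiguous. One possible workaround, assuming C++11 is available, is to mark the conversion operator `explicit`. The helper `subscript_value` below is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

class test_class {
public:
    int32_t operator[](uint32_t index) const {
        return static_cast<int32_t>(index);
    }

    // `explicit` (C++11) keeps this conversion out of implicit conversions,
    // so the built-in `const char *` subscript is no longer a candidate
    // when overload resolution looks at `tmp[3]`.
    explicit operator const char *() const { return "Hello World"; }
};

// Hypothetical helper: exercises the subscript that was ambiguous before.
int32_t subscript_value() {
    test_class tmp;
    assert(tmp[3] == 3);  // resolves to the member operator[]
    return tmp[3];
}
```

The string is still reachable through an explicit cast such as `static_cast<const char *>(tmp)`. Writing the index as `3u` at the call site is another option that leaves the class unchanged.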

How can memory destination BTS be significantly slower than load / BTS reg,reg / store?

Question: In the general case, how can an instruction that can take memory or register operands ever be slower with memory operands than mov + mov -> instruction -> mov + mov? Based on the throughput and latency found in Agner Fog's instruction tables (looking at Skylake in my case, p238), I see the following numbers for the btr/bts instructions:

    instruction  operands  uops fused domain  uops unfused domain  latency  throughput
    mov          r,r       1                  1                    0-1      .25
    mov          m,r       1                  2                    2        1
    mov          r,m       1                  1                    2        .5
    ...
    bts/btr      r,r       1                  1
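A likely explanation, not taken from the excerpt: according to the Intel manual, BTS with a memory destination and a register bit index treats memory as a bit string. The index is signed and is not masked to the operand size, so it can select a bit far away from the base address, whereas the register form masks the index. The extra address arithmetic is a plausible reason for the memory form costing more uops. The sketch below models the two behaviours in plain C++. `bts_reg` and `bts_mem` are hypothetical names, and the byte-granular access is a simplification of what the hardware does.

```cpp
#include <cassert>
#include <cstdint>

// Register-destination form: the bit index is taken modulo the operand
// size (64 here). Returns the old bit value, which the CPU puts in CF.
bool bts_reg(uint64_t &dst, uint64_t bit_index) {
    const uint64_t mask = uint64_t{1} << (bit_index & 63);
    const bool old_bit = (dst & mask) != 0;
    dst |= mask;
    return old_bit;
}

// Memory-destination form with a register index: the signed index is NOT
// masked. It picks a byte relative to `base` (floor division by 8) and
// then a bit inside that byte. The caller must own the addressed memory.
bool bts_mem(uint8_t *base, int64_t bit_index) {
    const int64_t byte_offset = bit_index >= 0
                                    ? bit_index / 8
                                    : -((-bit_index + 7) / 8);
    const int bit_in_byte = static_cast<int>(bit_index - byte_offset * 8);
    assert(bit_in_byte >= 0 && bit_in_byte < 8);

    uint8_t *target = base + byte_offset;
    const uint8_t mask = static_cast<uint8_t>(1u << bit_in_byte);
    const bool old_bit = (*target & mask) != 0;
    *target = static_cast<uint8_t>(*target | mask);
    return old_bit;
}
```

With index 65, `bts_reg` sets bit 1 of the register, while `bts_mem` sets bit 1 of the byte at offset 8. When the index is known to be in range, the load / `bts reg,reg` / store sequence avoids the bit-string addressing entirely.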

Performance optimisations of x86-64 assembly - Alignment and branch prediction

Question: I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE-2 instructions. So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more. For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance. And there’s absolutely

How to write into XMM Registers in LLDB

Question: I am trying to read and write values from registers in Python using the LLDB API. For the general-purpose registers, I have been using frame.register['register name'].value to read and write register values, which works for me. However, when I moved on to the floating-point registers, I found that this no longer works, because some of the registers, such as the XMM registers, do not have a value attribute, e.g. frame.register['xmm0'].value returns None. I have looked into

pcmpestri character units and countdown - x86-64 asm

Question: I’m trying to write a minimal loop around pcmpestri in x86-64 asm (actually inline asm embedded in Dlang using the GDC compiler). There are a couple of things that I don’t understand. If you are using pcmpestri with two pointers to strings, are the lengths of the strings in rax and rdx? If so, what are the units: a count in bytes always, or a count in chars, where 1 count = 2 bytes for uwords? Does pcmpestri check for short strings, i.e. len str1 or str2 < 16 bytes, or 8 uwords if uwords? Does
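As a hedged note on the length questions, based on the Intel manual rather than on the excerpt: the explicit lengths are read from EAX and EDX (RAX and RDX with a REX.W prefix), they count elements rather than bytes, and the CPU uses the absolute value saturated to 16 for byte elements or 8 for word elements. Elements past that length are treated as invalid, which is how short strings are handled. The sketch below models only this length rule. `effective_length` is a hypothetical helper, not an intrinsic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of how an explicit-length string compare interprets
// the value in a length register.
//   reg_value     : signed value found in the length register
//   word_elements : true for 16-bit elements, false for 8-bit elements
// Returns the number of valid elements the instruction will consider.
int effective_length(int64_t reg_value, bool word_elements) {
    const int64_t max_elements = word_elements ? 8 : 16;

    // Saturate first, so the negation below can never overflow.
    if (reg_value >= max_elements || reg_value <= -max_elements) {
        return static_cast<int>(max_elements);
    }

    const int64_t magnitude = reg_value < 0 ? -reg_value : reg_value;
    assert(magnitude < max_elements);
    return static_cast<int>(magnitude);
}
```

For example, a length of 40 in byte mode is treated as 16, and a length of -3 in word mode is treated as 3 words (6 bytes). A loop over a long string would therefore typically advance by 16 bytes per iteration and decrease the remaining element count accordingly.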

How do you make a 8 byte call in x64 assembly? [duplicate]

Question: This question already has answers here: Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code (1 answer); How to execute a call instruction with a 64-bit absolute address? (1 answer); Call an absolute pointer in x86 machine code (2 answers). Closed 8 months ago. I am trying to hook a function in a 64-bit process. The relative jump needs more than 4 bytes, so I can't do it via the normal methods. Is there any way to make an 8-byte relative or absolute jump? Cheers if any
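For context, x86-64 has no jump or call with a 64-bit relative displacement, so hooks usually fall back on an indirect absolute jump. A minimal sketch of one common pattern, assuming 14 bytes can be overwritten at the hook site: `jmp qword ptr [rip+0]` (bytes `FF 25 00 00 00 00`) followed directly by the 8-byte target address. `make_abs_jmp` is a hypothetical helper that only builds the bytes. Writing them into a live process (page protection, instruction boundaries, thread safety) is outside this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: builds a 14-byte stub that jumps to an absolute
// 64-bit address without clobbering any register.
//   FF 25 00 00 00 00   jmp qword ptr [rip+0]
//   <8 bytes>           target address, little-endian
std::array<uint8_t, 14> make_abs_jmp(uint64_t target) {
    std::array<uint8_t, 14> stub = {0xFF, 0x25, 0x00, 0x00, 0x00, 0x00};

    // Store the address byte by byte so the result does not depend on
    // the endianness of the machine that builds the stub.
    for (std::size_t i = 0; i < 8; ++i) {
        stub[6 + i] = static_cast<uint8_t>((target >> (8 * i)) & 0xFF);
    }

    assert(stub[0] == 0xFF && stub[1] == 0x25);
    return stub;
}
```

An alternative is `mov rax, imm64` followed by `jmp rax` (or `call rax`), which is 12 bytes but overwrites RAX. If the target lies within ±2 GiB of the hook site, the ordinary 5-byte `jmp rel32` remains the shorter choice.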