Instructions to copy the low byte from an int to a char: Simpler to just do a byte load?

Question

I was reading a textbook, and it has an exercise that asks you to write x86-64 assembly code based on this C code:

//Assume that the values of sp and dp are stored in registers %rdi and %rsi

int *sp;
char *dp;
*dp = (char) *sp;

and the answer is:

//first approach

movl (%rdi), %eax    //Read 4 bytes
movb %al, (%rsi)     //Store low-order byte

I can understand it, but I'm wondering: can't we do something simpler in the first place, such as:

//second approach

movb (%rdi), %al     //Read one byte only rather than all four bytes
movb %al, (%rsi)     //Store low-order byte

Isn't the second approach more concise and straightforward than the first? The first seems a little unnecessary, since we only care about the low byte of the int that %rdi points to and aren't really interested in its upper 3 bytes.

Answer 1:

Yes, your byte-load way is correct but it's not actually more efficient on most CPUs.
TL:DR: Generally avoid writing to byte or 16-bit registers when you have equally convenient options that don't do that.
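To make the "correct" part concrete: x86 is little-endian, so the low byte of an int lives at the int's own address, and a one-byte load from that address sees the same value as a four-byte load followed by truncation. Here is a minimal C sketch of the two approaches. The function names are made up for illustration, unsigned char is used to sidestep implementation-defined narrowing to a signed char, and the asm in the comments is only roughly what each function corresponds to.

```c
#include <assert.h>
#include <string.h>

/* First approach: load the whole int, keep the low byte.
   Roughly: movl (%rdi), %eax ; movb %al, (%rsi) */
void copy_via_dword_load(const int *sp, unsigned char *dp) {
    *dp = (unsigned char)*sp;
}

/* Second approach: load only the byte at the int's own address.
   Roughly: movb (%rdi), %al ; movb %al, (%rsi) */
void copy_via_byte_load(const int *sp, unsigned char *dp) {
    memcpy(dp, sp, 1);
}

/* Hypothetical helper: returns 1 on a little-endian host such as x86-64. */
int host_is_little_endian(void) {
    const int probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Hypothetical self-check: on a little-endian host both approaches agree. */
void check_both_approaches_agree(int value) {
    unsigned char a = 0, b = 0;
    copy_via_dword_load(&value, &a);
    copy_via_byte_load(&value, &b);
    if (host_is_little_endian())
        assert(a == b);
}
```

On a big-endian machine the byte at the int's address would be the high byte, so the second function would not be equivalent there. That is not a concern on x86-64.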

(And BTW, the suggestions you got in comments were both wrong: x86 is little-endian, and store-forwarding problems are very unlikely. They might be possible on some older CPUs, so IDK, that one might not be totally wrong.)

Writing a partial register (narrower than 32-bit, so it doesn't implicitly zero-extend into the full register) has a false dependency on the old value on some microarchitectures. For example, movb (%rdi), %al decodes on Intel Haswell/Skylake as a micro-fused load+merge ALU operation. (See the Stack Overflow Q&A "Why doesn't GCC use partial registers?". There is also a separate Q&A with a lot of detail for Intel Haswell/Skylake specifically.)

It would be more efficient to movzbl (%rdi), %eax to just do a zero-extending byte load.

Or since we can assume that the last store to (%rdi) was dword or wider (so store-forwarding will be efficient if it's still in flight), it is actually most efficient to do a dword load with movl (%rdi), %eax. That avoids possible partial register penalties, and has smaller machine-code size than movzbl (smaller is better, as a tie-break between otherwise equal options in terms of uops). Also, some old AMD CPUs run movzbl slightly less efficiently than a dword mov load. (Like the zero-extending needs an ALU port).

(Most CPUs run movzbl "for free" in a load port, some also run movsbl sign-extension in a load port without needing any ALU port, notably Intel Sandybridge-family.)

Store forwarding is not a problem: all (?) current CPUs can forward efficiently from a dword store to a byte reload of any of the individual bytes, and definitely the low byte, especially when the dword store is aligned (like a C int will be). See https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/

Of course, if you have a use for the char value sign- or zero-extended into a register later, load it that way.
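As a rough sketch of what "load that way" means in C terms, assuming a little-endian host: reading the byte through a signed char or unsigned char pointer and widening the result is the kind of source a compiler would typically turn into a single movsbl or movzbl load. The function names here are hypothetical.

```c
#include <assert.h>

/* Sign-extending byte load; typically compiles to movsbl (%rdi), %eax. */
int load_low_byte_signed(const int *sp) {
    return *(const signed char *)sp;
}

/* Zero-extending byte load; typically compiles to movzbl (%rdi), %eax. */
unsigned load_low_byte_unsigned(const int *sp) {
    return *(const unsigned char *)sp;
}

/* Hypothetical check, valid only on a little-endian host:
   a low byte of 0x80 reads back as -128 or as 128. */
void check_extension(void) {
    const int value = 0x12345680;
    assert(load_low_byte_signed(&value) == -128);
    assert(load_low_byte_unsigned(&value) == 128u);
}
```

Either way the destination is a full 32-bit register, so there is no partial-register merge to worry about.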

Or even better, as @Ira points out, if you're optimizing this code along with something that stored to *sp, you can ideally just use whatever is in the register and optimize away the store/reload. (It's undefined behaviour in C for any other thread to asynchronously change that memory because it's int *, not volatile or _Atomic int*.)
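A minimal sketch of that idea, assuming a hypothetical function that both stores to *sp and then copies its low byte to *dp. The second version reuses the value that is already in a register instead of reloading it from memory. The register names in the comment assume the x86-64 System V calling convention (sp in %rdi, dp in %rsi, value in %edx).

```c
#include <assert.h>

/* As written in the source: store to *sp, then reload it for the byte copy. */
void store_then_copy_naive(int *sp, unsigned char *dp, int value) {
    *sp = value;
    *dp = (unsigned char)*sp;
}

/* Hand-optimized: no reload, just reuse the register. Roughly:
   movl %edx, (%rdi) ; movb %dl, (%rsi) */
void store_then_copy_forwarded(int *sp, unsigned char *dp, int value) {
    *sp = value;
    *dp = (unsigned char)value;
}

/* Hypothetical self-check: both versions leave memory in the same state. */
void check_same_result(int value) {
    int s1 = 0, s2 = 0;
    unsigned char d1 = 0, d2 = 0;
    store_then_copy_naive(&s1, &d1, value);
    store_then_copy_forwarded(&s2, &d2, value);
    assert(s1 == s2 && d1 == d2);
}
```

An optimizing compiler is allowed to turn the first version into the second by itself, because nothing else may legally modify *sp between the store and the reload.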

Answer 2:

(OP changed the question from a more general one with an example to a very specific one, which might explain why this answer looks funny with respect to the current question.)

The more general answer to your question is that for any operation in an HLL that you intend to compile to machine code, there are usually many ways to write machine instructions to do just that operation.

A good compiler will know many of these variants. Its problem is to choose, for all the operations in your program, the generally more efficient variant for each operator, in such a way that they stitch together into a working program. For instance, if one HLL operation leaves its result in a register, and a successor HLL operation is supposed to use that result, then the compiler must choose implementations of the two operators such that the first leaves the value in a register and the second uses that same register as an input, or the program will not work.

When you consider that a real program consists of thousands of HLL operators, and their individual implementations must all be consistent, you can see the compiler has a very complicated job making sure everything fits together and it is reasonably efficient.

Source: https://stackoverflow.com/questions/62787483/instructions-to-copy-the-low-byte-from-an-int-to-a-char-simpler-to-just-do-a-by

Tags

assembly

x86-64

micro-optimization

instructions