Question
I have read that when a CPU reads from memory, it reads a word-sized chunk at once (like 4 bytes or 8 bytes). How can the CPU achieve something like:
mov BYTE PTR [rbp-20], al
where it copies only one byte of data from al to the stack (given that the data bus is, say, 64 bits wide)? It would be great if someone could provide information on how this is implemented at the hardware level.
Also, as we all know, when the CPU executes a program it has a program counter or instruction pointer that points to the address of the next instruction, and the control unit fetches that instruction into the memory data register and executes it later. Let's say:
0: b8 00 00 00 00 mov eax,0x0
is 5 bytes of machine code (on x86-64) and
0: 31 c0 xor eax,eax
is 2 bytes long; instructions vary in length.
If the control unit wants to fetch these instructions, does it:
- fetch 8 bytes of machine code (which might contain multiple instructions) and then execute only part of it,
- fetch an instruction shorter than 8 bytes (still reading 8 bytes from memory, but ignoring the other bytes), or
- rely on the instructions already being padded (by the compiler or something)?
And what about instructions like:
0: 48 b8 5c 8f c2 f5 28 movabs rax,0x28f5c28f5c28f5c
7: 5c 8f 02
which exceed the word size? How are they handled by the CPU?
Answer 1:
x86 is not a word-oriented architecture at all. Instructions are variable length with no alignment.
"Word size" is not a meaningful term on x86; some people may use it to refer to the register width, but instruction fetch / decode has nothing to do with the integer registers.
In practice on most modern x86 CPUs, instruction fetch from the L1 instruction cache happens in aligned 16-byte or 32-byte fetch blocks. Later pipeline stages find instruction boundaries and decode up to 5 instructions in parallel (e.g. Skylake). See David Kanter's write-up of Haswell for a block diagram of the front-end showing 16-byte instruction fetch from L1i cache.
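To make the boundary-finding concrete, here is a toy C sketch (my own illustration, not how any real front-end is built) of what a length pre-decoder does with a 16-byte fetch block. It deliberately recognizes only the three encodings from the question; a real x86 decoder must also handle prefixes, ModRM/SIB bytes, displacements, and so on.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy length decoder: knows only the three encodings from the question. */
    static int insn_length(const uint8_t *p)
    {
        if (p[0] == 0x31) return 2;           /* xor r/m32, r32 + ModRM    */
        if (p[0] == 0xB8) return 5;           /* mov eax, imm32            */
        if (p[0] == 0x48 && p[1] == 0xB8)     /* REX.W + mov rax, imm64    */
            return 10;
        return -1;                            /* unknown to this toy model */
    }

    int main(void)
    {
        /* xor eax,eax ; mov eax,0 ; start of movabs rax,0x28f5c28f5c28f5c */
        const uint8_t fetch_block[16] = {
            0x31, 0xC0,
            0xB8, 0x00, 0x00, 0x00, 0x00,
            0x48, 0xB8, 0x5C, 0x8F, 0xC2, 0xF5, 0x28, 0x5C, 0x8F,
        };
        int off = 0;
        while (off < 16) {
            int len = insn_length(&fetch_block[off]);
            if (len < 0) break;
            if (off + len > 16) {   /* instruction straddles the block;   */
                printf("insn at +%d continues into the next fetch block\n",
                       off);        /* finished with the next block's bytes */
                break;
            }
            printf("insn boundary at +%d, length %d\n", off, len);
            off += len;
        }
        return 0;
    }

With the question's three instructions packed into one block, the 10-byte movabs starts at offset 7 and straddles into the next fetch block — something the front-end has to handle routinely, since nothing pads or aligns x86 instructions.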
But note that modern x86 CPUs also use a decoded-uop cache so they don't have to deal with the hard-to-decode x86 machine code for code that runs very frequently (e.g. inside a loop, even a large loop). Dealing with variable-length unaligned instructions is a significant bottleneck on older CPUs.
See Can modern x86 hardware not store a single byte to memory? for more about how the cache absorbs stores to normal memory regions (MTRR and/or PAT set to WB = Write-Back memory type).
The logic that commits stores from the store buffer to L1 data cache on modern Intel CPUs handles any store of any width as long as it's fully contained within one 64-byte cache line.
Non-x86 CPUs that are more word-oriented (like ARM) commonly use a read-modify-write of a cache word (4 or 8 bytes) to handle narrow stores. See Are there any modern CPUs where a cached byte store is actually slower than a word store? But modern x86 CPUs do spend the transistors to make cached byte stores or unaligned wider stores exactly as efficient as aligned 8-byte stores into cache.
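Here is a minimal C sketch of that contrast, under my own simplified model of a cache line (not real RTL): per-byte write enables let a byte store overwrite just one byte of the line, while a more word-oriented design does a read-modify-write of the containing word.

    #include <stdint.h>

    #define LINE_SIZE 64

    struct cache_line { uint8_t data[LINE_SIZE]; };

    /* Store commit with per-byte write enables: any store fully contained
     * in the line just overwrites the affected bytes; the old data is
     * never read. (Modern x86 L1d behaves like this, per the answer.) */
    static void commit_store(struct cache_line *line, unsigned offset,
                             const uint8_t *bytes, unsigned width)
    {
        for (unsigned i = 0; i < width; i++)   /* offset + width <= 64 */
            line->data[offset + i] = bytes[i];
    }

    /* Contrast: word read-modify-write, as a narrow store may be handled
     * on a word-oriented design without per-byte enables.
     * (Little-endian byte numbering assumed.) */
    static void commit_store_rmw(uint64_t *words, unsigned byte_off, uint8_t v)
    {
        uint64_t w = words[byte_off / 8];              /* read   */
        unsigned sh = (byte_off % 8) * 8;
        w = (w & ~((uint64_t)0xFF << sh))              /* modify */
          | ((uint64_t)v << sh);
        words[byte_off / 8] = w;                       /* write  */
    }

    int main(void)
    {
        struct cache_line line = {{0}};
        uint8_t al = 0x5C;
        commit_store(&line, 44, &al, 1);  /* a 1-byte store, like the mov
                                             in the question              */
        uint64_t cached[8] = {0};
        commit_store_rmw(cached, 44, al); /* same store, word-RMW style   */
        return 0;
    }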
given that the data bus is, say, 64 bits wide
Modern x86 has memory controllers built into the CPU. That DDR[1234] SDRAM bus has 64 data lines, but a single read or write command initiates a burst of 8 transfers, transferring 64 bytes of data. (Not coincidentally, 64 bytes is the cache line size for all existing x86 CPUs.)
For a store to an uncacheable memory region (i.e. if the CPU is configured to treat that address as uncacheable even though it's backed by DRAM), a single-byte or other narrow store is possible using the DQM byte-mask signals, which tell the DRAM which of the bytes in the burst are actually to be stored.
(Or if that's not supported (which may be the case), the memory controller may have to read the old contents and merge, then store the whole line. Either way, 4-byte or 8-byte chunks are not the significant unit here. DDR burst transfers can be cut short, but only to 32 bytes down from 64. I don't think an 8-byte aligned write is actually very special at the DRAM level. It is guaranteed to be "atomic" in the x86 ISA, though, even on uncacheable MMIO regions.)
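As a back-of-the-envelope model of the burst and masking arithmetic (a toy sketch under the assumptions above; real DDR controllers, timing, and signal protocols are far more involved): 8 transfers of 8 bytes each move 64 bytes per burst, and a per-byte mask decides which lanes are actually written.

    #include <stdint.h>
    #include <stdio.h>

    #define BUS_BYTES   8   /* 64 data lines = 8 bytes per transfer (beat) */
    #define BURST_LEN   8   /* DDR burst of 8 transfers                    */
    #define BURST_BYTES (BUS_BYTES * BURST_LEN)  /* = 64, one cache line   */

    /* Toy model: write one burst to DRAM, honoring a per-byte mask
     * (the DQM/DM signals). mask[i] == 1 means "actually store byte i". */
    static void dram_burst_write(uint8_t *dram, uint32_t burst_addr,
                                 const uint8_t data[BURST_BYTES],
                                 const uint8_t mask[BURST_BYTES])
    {
        for (int beat = 0; beat < BURST_LEN; beat++)
            for (int lane = 0; lane < BUS_BYTES; lane++) {
                int i = beat * BUS_BYTES + lane;
                if (mask[i])
                    dram[burst_addr + i] = data[i];
            }
    }

    int main(void)
    {
        static uint8_t dram[1 << 16];
        uint8_t data[BURST_BYTES] = {0}, mask[BURST_BYTES] = {0};

        /* Uncacheable single-byte store to 0x1001: the whole burst still
         * happens on the wire, but only one byte lane is enabled. */
        data[1] = 0xAB;
        mask[1] = 1;
        dram_burst_write(dram, 0x1000, data, mask);  /* burst-aligned base */
        printf("dram[0x1001] = 0x%02X\n", dram[0x1001]);
        return 0;
    }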
A store to an uncacheable MMIO region will result in a PCIe transaction of the appropriate size, up to 64 bytes.
Inside the CPU core, the bus between the data cache and the execution units can be 32 or 64 bytes wide (or 16 bytes on current AMD). And transfers of cache lines between L1d and L2 cache are also done over a 64-byte-wide bus, on Haswell and later.
Answer 2:
The CPU never (or rarely) talks to the data bus and the memory directly -- instead, the data bus transfers data between memory and the cache, and the CPU talks to the cache. The CPU's data-cache interface can write a single byte in a cache line, or multiple bytes. So with your
mov BYTE PTR [rbp-20], al
example: to execute this, the CPU will first ensure that the line containing that byte is in the data cache (which likely involves transferring one or more bus-sized blocks from memory), and then will write to that byte.
Instructions are decoded from the instruction cache, which is optimized to stream data into the decoders, so they can deal with unaligned instructions that cross word boundaries.
Answer 3:
The bus at the edge of the CPU is these days probably 64 bits wide, but either way it might be 16, 32, 64, etc. Designs can and do vary, but the kind of thing you are asking about works as follows: for a read, the processor issues a bus-sized read, so for address 0x1001 a read of 0x1000 happens in some form (sometimes the memory controller, cache controller, or whatever is on the other side of this bus is the one that strips off the lower address bits). The next layer ideally does a word- or bus-sized read. You may or may not have a cache here; it doesn't matter with respect to this question. If there is one and it hits, that width is read and sent back to the CPU; on a miss, some number of units, generally many times the bus width, is read as a cache line, and the requested word (or whatever the unit is) is sent back to the CPU. For a read, the CPU generally isolates the sub-bus-width bytes from that read and consumes them, ignoring the rest. Note that this is not wasteful; it's the opposite.
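A toy sketch of that read path, under this answer's assumptions (a 32-bit bus that only performs aligned, bus-width reads, with the low address bits stripped off and the CPU isolating the byte it asked for):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model: the bus only does aligned, bus-width reads; the
     * controller strips the low address bits, and the CPU isolates
     * the byte it wanted. Little-endian layout assumed. */
    static uint32_t bus_read32(const uint8_t *mem, uint32_t addr)
    {
        addr &= ~3u;                      /* strip the lower address bits */
        return (uint32_t)mem[addr]
             | (uint32_t)mem[addr + 1] << 8
             | (uint32_t)mem[addr + 2] << 16
             | (uint32_t)mem[addr + 3] << 24;
    }

    static uint8_t cpu_read_byte(const uint8_t *mem, uint32_t addr)
    {
        uint32_t word = bus_read32(mem, addr);     /* full-width read     */
        return (word >> ((addr & 3) * 8)) & 0xFF;  /* isolate wanted byte */
    }

    int main(void)
    {
        uint8_t mem[16] = {0x11, 0x22, 0x33, 0x44};
        printf("byte at 0x1: 0x%02X\n", cpu_read_byte(mem, 0x1)); /* 0x22 */
        return 0;
    }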
Writes are where the performance problem is. If you write something unaligned, or certainly anything less than a full bus width, you need to indicate valid versus invalid bits or byte lanes to the memory controller, usually byte lanes in some form. One way is to have a byte mask, so for a 32-bit bus you would have 4 mask bits, one for each of the 8-bit byte lanes going across the bus at once. The memory controller or cache controller then needs to do a read-modify-write (there are exceptions, but in this case just roll with it). So a write of one byte to 0x1001 leaves the CPU on this inner/close bus with that address (or 0x1000) as the address, a byte mask of 0b0010, and the data value in the form of a 32-bit number of which only the second byte lane has valid bits; the others can be garbage, zeros, or whatever.
For the kind of systems a question like this is asked about, the outer layers of memory are accessed in these wide units; byte enables are possible, but assume they are not used. The cache itself is likely made up of wide SRAMs, and 32 bits would be sane in this case, so writing a single byte location in the cache SRAM requires a read of those 32 bits, modification of the 8 bits that are changing, and then a write of the SRAM location back. This has absolutely nothing to do with cache write-through or write-back policies; those are completely irrelevant here. This is the inner workings of the SRAM buried deep in the cache. It wastes chip real estate to build a cache out of 8-bit-wide memories; it also multiplies the number of signals, costing some of that wasted space to route them, plus the logic to control them, all wasted. So a wider memory will be used in any somewhat sane design, possibly more like 39 or 40 bits wide to have some ECC on those SRAMs.
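And a matching sketch of the write path just described (again a toy model of my own, with little-endian byte lanes assumed): the byte mask selects lane 1 for a one-byte store to 0x1001, and the 32-bit SRAM word is read, patched, and written back.

    #include <stdint.h>

    /* Toy model of the write described above: a one-byte store to 0x1001
     * appears on a 32-bit bus as address 0x1000, byte mask 0b0010, and a
     * 32-bit data word in which only byte lane 1 carries valid bits.
     * The SRAM (or memory controller) does the read-modify-write. */
    static void sram_write32_masked(uint32_t *sram, uint32_t byte_addr,
                                    uint32_t data, unsigned mask4)
    {
        uint32_t w = sram[byte_addr / 4];         /* read the full word   */
        for (unsigned lane = 0; lane < 4; lane++)
            if (mask4 & (1u << lane)) {           /* enabled lanes only   */
                uint32_t m = 0xFFu << (lane * 8);
                w = (w & ~m) | (data & m);        /* modify those 8 bits  */
            }
        sram[byte_addr / 4] = w;                  /* write the word back  */
    }

    int main(void)
    {
        static uint32_t sram[0x2000 / 4];         /* 8 KiB of 32-bit SRAM */
        uint32_t addr = 0x1001;
        unsigned lane = addr & 3;                 /* lane 1               */
        sram_write32_masked(sram, addr & ~3u,     /* address 0x1000       */
                            (uint32_t)0x5C << (lane * 8),
                            1u << lane);          /* mask 0b0010          */
        return 0;
    }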
It is similar if not the same when you don't have a cache, or the cache is not enabled. You can download the AXI documentation from Arm, and you can look up other well-known buses. The inner workings of an x86, where this activity would be visible, really have no business being documented outside Intel or AMD.
An x86 has enough overhead dealing with its instruction set that you shouldn't see the performance hit of these writes. On other architectures with less overhead, you can and will see these hits.
Answer 4:
Caches are discussed in most books on computer architecture. At the level of the question being asked, "Digital Design and Computer Architecture" by Harris & Harris, or something at that level, might suffice.
You're probably looking for a block diagram like the one I enclose below, to quickly understand the pipeline and move on. I am not aware of a book that does that. It took me less than 30 minutes to draw (strictly for fun), so take it for what it's worth. If you discover errors or have other corrections, post them here for future visitors of this page.
Source: https://stackoverflow.com/questions/56436206/how-does-cpu-perform-operation-that-manipulate-data-thats-less-than-a-word-size