How is x86 instruction cache synchronized?


What you are doing is usually referred to as self-modifying code. Intel's platforms (and probably AMD's too) maintain i/d cache coherency for you, as the manual points out (Manual 3A, System Programming):

11.6 SELF-MODIFYING CODE

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated.

But this guarantee holds only as long as the same linear address is used for modifying and fetching, which is not the case for debuggers and binary loaders, since they don't run in the same address space:

Applications that include self-modifying code use the same linear address for modifying and fetching the instruction. Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue.
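
For that cross-modifying case, the fix is exactly such a serializing instruction. Here is a minimal sketch in C, assuming GCC/Clang inline assembly on x86; the function names and parameters are illustrative, not from any source:

    #include <stdint.h>
    #include <string.h>

    /* Serialize the processor: CPUID drains the pipeline and forces a
     * re-fetch of subsequent instructions (Intel SDM, Vol. 3A). */
    static inline void serialize_cpu(void)
    {
        uint32_t eax = 0, ebx, ecx, edx;
        __asm__ volatile ("cpuid"
                          : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                          :: "memory");
        (void)ebx; (void)ecx; (void)edx;  /* silence unused warnings */
    }

    /* Hypothetical helper: patch code through an alias mapping (a
     * different linear address), then serialize before executing it. */
    void patch_and_run(uint8_t *alias, const uint8_t *new_code, size_t len,
                       void (*entry)(void))
    {
        memcpy(alias, new_code, len);  /* write via the other linear address */
        serialize_cpu();               /* resync i-cache and prefetch queue */
        entry();                       /* now safe to execute */
    }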

By contrast, many other architectures, such as PowerPC, always require serializing operations, and they must be issued explicitly (E500 Core Manual):

3.3.1.2.1 Self-Modifying Code

When a processor modifies any memory location that can contain an instruction, software must ensure that the instruction cache is made consistent with data memory and that the modifications are made visible to the instruction fetching mechanism. This must be done even if the cache is disabled or if the page is marked caching-inhibited.

It is interesting to note that PowerPC requires a context-synchronizing instruction even when caches are disabled; I suspect this forces a flush of deeper units in the data path, such as the load/store buffers.
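
On PowerPC the documented recipe is dcbst to push the modified data cache blocks to memory, sync to wait for the stores, icbi to invalidate the stale instruction cache blocks, and isync to context-synchronize. A minimal sketch, assuming GCC inline assembly and 32-byte cache blocks (the block size is an assumption, check your core's manual):

    static void ppc_sync_icache(void *addr, unsigned long len)
    {
        const unsigned long block = 32;   /* assumed cache block size */
        char *p;

        /* Push modified data cache blocks out to memory. */
        for (p = (char *)addr; p < (char *)addr + len; p += block)
            __asm__ volatile ("dcbst 0,%0" : : "r"(p) : "memory");
        __asm__ volatile ("sync" ::: "memory");    /* wait for the stores */

        /* Invalidate any stale copies in the instruction cache. */
        for (p = (char *)addr; p < (char *)addr + len; p += block)
            __asm__ volatile ("icbi 0,%0" : : "r"(p) : "memory");
        __asm__ volatile ("isync" ::: "memory");   /* context-synchronize */
    }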

The code you proposed is unreliable on architectures without snooping or advanced cache-coherency facilities, and therefore likely to fail.

Hope this helps.

It's pretty simple: a write to an address that is present in one of the instruction cache's lines invalidates that line. No "synchronization" is involved.

The CPU handles cache invalidation automatically; you don't have to do anything manually. Software can't reasonably predict what will or will not be in the CPU cache at any point in time, so it's up to the hardware to take care of this. When the CPU sees that you have modified data, it updates its various caches accordingly.

By the way, many x86 processors (that I have worked on) snoop not only the instruction cache but also the pipeline and instruction window, i.e. the instructions that are currently in flight. So self-modifying code can take effect as soon as the very next instruction. Still, you are encouraged to use a serializing instruction like CPUID to ensure that your newly written code will be the code that executes.

I just reached this page in one of my searches and want to share my knowledge of this area of the Linux kernel!

Your code executes as expected, and there are no surprises for me here. The mmap() syscall and the processor's cache-coherency protocol do this trick for you. The flags PROT_READ|PROT_WRITE|PROT_EXEC ask mmap() to set up the iTLB and dTLB entries of the L1 caches and the TLB entry of the L2 cache for this physical page correctly. This low-level, architecture-specific kernel code handles it differently depending on the processor architecture (x86, AMD, ARM, SPARC, etc.). Any kernel bug here will mess up your program!
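
For reference, here is a minimal sketch of the kind of program under discussion, assuming x86-64 Linux; the machine code bytes encode mov eax, 42 followed by ret:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* x86-64 machine code for: mov eax, 42 ; ret */
        static const unsigned char code[] =
            { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

        unsigned char *a = mmap(NULL, 4096,
                                PROT_READ | PROT_WRITE | PROT_EXEC,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (a == MAP_FAILED)
            return 1;

        memcpy(a, code, sizeof code);     /* write the code as data */
        int (*fn)(void) = (int (*)(void))a;
        printf("%d\n", fn());             /* executes it: prints 42 */
        munmap(a, 4096);
        return 0;
    }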

This is just for explanation purposes. Assume that your system is not doing much and that there are no process switches between "a[0]=0b01000000;" and the start of "printf("\n");". Also, assume that you have 1K of L1 iCache and 1K of L1 dCache in your processor, plus some L2 cache in the core. (Nowadays these are on the order of a few MBs.)

  1. mmap() sets up your virtual address space and the iTLB1, dTLB1 and TLB2 entries.
  2. "a[0]=0b01000000;" will actually trap (H/W magic) into kernel code; your physical page will be set up and all processor TLBs will be loaded by the kernel. Then you will be back in user mode, and your processor will load 16 bytes (H/W magic, a[0] to a[3]) into the L1 dCache and L2 cache. The processor will really go to memory again only when you refer to a[4] and so on (ignore prefetching for now!). By the time you complete "a[7]=0b11000011;", your processor has done 2 burst READs of 16 bytes each on the external bus. Still no actual WRITEs to physical memory: all WRITEs so far are happening within the L1 dCache (H/W magic, the processor knows) and the L2 cache, and the DIRTY bit is set for the cache line.
  3. "a[3]++;" will be a STORE instruction in the assembly code, but the processor will store the result only in the L1 dCache and L2; it will not go to physical memory.
  4. Now let's come to the function call "a()". Again, the processor does the instruction fetch from the L2 cache into the L1 iCache, and so on. Because the caches are coherent, the fetch picks up the bytes you just wrote.
  5. The result of this user-mode program will be the same on any Linux under any processor, thanks to the correct implementation of the low-level mmap() syscall and the cache-coherency protocol!
  6. If you are writing this code in an embedded processor environment without the OS assistance of the mmap() syscall, you will hit the problem you are expecting, because you are using neither the H/W mechanism (TLBs) nor a software mechanism (cache maintenance and memory barrier instructions); a portable helper is sketched below.
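
In that situation, a portable option is the GCC/Clang builtin __builtin___clear_cache, which emits whatever cache maintenance and barrier instructions the target needs (a no-op on x86, real work on ARM or PowerPC). A minimal sketch; install_code and its parameters are illustrative names:

    #include <string.h>

    void install_code(unsigned char *buf, const unsigned char *bytes,
                      unsigned long len)
    {
        memcpy(buf, bytes, len);
        /* Flush the d-cache and invalidate the i-cache over the range,
         * with the barriers the architecture requires. */
        __builtin___clear_cache((char *)buf, (char *)buf + len);
    }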