What are some ideas for cross-modifying code that could trigger unexpected behavior on x86 or x86-64 systems, where everything is done correctly in the cross
Think of a processor that has a very long instruction pipeline where registers and memory are only modified in the last pipeline stage. When you write self-modifying code for this processor and modify an instruction in memory that is already present in the pipeline, the modification will have no effect on that in-flight instruction. In this case the behaviour of the program depends on how long the pipeline of the processor is.
To make new processors with longer pipelines behave exactly like older models, Intel processors include a mechanism that flushes (empties) the pipeline if this case is detected. After the flush, the modified code is fetched into the pipeline, so the new processor behaves exactly like the old ones.
A serializing instruction is another way to flush the pipeline. When it reaches the end of the pipeline, the pipeline is flushed and starts fetching again after the serializing instruction.
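As a minimal sketch of what issuing a serializing instruction can look like in C, the helper below executes CPUID, which Intel documents as serializing. It assumes GCC or Clang inline assembly on an x86 target; the name `serialize_with_cpuid` is made up for illustration, and returning the highest basic CPUID leaf is only a convenient side effect, not part of any required protocol.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(__x86_64__) && !defined(__i386__)
#error "This sketch uses the x86 CPUID instruction"
#endif

/* Hypothetical helper: executes CPUID leaf 0. CPUID is a serializing
 * instruction, so instructions after it are fetched afresh. The "memory"
 * clobber also stops the compiler from moving memory accesses across it.
 * Returns EAX from leaf 0, i.e. the highest supported basic CPUID leaf. */
static inline uint32_t serialize_with_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
    return eax;
}
```

The processor that is about to run modified code would call this after it has observed that the modification is complete and before it jumps into the modified bytes.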
So what the erratum is essentially saying is that some processor models do not check whether writes from other processors overwrite instructions that are already executing in their pipeline. The check works only for local writes, not for external writes. But if you insert a serializing instruction, you force the processor to flush the pipeline and everything will behave as expected.
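The handshake this implies can be sketched roughly as follows, along the lines of the sequence Intel documents for cross-modifying code: one thread stores the new bytes and then sets a flag, and the executing thread waits for the flag, serializes, and only then runs the modified code. Everything here is an assumption for illustration: the names (`patcher`, `executor`, `cross_modify_demo`), the two flags, the tiny `mov eax, imm32; ret` stub, and the use of POSIX `mmap` with a page that is writable and executable at the same time, which some systems refuse.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#if !defined(__x86_64__) && !defined(__i386__)
#error "This sketch emits x86 machine code and uses CPUID"
#endif

typedef int (*stub_fn)(void);

static unsigned char *code_page;  /* writable + executable page (assumption) */
static atomic_int executor_ready; /* hypothetical flag: old code has been run */
static atomic_int patch_done;     /* hypothetical flag: new bytes are stored */

/* CPUID is a serializing instruction; "memory" is a compiler barrier. */
static void serialize_with_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
}

static int call_stub(void)
{
    return ((stub_fn)(uintptr_t)code_page)();
}

/* Modifying side: store the new immediate, then publish the flag. */
static void *patcher(void *unused)
{
    const int32_t new_imm = 2;
    (void)unused;
    while (!atomic_load_explicit(&executor_ready, memory_order_acquire)) { }
    memcpy(code_page + 1, &new_imm, sizeof new_imm); /* operand of mov eax */
    atomic_store_explicit(&patch_done, 1, memory_order_release);
    return NULL;
}

/* Executing side: wait for the flag, serialize, then run the new code. */
static void *executor(void *out)
{
    int *results = out;
    results[0] = call_stub(); /* unpatched stub */
    atomic_store_explicit(&executor_ready, 1, memory_order_release);
    while (!atomic_load_explicit(&patch_done, memory_order_acquire)) { }
    serialize_with_cpuid();   /* the step the erratum workaround relies on */
    results[1] = call_stub(); /* patched stub */
    return NULL;
}

/* Returns 0 on success and fills results[0] (before) and results[1] (after). */
int cross_modify_demo(int results[2])
{
    static const unsigned char stub[] = {0xB8, 1, 0, 0, 0, 0xC3}; /* mov eax,1; ret */
    pthread_t exec_thread, patch_thread;
    int rc;
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    code_page = page;
    memcpy(code_page, stub, sizeof stub);
    atomic_store(&executor_ready, 0);
    atomic_store(&patch_done, 0);
    rc = pthread_create(&exec_thread, NULL, executor, results);
    assert(rc == 0);
    rc = pthread_create(&patch_thread, NULL, patcher, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(exec_thread, NULL);
    pthread_join(patch_thread, NULL);
    munmap(page, 4096);
    return 0;
}
```

Note that the executing thread is spinning on the flag, not running the stub, while the bytes change, so this shows the documented safe pattern rather than the erratum itself. The two threads are also not pinned to different cores here, so the operating system may schedule them on the same one.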
To reproduce the behaviour described in the erratum you need to make sure that the code you are modifying from one processor is inside the pipeline of the other processor. Take a look at branch prediction (which decides which code path is inside the pipeline) and synchronization primitives.
The odds that you can reproduce this behavior are very close to zero. First, keep in mind that self- and cross-modifying code is nothing unusual. It happens every day when, say, you use a debugger to set a breakpoint or modify memory, or when a DLL gets loaded and needs to be relocated to a different address.
Even if you intentionally omit the serializing instruction, you'd still have a hard time avoiding one while you tinker with the code of the other processor. Simple things you need, like implementing the synchronization or changing the page protection attributes so you can modify the code, are very likely to go through a code path inside the operating system that serializes.
Furthermore, the errata and the FUD email you quoted are old; they date back to the time when multi-core processors first became commonly available. Intel always documents recommended approaches that work on any processor, including ones that did not have the erratum fixed. Whether current models still actually require the serializing instruction is hard to discover.
Best not to waste your time on this.