Is a mov to a segmentation register slower than a mov to a general purpose register?

我与影子孤独终老i submitted on 2019-12-02 01:51:10

mov %eax, %ebx between general-purpose registers is one of the most common instructions. Modern hardware supports it extremely efficiently, often with special cases that don't apply to any other instruction. On older hardware, it's always been one of the cheapest instructions.

On Ivybridge and later, it doesn't even need an execution unit and has zero latency: it's handled in the register-rename stage. (See Can x86's MOV really be "free"? Why can't I reproduce this at all?) Even on earlier CPUs, it's 1 uop for any ALU port (so typically 3 or 4 per clock throughput).

On AMD Piledriver / Steamroller, mov r32,r32 and r64,r64 can run on AGU ports as well as ALU ports, giving it 4 per clock throughput vs. 2 per clock for add, or for mov on 8 or 16-bit registers (which have to merge into the destination).


mov to a segment reg is a fairly rare instruction in typical 32- and 64-bit code. It is part of what kernels do for every system call (and probably interrupts), though, so making it efficient will speed up the fast path for system-call and I/O intensive workloads. So even though it appears in only a few places, it can run a fair amount. But it's still of minor importance compared to mov r,r!

mov to a segment reg is slow: it triggers a load from the GDT or LDT to update the descriptor cache, so it's microcoded.

This is the case even in x86-64 long mode; the segment base/limit fields in the GDT entry are ignored, but it still has to update the descriptor cache with other fields from the segment descriptor, including the DPL (descriptor privilege level) which does apply to data segments.
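
To make the descriptor load a bit more concrete: the 16-bit value written to a segment register is a selector, and the CPU has to pick it apart to find which descriptor to fetch. The following is a minimal sketch in C of that decoding (index, table indicator, RPL, and the byte offset of the 8-byte descriptor inside the GDT or LDT). The struct and function names are invented for this illustration; only the bit layout is architectural.

```c
#include <stdint.h>

/* Fields of an x86 segment selector (names are hypothetical, layout is architectural). */
typedef struct {
    unsigned index;         /* bits 15..3: which descriptor in the table */
    unsigned uses_ldt;      /* bit 2 (TI): 0 = GDT, 1 = LDT */
    unsigned rpl;           /* bits 1..0: requested privilege level */
    unsigned table_offset;  /* byte offset of the descriptor: index * 8 */
} selector_fields;

/* Split a selector into the fields the CPU uses to locate the descriptor. */
static selector_fields decode_selector(uint16_t sel) {
    selector_fields f;
    f.index = sel >> 3;
    f.uses_ldt = (sel >> 2) & 1u;
    f.rpl = sel & 3u;
    f.table_offset = f.index * 8u;
    return f;
}
```

For example, decode_selector(0x2b) gives index 5, GDT, RPL 3, so the load behind the mov would fetch the 8 bytes at offset 40 in the GDT.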


Agner Fog's instruction tables list uop counts and throughput for mov sr, r (Intel syntax, mov to segment reg) for Nehalem and earlier CPUs. He stopped testing seg regs for later CPUs because it's obscure and not used by compilers (or humans optimizing by hand), but the counts for SnB-family are probably somewhat similar. (InstLatx64 doesn't test seg regs either, e.g. not in this Sandybridge instruction-timing test)

MOV sr,r on Nehalem (presumably tested in protected mode or long mode):

  • 6 fused-domain uops for the front end
  • 3 uops for ALU ports (p015)
  • 3 uops for the load port (p2)
  • throughput: 1 per 13 cycles (for repeating this instruction thousands of times in a giant loop). IDK if the CPU renames segment regs. If not, it might stall later loads (or all later instructions?) until the descriptor caches were updated and the mov to sr instruction retires. i.e. I'm not sure how much impact this would have on out-of-order execution of surrounding code.

Other CPUs are similar:

  • PPro/PII/PIII (original P6): 8 uops for p0, no throughput listed. 5 cycle latency. (Remember this uarch was designed before its 1995 release, when 16-bit code was still common. This is why P6-family does partial-register renaming for integer registers (AL,AH separate from AX).)
  • Pentium 4: 4 uops + 4 microcode, 14c throughput.

    Latency = 12c in 16-bit real or vm86 mode, 24c in 32-bit protected mode. 12c is what he lists in the main table, so presumably his latency numbers for other CPUs are real-mode latencies, too (where writing a segment reg just sets the base = sreg<<4).

    Reading a segment reg is slow on P4, unlike other CPUs: 4 uops + 4 microcode, 6c throughput

  • P4 Prescott: 1 uop + 8 microcode. 27c throughput. Reading a segment reg = 8c throughput.

  • Pentium M: 8 uops for p0, same as PIII.

  • Conroe/Merom and Wolfdale/Penryn (first and second-gen Core2): 8 fused-domain uops, 4 ALU (p015), 4 load/AGU (p2). one per 16 cycle throughput, the slowest of any CPU where Agner tested it.

  • Skylake (my testing, reloading each seg reg with the value I read from it outside the loop): in a loop with just dec/jnz: 10 fused-domain uops (front-end), 6 unfused-domain (execution units). one per 18c throughput.

    In a loop writing to 4 different seg regs (ds/es/fs/gs) all with the same selector: four mov per 25c throughput, 6 fused/unfused domain uops. (Perhaps some are getting cancelled?)

    In a loop writing to ds 4 times: one iter per 72c (one mov ds,eax per 18c). Same uop count: ~6 fused and unfused per mov.

    This seems to indicate that Skylake does not rename segment regs: a write to one has to finish before the next write can start.

  • K7/K8/K10: 6 "ops", 8c throughput.

  • Atom: 7 uops, 21c throughput

  • Via Nano 2000/3000: unlisted uops, 20 cycles throughput and latency. Nano 3000 has 0.5 cycle throughput for reading a seg reg (mov r, sr). No latency listed, which is weird. Maybe he's measuring seg-write latency in terms of when you can use it for a load? like mov eax, [ebx] / mov ds, eax in a loop?
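
For reference, here is a rough sketch of the kind of loop behind the Skylake numbers in the list above: read DS once, then write the same selector back repeatedly and divide elapsed time by the iteration count. It is not the exact test I ran. It assumes GCC or clang on x86 (GNU inline asm, __rdtsc from x86intrin.h), the function name is made up, and it counts TSC reference ticks rather than core clock cycles, so treat the result as approximate unless the core is running at its TSC frequency. Performance counters are the better tool for real measurements.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Average TSC ticks per "mov %reg, %ds", reloading DS with its own current value.
 * Hypothetical helper for illustration; includes loop overhead. */
static double tsc_ticks_per_mov_ds(uint64_t iters) {
    assert(iters > 0);

    unsigned int sel;
    /* Read the current DS selector once, outside the timed loop. */
    __asm__ volatile("mov %%ds, %0" : "=r"(sel));

    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        /* Writing back the selector DS already holds is legal in user space. */
        __asm__ volatile("mov %0, %%ds" : : "r"(sel) : "memory");
    }
    uint64_t end = __rdtsc();

    return (double)(end - start) / (double)iters;
}
```

Calling tsc_ticks_per_mov_ds(1000000) should land in the same ballpark as the per-mov throughput above on similar hardware. Don't adapt this to FS inside a normal C program: glibc keeps thread-local storage behind the FS base, and reloading the FS selector replaces that base.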

Weird Al was right, It's All About the Pentiums

In-order Pentium (P5 / PMMX) had cheaper mov-to-sr: Agner lists it as taking ">= 2 cycles", and non-pairable. (P5 was in-order 2-wide superscalar with some pairing rules on which instructions could execute together). That seems cheap for protected mode, so maybe the 2 is in real mode and protected mode is the greater-than? We know from his P4 table notes that he did test stuff in 16-bit mode back then.


Agner Fog's microarch guide says that Core2 / Nehalem can rename segment registers (Section 8.7 Register renaming):

All integer, floating point, MMX, XMM, flags and segment registers can be renamed. The floating point control word can also be renamed.

(Pentium M could not rename the FP control word, so changing the rounding mode blocks OoO exec of FP instructions. e.g. all earlier FP instructions have to finish before it can modify the control word, and later ones can't start until after. I guess segment regs would be the same but for load and store uops.)

He says that Sandybridge can "probably" rename segment regs, and Haswell/Broadwell/Skylake can "perhaps" rename them. My quick testing on SKL shows that writing the same segment reg repeatedly is slower than writing different segment regs, which indicates that they're not fully renamed. It seems like an obvious thing to drop support for, because they're very rarely modified in normal 32 / 64-bit code.

And each seg reg is usually only modified once at a time, so multiple dep chains in flight for the same segment register is not very useful. (i.e. you won't see WAW hazards for segment regs in Linux, and WAR is barely relevant because the kernel won't use user-space's DS for any memory references in a kernel entry-point. I think interrupts are serializing, but entering the kernel via syscall could maybe still have a user-space load or store in flight but not executed yet.)

In chapter 2, which explains out-of-order exec in general (all CPUs except P1 / PMMX), 2.2 register renaming says that "possibly segment registers can be renamed", but IDK if he means that some CPUs do and some don't, or if he's not sure about some old CPUs. He doesn't mention seg reg renaming in the PII/PIII or Pentium-M sections, so I can't tell you about the old 32-bit-only CPUs you're apparently asking about. (And he doesn't have a microarch guide section for AMD before K8.)

You could benchmark it yourself if you're curious, with performance counters. (See Are loads and stores the only instructions that gets reordered? for an example of how to test for blocking out-of-order execution, and Can x86's MOV really be "free"? Why can't I reproduce this at all? for basics on using perf on Linux to do microbenchmarks on tiny loops.)


Reading a segment reg

mov from a segment reg is relatively cheap: it only modifies a GP register, and CPUs are good at writes to GP registers, with register-renaming etc. Agner Fog found it was a single uop on Nehalem. Fun fact, on Core2 / Nehalem it runs on the load port, so I guess that's where segment regs are stored on that microarchitecture.

(Except on P4: apparently reading seg regs was expensive there.)

A quick test on my Skylake (in long mode) shows that mov eax, fs (or cs or ds or whatever) is 2 uops, one of which only runs on port 1, and the other can run on any of p0156. (i.e. it runs on ALU ports). It has a throughput of 1 per clock, bottlenecked on port 1.


You normally only mess with FS or GS for thread-local storage, and you don't do it with mov to FS: you make a system call to have the OS use wrfsbase to modify the segment base in the cached segment descriptor.
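
As a hedged illustration of changing a segment base through the OS instead of with mov to a segment reg, the sketch below uses the Linux arch_prctl system call on x86-64. It deliberately uses GS rather than FS, on the assumption that user-space GS is otherwise unused in the process, so the thread library's FS-based TLS is left alone. The helper names are invented for this example, and how the kernel performs the update internally is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <asm/prctl.h>     /* ARCH_SET_GS */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_arch_prctl */
#include <unistd.h>        /* syscall() */

/* Ask the kernel to point this thread's GS base at p. Returns 0 on success.
 * Hypothetical helper; assumes x86-64 Linux. */
static long set_gs_base(void *p) {
    assert(p != NULL);
    return syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)p);
}

/* Load the 8 bytes at offset 0 from the GS base. */
static uint64_t read_gs_qword0(void) {
    uint64_t v;
    __asm__ volatile("movq %%gs:0, %0" : "=r"(v) : : "memory");
    return v;
}
```

After set_gs_base(&x) succeeds, read_gs_qword0() returns the current value of the 64-bit variable x, without any write to the GS selector itself.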


"N.B I'm concerned with old x86 linux cpus, not modern x86_64 cpus, where segmentation works differently."

You said "Linux", so I assume you mean protected mode, not real mode (where segmentation works completely differently). Probably mov sr, r decodes differently in real mode, but I don't have a test setup where I can profile with performance counters for real or VM86 mode running natively.
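
To show how different real mode is: there, a segment-register write involves no descriptor table at all, and the base is simply the register value shifted left by 4. A minimal sketch of that address arithmetic in C follows; the function names are made up for the example, and it assumes no wraparound at 1 MiB (i.e. A20 enabled).

```c
#include <stdint.h>

/* Real-mode segment base: just the 16-bit segment register value << 4. */
static uint32_t real_mode_base(uint16_t sreg) {
    return (uint32_t)sreg << 4;
}

/* Linear address of sreg:offset in real mode (no 1 MiB wraparound modeled). */
static uint32_t real_mode_linear(uint16_t sreg, uint16_t offset) {
    return real_mode_base(sreg) + offset;
}
```

For example, real_mode_linear(0xB800, 0) is 0xB8000, computed with a shift and an add and no memory access, which fits with the lower latency listed for real or vm86 mode on P4.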

FS and GS in long mode work basically the same as in protected mode; it's the other seg regs that are "neutered" in long mode. I think Agner Fog's Core2 / Nehalem numbers are probably similar to what you'd see in a PIII in protected mode. They're part of the same microarchitecture family. I don't think we have a useful number for P5 Pentium segment register writes in protected mode.

(Sandybridge was the first of a new family derived from P6-family with significant internal changes, and some ideas from P4 implemented a different (better) way, e.g. SnB's decoded-uop cache is not a trace cache. But more importantly, SnB uses a physical register file instead of keeping values right in the ROB, so its register renaming machinery is different.)
