When writing x86-64 user-space assembly and comparing two pointer values, should we use signed conditions such as jl and jge, or unsigned ones?
TL;DR: intptr_t might be best in some cases because the signed-overflow boundary is in the middle of the "non-canonical hole". Treating a value as negative instead of huge may be better if wrapping from zero to 0xFF...FF or vice versa is possible, but pointer+size for any valid size can't wrap a value from INT64_MAX to INT64_MIN.
Otherwise you probably want unsigned for the "high half" (high bit set) to compare as above the low half.
It depends exactly what you want to know about two pointers!
A previous edit of your question gave ptrA < ptrB - C as the use-case you're interested in, e.g. an overlap check with ptrA < ptrB - sizeA, or maybe an unrolled SIMD loop condition with current < endp - loop_stride. Discussion in comments has been about this kind of thing, too.
So what you're really doing is forming ptrB - C as a pointer that's potentially outside the object you were interested in, and which may have wrapped around (unsigned). (Good observation that stuff like this may be why C and C++ make it UB to form pointers outside of objects, but they do allow one-past-the-end, which has unsigned wrapping at the end of the highest page, if the kernel even lets you map it.) Anyway, you want to use a signed comparison so it "still works" without having to check for wraparound, or check the sign of C, or any of that stuff. This is still a lot more specific than most of the question.
Yes, for "related" pointers derived from the same object with reasonable sizes, signed compare is safe on current hardware, and could only break on unlikely / distant-future machines with hardware support for full 64-bit virtual addresses. Overlap checks are also safe with signed if both pointers are in the low half of the canonical range, which I think is the case for user-space addresses on all the mainstream x86-64 OSes.
As you point out, unsigned ptrA < ptrB - C can "fail" if ptrB - C wraps (unsigned wraparound). This can happen in practice for static addresses that are closer to 0 than the value of C.
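To make that failure concrete, here is a minimal C sketch of the two ways to evaluate ptrA < ptrB - C on raw address values. The function names and example addresses are made up for illustration, and the casts to int64_t assume the usual two's-complement conversion that mainstream compilers implement (which is what cmp / jl does in asm).

```c
#include <stdint.h>
#include <stdbool.h>

/* ptrA < ptrB - C, compared as unsigned (sub / cmp / jb in asm).
   The subtraction wraps modulo 2^64 if C is larger than ptrB. */
static bool below_unsigned(uint64_t ptrA, uint64_t ptrB, uint64_t c) {
    return ptrA < ptrB - c;
}

/* Same subtraction, same bit pattern, but compared as signed (sub / cmp / jl).
   Assumption: uint64_t -> int64_t conversion is two's complement. */
static bool less_signed(uint64_t ptrA, uint64_t ptrB, uint64_t c) {
    return (int64_t)ptrA < (int64_t)(ptrB - c);
}
```

With ptrA = 0x400000, ptrB = 0x401000 and C = 0x800000, the subtraction produces 0xFFFFFFFFFFC01000. The unsigned version treats that as a huge address and reports ptrA as below it, while the signed version sees a negative value and reports false.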
Usually the low 64kiB is not mappable (e.g. on Linux, most distros ship with the sysctl vm.mmap_min_addr = 65536, or at least 4096, but some systems have it = 0 for WINE). Still, I think it's normal for kernels to not give you a zero page unless you request that address specifically, because mapping it stops NULL deref from faulting (and faulting is normally highly desirable for security and debuggability reasons).
This means the loop_stride case is usually not a problem. The sizeA version can usually be done with ptrA + sizeA < ptrB, and as a bonus you can use LEA to add instead of copy + subtract. ptrA+sizeA is guaranteed not to wrap unless you have objects that wrap their pointer from 2^64-1 to zero (which works even with a page-split load at the wraparound, but you'll never see it in a "normal" system because addresses are normally treated as unsigned).
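The rearranged check can be sketched in C as follows. The helper name is hypothetical; the comment shows roughly how it would map onto lea / cmp / jb in asm.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: does the object at ptrA (sizeA bytes) end below ptrB?
   Roughly: lea tmp, [ptrA + sizeA] / cmp tmp, ptrB / jb.
   The sum only wraps if the object itself wraps past 2^64-1. */
static bool ends_below(uint64_t ptrA, uint64_t sizeA, uint64_t ptrB) {
    return ptrA + sizeA < ptrB;
}
```

Because the addition moves away from zero, a small ptrB (for example 0x401000 with sizeA = 0x800000) still gives the expected "false" without any wraparound check.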
So when can it fail with a signed compare? When ptrB - C has signed wraparound on overflow. Or if you ever have pointers to high-half objects (e.g. into Linux's vsyscall page), a compare between a high-half and low-half address might give you an unexpected result: you will see "high-half" addresses as less than "low-half" addresses. This happens even though the ptrB - C calculation doesn't wrap.
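A small C sketch of the high-half vs. low-half ordering, using the vsyscall address quoted later in this answer and a made-up low-half address (LOW_ADDR is purely illustrative). As before, the int64_t casts assume two's-complement conversion.

```c
#include <stdint.h>
#include <stdbool.h>

static const uint64_t VSYSCALL_ADDR = 0xffffffffff600000ULL; /* high half */
static const uint64_t LOW_ADDR      = 0x00007fff00001000ULL; /* hypothetical low-half address */

/* cmp / jl: high-half addresses are negative, so they sort first. */
static bool signed_less(uint64_t a, uint64_t b) {
    return (int64_t)a < (int64_t)b;
}

/* cmp / jb: high-half addresses sort above the low half. */
static bool unsigned_below(uint64_t a, uint64_t b) {
    return a < b;
}
```

Here signed_less(VSYSCALL_ADDR, LOW_ADDR) is true while unsigned_below(VSYSCALL_ADDR, LOW_ADDR) is false, with no subtraction or wraparound involved.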
(We're only talking about asm directly, not C, so there's no UB; I'm just using C notation for sub or lea / cmp / jl.)
Signed wraparound can only happen near the boundary between 0x7FFF... and 0x8000.... But that boundary is extremely far from any canonical address. I'll reproduce a diagram of x86-64 address space (for current implementations where virtual addresses are 48 bits) from another answer. See also Why in 64bit the virtual address are 4 bits short (48bit long) compared with the physical address (52 bit long)?.
Remember, x86-64 faults on non-canonical addresses. That means it checks that 48-bit virtual addresses are properly sign-extended to 64 bits, i.e. that bits [63:48] match bit 47 (numbering from 0).
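As an illustration of that rule, here is a hedged C sketch of a canonical-address test. The function name and the va_bits parameter are my own; it checks that bits [63 : va_bits-1] are all zeros or all ones, which is the same condition as "bits [63:48] match bit 47" when va_bits is 48.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: is addr canonical for a va_bits-wide virtual address?
   Assumes 2 <= va_bits <= 64. */
static bool is_canonical(uint64_t addr, unsigned va_bits) {
    uint64_t upper    = addr >> (va_bits - 1);       /* bits [63 : va_bits-1] */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1); /* same field, every bit set */
    return upper == 0 || upper == all_ones;
}
```

For example, with va_bits = 48 the addresses 0x00007fffffffffff and 0xffff800000000000 pass, while 0x0000800000000000 and 0x7fffffffffffffff (right next to the signed-overflow boundary) do not.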
+----------+
| 2^64-1   |   0xffffffffffffffff
| ...      |                       high half of canonical address range
| 2^64-2^47|   0xffff800000000000
+----------+
|          |
| unusable |   Not to scale: this is 2^16 times larger than the top/bottom ranges.
|          |
+----------+
| 2^47-1   |   0x00007fffffffffff
| ...      |                       low half of canonical range
| 0        |   0x0000000000000000
+----------+
Intel has proposed a 5-level page-table extension for 57-bit virtual addresses (i.e. another 9-bit level of tables), but that still leaves most of the address space non-canonical: the low half would end at 2^56, so any canonical address would still be at least 2^63 - 2^56 away from signed wraparound.
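To put numbers on the gap, this C sketch computes the distance from the end of the low canonical half (the first non-canonical address, 2^(va_bits-1)) up to the signed-overflow boundary at 2^63. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: size of the non-canonical region between the top of
   the low half and 0x8000000000000000, for va_bits-wide virtual addresses.
   Assumes 1 <= va_bits <= 64. */
static uint64_t gap_to_signed_wrap(unsigned va_bits) {
    return (UINT64_C(1) << 63) - (UINT64_C(1) << (va_bits - 1));
}
```

That gives 0x7FFF800000000000 for 48-bit addresses and 0x7F00000000000000 for 57-bit addresses, so pointer+size would need an absurdly large size to cross the boundary in either case.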
Depending on the OS, all your addresses might be in the low half or the high half. e.g. on x86-64 Linux, high ("negative") addresses are kernel addresses, while low (signed positive) addresses are user-space. But note that Linux maps the kernel vsyscall page into user space very near the top of virtual address space. (It leaves pages unmapped at the very top, e.g. ffffffffff600000-ffffffffff601000 [vsyscall] in a 64-bit process on my desktop, while the vDSO pages are near the top of the bottom-half canonical range, 0x00007fff.... Even in a 32-bit process where in theory the whole 4GiB is usable by user-space, the vDSO is a page below the highest page, and mmap(MAP_FIXED) didn't work on that highest page. Perhaps because C allows one-past-the-end pointers?)
If you ever take the address of a function or variable in the vsyscall page, you can have a mix of positive and negative addresses. (I don't think anyone ever does that, but it's possible.)
So signed address comparison could be dangerous if you don't have a kernel/user split separating signed positive from signed negative, and your code is running in the distant future when/if x86-64 has been extended to full 64-bit virtual addresses, so an object can span the boundary. The latter seems unlikely, and if you can get a speedup from assuming it won't happen, it's probably a good idea.
This means signed-compare already is dangerous with 32-bit pointers, because 64-bit kernels leave the whole 4GiB usable by user-space. (And 32-bit kernels can be configured with a 3:1 user/kernel split.) There's no unusable non-canonical hole. In 32-bit mode, an object can span the signed-wraparound boundary. (Or in the ILP32 x32 ABI: 32-bit pointers in long mode.)
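A short C sketch of the 32-bit hazard, using a made-up buffer that straddles 0x80000000. The names and addresses are illustrative only, and the int32_t casts assume two's-complement conversion.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical 8 KiB buffer spanning the 32-bit signed-wraparound boundary. */
static const uint32_t BUF_START = 0x7FFFF000u;
static const uint32_t BUF_END   = 0x80001000u;

/* cmp / jl on 32-bit pointers. */
static bool signed_less32(uint32_t a, uint32_t b) {
    return (int32_t)a < (int32_t)b;
}

/* cmp / jb on 32-bit pointers. */
static bool unsigned_below32(uint32_t a, uint32_t b) {
    return a < b;
}
```

A loop condition like current < endp over this buffer would exit immediately with the signed compare, because BUF_END looks negative, but behaves as expected with the unsigned one.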
Performance advantages:
Unlike 32-bit mode, there are no CPUs where jge is faster than jae in 64-bit mode, or any other combo. (And different conditions for setcc / cmovcc never matter.) So any perf diff is only from surrounding code, unless you can do something clever with adc or sbb instead of a cmov or setcc.
Sandybridge-family can macro-fuse test / cmp (and sub, add, and various other non-read-only instructions) with signed or unsigned compares (not all JCC, but this isn't a factor). Bulldozer-family can fuse cmp / test with any JCC.
Core2 can only macro-fuse cmp with unsigned compares, not signed, but Core2 can't macro-fuse at all in 64-bit mode. (It can macro-fuse test with signed-compares in 32-bit mode, BTW.)
Nehalem can macro-fuse test or cmp with signed or unsigned compares (including in 64-bit mode).
Source: Agner Fog's microarch pdf.