Why are segfaults called faults (and not aborts) if they are not recoverable?


Question


My understanding of the terminology is as follows:

1) An interrupt is a notification initiated by the hardware, telling the OS to run one of its handlers.

2) A trap is a notification initiated by software, telling the OS to run one of its handlers.

3) A fault is an exception raised by the processor when an error has occurred but is recoverable.

4) An abort is an exception raised by the processor when an error has occurred and is not recoverable.

Why do we call it a segmentation fault and not a segmentation abort then?

A segmentation fault occurs when your program attempts to access memory that the operating system has not assigned to it, or that it is otherwise not allowed to access.

My experience (primarily while testing C code) is that any time a program throws a segmentation fault, it's back to the drawing board. Is there a scenario where the programmer can actually "catch" the exception and do something useful with it?


Answer 1:


At a CPU level, modern OSes don't use x86 segment limits for memory protection. (And in fact they couldn't even if they wanted to in long mode (x86-64); segment base is fixed at 0 and limit at -1).

OSes use virtual memory page tables, so the real CPU exception on an out-of-bounds memory access is a page fault.

x86 manuals call this a #PF(fault-code) exception; see, for example, the list of exceptions that the add instruction can raise. Fun fact: the x86 exception for access outside of a segment limit is #GP(0).

It's up to the OS's page-fault handler to decide how to handle it. Many #PF exceptions happen as part of normal operation:

  • a copy-on-write mapping got written to: copy the page, mark it writable in the page table, then return to user-space to retry the instruction that faulted.
  • soft page fault: the kernel was lazy and hadn't actually updated the page table to reflect a mapping the process made (e.g. mmap(2) without MAP_POPULATE).
  • hard page fault: the kernel has to find some physical memory and read the data in from disk (from the backing file for a file mapping, or from the swap file/partition for anonymous pages).

After sorting out any of the above, the kernel updates the page table that the CPU reads on its own, and invalidates the stale TLB entry if necessary (e.g. when a valid but read-only entry becomes valid + read-write).
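
To make the lazy-mapping case concrete, here is a minimal C sketch (my own illustration, not from the original answer; it assumes the Linux-specific MAP_ANONYMOUS and MAP_POPULATE flags). The lazy mapping only gets its pages wired in as memset touches them, roughly one soft page fault per page, while MAP_POPULATE asks the kernel to prefault the pages up front:

    /* lazy_vs_populate.c: hypothetical demo of soft page faults on first touch */
    #define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS / MAP_POPULATE on glibc */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 16 * 1024 * 1024;   /* 16 MiB */

        /* Lazy: the kernel just records the mapping; pages are wired in on
         * first access via soft page faults. */
        char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Eager: MAP_POPULATE asks the kernel to prefault the pages now. */
        char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (lazy == MAP_FAILED || eager == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(lazy, 1, len);   /* takes a soft page fault per newly touched page */
        memset(eager, 1, len);  /* should take few or no soft faults */

        munmap(lazy, len);
        munmap(eager, len);
        return 0;
    }

On Linux you can watch the difference in minor page-fault counts with /usr/bin/time -v or perf stat -e minor-faults.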

Only if the kernel finds that the process really doesn't logically have anything mapped to that address (or that it's a write to a read-only mapping) will the kernel deliver a SIGSEGV to the process. This is purely a software thing, after sorting out the cause of the hardware exception.


The English text for SIGSEGV (from strsignal(3)) is "Segmentation fault" on Unix/Linux systems, so that's what's printed (by the shell) when a child process dies from that signal.
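
For example, a trivial program like this one (my own demo) dies from SIGSEGV, and it's the parent shell that prints the message afterwards:

    /* segv_demo.c: NULL dereference -> #PF in hardware -> SIGSEGV from the kernel */
    int main(void) {
        int *p = 0;
        return *p;   /* the process is killed before main can return normally */
    }

Run from an interactive shell, the only output is the shell's report, typically "Segmentation fault" (with "(core dumped)" appended if core dumps are enabled).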

The term is well understood and still used, even though it mostly survives for historical reasons and the hardware isn't actually using segmentation.

Note that you also get a SIGSEGV for stuff like trying to execute privileged instructions in user-space (like wbinvd or wrmsr (write model-specific register)). At a CPU level, the x86 exception is #GP(0) for privileged instructions when you're not in ring 0 (kernel mode).

You also get SIGSEGV for misaligned SSE instructions that require alignment (like movaps), although some Unixes on other platforms send SIGBUS for misaligned-access faults (e.g. Solaris on SPARC).
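
As a tiny illustration of the privileged-instruction case (my own sketch, assuming x86-64 Linux and GNU inline asm), this program dies with SIGSEGV even though it never touches a bad address; the underlying hardware exception is #GP(0):

    /* gp_demo.c: executing a ring-0-only instruction in user-space */
    int main(void) {
        __asm__ volatile("wbinvd");  /* privileged: #GP(0) -> kernel sends SIGSEGV */
        return 0;                    /* never reached */
    }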


Why do we call it a segmentation fault and not a segmentation abort then?

It is recoverable. It doesn't crash the whole machine or kernel; it just means that a user-space process tried to do something the kernel doesn't allow.

Even for the process that segfaulted, it can be recoverable. That's why SIGSEGV is a catchable signal, unlike SIGKILL. Usually you can't just resume execution, but you can usefully record where the fault happened (e.g. print a precise error message and even a stack backtrace).

The signal handler for SIGSEGV could longjmp or whatever. Or if the SIGSEGV was expected, then modify the code or the pointer used for the load, before returning from the signal handler. (e.g. for a Meltdown exploit, although there are much more efficient techniques that do the chained loads in the shadow of a mispredict or something else that suppresses the exception, instead of actually letting the CPU raise an exception and catching the SIGSEGV the kernel delivers)

Most programming languages (other than assembly) aren't low-level enough to give well-defined behaviour around an access that might segfault, in a way that would let you write a handler that recovers and carries on. That's why you usually don't do anything more than print an error message (and maybe a stack backtrace) in a SIGSEGV handler, if you install one at all.
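
As a sketch of what handling an "expected" SIGSEGV can look like in C (my own illustration, assuming a POSIX system with sigaction/sigsetjmp/siglongjmp; the usual caveats about compiler optimizations apply, hence the volatile pointer), here is a probe that reports whether an address is readable instead of crashing:

    /* probe_addr.c: recover from an expected SIGSEGV with sigsetjmp/siglongjmp */
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf probe_env;

    static void segv_handler(int sig) {
        (void)sig;
        siglongjmp(probe_env, 1);           /* jump back past the faulting load */
    }

    /* Returns 1 if *addr is readable, 0 if reading it raised SIGSEGV. */
    static int address_is_readable(const volatile char *addr) {
        struct sigaction sa = {0}, old;
        sa.sa_handler = segv_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, &old);

        int ok;
        if (sigsetjmp(probe_env, 1) == 0) { /* 1: save the signal mask too */
            (void)*addr;                    /* may fault; handler jumps back here */
            ok = 1;
        } else {
            ok = 0;
        }
        sigaction(SIGSEGV, &old, NULL);     /* restore the previous disposition */
        return ok;
    }

    int main(void) {
        char local = 'x';
        printf("stack byte readable: %d\n", address_is_readable(&local));
        printf("NULL readable:       %d\n", address_is_readable(NULL));
        return 0;
    }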


Some JIT compilers for sandboxed languages (like JavaScript) use hardware memory-access checks to eliminate explicit NULL-pointer checks. In the normal case there's no fault, so it doesn't matter how slow the faulting case is.

A Java JVM can turn a SIGSEGV received by a thread of the JVM into a NullPointerException for the Java code it's running, without any problems for the JVM.

  • Effective Null Pointer Check Elimination Utilizing Hardware Trap: a research paper on this for Java, from three IBM researchers.

  • SableVM: 6.2.4 Hardware Support on Various Architectures, about NULL pointer checks.

A further trick is to put the end of an array at the end of a page (followed by a large-enough unmapped region), so bounds-checking on every access is done for free by the hardware. If you can statically prove the index is always positive and fits in 32 bits, you're all set. (A sketch of the idea follows after the reference below.)

  • Implicit Java Array Bounds Checking on 64-bit Architectures. They talk about what to do when array size isn't a multiple of the page size, and other caveats.
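
Here is a rough C sketch of that guard-page idea (my own, assuming Linux mmap/mprotect semantics rather than anything from the JVM papers): the array is right-justified against a PROT_NONE page, so the first out-of-bounds element lands on unmapped memory and faults in hardware:

    /* guard_page.c: let the MMU do the bounds check for arr[i] */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t n = 1024;                        /* array of 1024 ints */
        size_t bytes = n * sizeof(int);
        size_t data_len = (bytes + page - 1) / page * page;  /* whole pages */

        /* Reserve the data pages plus one guard page, all inaccessible at first. */
        char *base = mmap(NULL, data_len + page, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Make only the data pages usable; the final page stays PROT_NONE. */
        if (mprotect(base, data_len, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect"); return 1;
        }

        /* Right-justify the array so arr + n is exactly the guard page. */
        int *arr = (int *)(base + data_len - bytes);

        arr[0] = 1;                 /* fine */
        arr[n - 1] = 2;             /* fine: last valid element */
        puts("in-bounds accesses OK");
        arr[n] = 3;                 /* out of bounds: hits the guard page -> SIGSEGV */
        return 0;
    }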

Trap vs. abort

I don't think there's standard terminology for making that distinction. It depends on what kind of recovery you're talking about. Obviously the OS can keep running after anything user-space can make the hardware do, otherwise unprivileged user-space could crash the machine.

Related: on When an interrupt occurs, what happens to instructions in the pipeline?, Andy Glew (a CPU architect who worked on Intel's P6 microarchitecture) says that a "trap" is basically any interrupt caused by the code that's running (rather than by an external signal), and that it happens synchronously (e.g. when a faulting instruction reaches the retirement stage of the pipeline without an earlier branch mispredict or other exception being detected first).

"Abort" isn't standard CPU-architecture terminology. Like I said, you want the OS to be able to continue no matter what, and only hardware failure or kernel bugs normally prevent that.

AFAIK, "abort" is not very standard operating-systems terminology either. Unix has signals, and some of them are uncatchable (like SIGKILL and SIGSTOP), but most can be caught.

SIGABRT can be caught by a signal handler. The process exits if the handler returns, so if you don't want that you can longjmp out of it. But AFAIK no error condition raises SIGABRT; it's only sent manually by software, e.g. by calling the abort() library function. (It often results in a stack backtrace.)
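
A small demonstration of that (my own, not from the answer): catch the SIGABRT raised by abort() and siglongjmp out of the handler so the process keeps running:

    /* catch_abort.c: surviving abort() by jumping out of the SIGABRT handler */
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    static sigjmp_buf env;

    static void abrt_handler(int sig) {
        (void)sig;
        siglongjmp(env, 1);     /* returning instead would let the process die */
    }

    int main(void) {
        signal(SIGABRT, abrt_handler);
        if (sigsetjmp(env, 1) == 0) {
            abort();                        /* raises SIGABRT */
        }
        puts("caught SIGABRT and kept running");
        return 0;
    }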


x86 exception terminology

If you look at x86 manuals or this exception table on the osdev wiki, there are specific meanings in this context (thanks to @MargaretBloom for the descriptions):

  • trap: raised after an instruction completes successfully; the return address points after the trapping instruction. The #DB debug and #OF overflow (from the into instruction) exceptions are traps (though some sources of #DB are faults instead). int 0x80 and other software-interrupt instructions are also traps, and syscall behaves similarly (but it puts the return address in rcx instead of pushing it; syscall is not an exception, so it's not really a trap in this sense).

  • fault: the attempted execution is rolled back; the return address points to the faulting instruction, so it can be retried once the cause is fixed. (Most exception types are faults.)

  • abort: the return address points to an unrelated location (e.g. for #DF double-fault and #MC machine-check). A triple fault can't be handled at all; it's what happens when the CPU hits an exception while trying to run the double-fault handler, and it really does stop the whole CPU.

Note that even Intel CPU architects like Andy Glew sometimes use the term "trap" more generally, meaning (I think) any synchronous exception, when discussing computer-architecture theory. Don't expect people to stick to the above terminology unless you're actually talking about handling specific exceptions on x86. It is useful and sensible terminology, and you could use it in other contexts, but if you want to make the distinction you should clarify what you mean by each term so everyone's on the same page.




Answer 2:


There are two types of exceptions: faults and traps. When a fault occurs, the instruction can be restarted. When a trap occurs, the instruction cannot be restarted.

For example, when a page fault occurs, the operating system's exception handler loads the missing page and then restarts the instruction that caused the fault.

If the processor defines a "segmentation fault" as a fault, then the instruction causing the exception is restartable; but it is possible that the operating system's handler will choose not to restart it.



Source: https://stackoverflow.com/questions/49396346/why-are-segfaults-called-faults-and-not-aborts-if-they-are-not-recoverable
