Question
How does Linux determine the address of another process to execute with a syscall? Like in this example?
mov rax, 59
mov rdi, progName
syscall
There seems to be some confusion about my question. To clarify: I am asking how syscall works, independently of the registers or arguments passed. How does it know where to jump, where to return, etc. when another process is called?
Answer 1:
syscall
The syscall instruction is really just an Intel/AMD CPU instruction. Here is the synopsis:
IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
THEN #UD;
FI;
RCX ← RIP;
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH
CS.Base ← 0;
CS.Limit ← FFFFFH;
CS.Type ← 11;
CS.S ← 1;
CS.DPL ← 0;
CS.P ← 1;
CS.L ← 1;
CS.D ← 0;
CS.G ← 1;
CPL ← 0;
SS.Selector ← IA32_STAR[47:32] + 8;
SS.Base ← 0;
SS.Limit ← FFFFFH;
SS.Type ← 3;
SS.S ← 1;
SS.DPL ← 0;
SS.P ← 1;
SS.B ← 1;
SS.G ← 1;
The most important parts are the two operations that save and then replace the RIP register:
RCX ← RIP
RIP ← IA32_LSTAR
In other words, there must be code at the address stored in IA32_LSTAR (a model-specific register that the kernel sets up at boot), and RCX holds the return address.
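As a rough illustration, the RIP and RFLAGS handling from the synopsis can be modelled in plain C. The struct cpu_model type and the model_syscall and model_sysret functions are hypothetical names invented for this sketch. The sysret half is a simplification: the real instruction also masks reserved RFLAGS bits and reloads CS and SS. The sketch only shows why the kernel entry address comes from IA32_LSTAR and why RCX is enough to get back to user space.

```c
#include <stdint.h>

/* Toy model of the registers SYSCALL touches; not real CPU state. */
struct cpu_model {
    uint64_t rip, rcx, r11, rflags;
    uint64_t ia32_lstar;  /* MSR: kernel entry point */
    uint64_t ia32_fmask;  /* MSR: RFLAGS bits to clear on entry */
};

/* Apply the RIP/RFLAGS part of the SYSCALL pseudocode.
 * next_rip is the address of the instruction following SYSCALL. */
void model_syscall(struct cpu_model *cpu, uint64_t next_rip)
{
    cpu->rcx = next_rip;              /* RCX <- RIP (return address) */
    cpu->rip = cpu->ia32_lstar;       /* RIP <- IA32_LSTAR */
    cpu->r11 = cpu->rflags;           /* R11 <- RFLAGS */
    cpu->rflags &= ~cpu->ia32_fmask;  /* RFLAGS <- RFLAGS AND NOT(IA32_FMASK) */
}

/* Simplified SYSRET: go back to where SYSCALL came from. */
void model_sysret(struct cpu_model *cpu)
{
    cpu->rip = cpu->rcx;     /* RIP <- RCX */
    cpu->rflags = cpu->r11;  /* RFLAGS <- R11 (reserved-bit masking omitted) */
}
```

Nothing in this model looks at RAX or the argument registers, which matches the point of the question: the jump target is fixed by the MSR, not by the caller.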
The CS and SS segments are also reloaded so that the kernel code runs at CPU privilege level 0 (the most privileged level).
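The two selector lines of the synopsis are simple arithmetic on one MSR, which a short C sketch can make concrete. The struct syscall_selectors type and the selectors_from_star function are hypothetical names, and the 0x10 used in the usage note is only an example value for IA32_STAR[47:32].

```c
#include <stdint.h>

/* Selector values SYSCALL loads, derived from the IA32_STAR MSR. */
struct syscall_selectors {
    uint16_t cs;
    uint16_t ss;
};

struct syscall_selectors selectors_from_star(uint64_t ia32_star)
{
    uint16_t base = (uint16_t)((ia32_star >> 32) & 0xFFFF); /* IA32_STAR[47:32] */
    struct syscall_selectors s;
    s.cs = base & 0xFFFC;         /* low two bits (RPL) forced to 0 */
    s.ss = (uint16_t)(base + 8);  /* the descriptor right after CS */
    return s;
}
```

With 0x10 in bits 47:32, the function yields CS = 0x10 and SS = 0x18. The kernel therefore has to place its code and stack descriptors next to each other in the GDT.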
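The #UD (invalid opcode) exception is raised if the CPU is not in 64-bit mode, if the operating system has not enabled syscall (the SCE bit in IA32_EFER), or if the instruction doesn't exist on that CPU.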
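The first line of the synopsis can be read as a three-way check, sketched here in C. The function name syscall_raises_ud is hypothetical. The bit positions (SCE is bit 0 and LMA is bit 10 of IA32_EFER) come from the Intel/AMD manuals rather than from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFER_SCE (1ULL << 0)   /* SYSCALL/SYSRET enabled by the OS */
#define EFER_LMA (1ULL << 10)  /* long mode active */

/* Mirror of: IF (CS.L != 1) or (EFER.LMA != 1) or (EFER.SCE != 1) THEN #UD */
bool syscall_raises_ud(bool cs_l, uint64_t ia32_efer)
{
    return !cs_l
        || !(ia32_efer & EFER_LMA)
        || !(ia32_efer & EFER_SCE);
}
```

A 64-bit Linux process normally satisfies all three conditions, so user code never sees this exception in practice.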
How is RAX interpreted?
This is just an index into a table of kernel function pointers. First the kernel does a bounds check (and returns -ENOSYS if RAX > __NR_syscall_max), then dispatches to (C syntax) sys_call_table[rax](rdi, rsi, rdx, r10, r8, r9);
; Intel-syntax translation of Linux 4.12 syscall entry point
... ; save user-space registers etc.
call [sys_call_table + rax * 8] ; dispatch to sys_execve() or whatever kernel C function
;;; execve probably won't return via this path, but most other calls will
... ; restore registers except RAX return value, and return to user-space
Modern Linux is more complicated in practice because of workarounds for x86 vulnerabilities like Meltdown and L1TF, which change the page tables so that most of kernel memory isn't mapped while user-space is running. The above code is a literal translation (from AT&T syntax) of call *sys_call_table(, %rax, 8) from ENTRY(entry_SYSCALL_64) in Linux 4.12 arch/x86/entry/entry_64.S (before Spectre/Meltdown mitigations were added). Also related: "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" has some more details about the kernel side of system-call dispatching.
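The bounds check followed by an indexed call through a function-pointer table can be tried out in user space. This is a minimal sketch, not kernel code: toy_call_table, toy_dispatch, and the two handlers are hypothetical stand-ins for sys_call_table and the real sys_* functions.

```c
#include <errno.h>

/* Every handler takes six register-sized arguments, like the kernel table. */
typedef long (*toy_syscall_fn)(long, long, long, long, long, long);

long toy_add(long a, long b, long c, long d, long e, long f)
{
    (void)c; (void)d; (void)e; (void)f;   /* unused argument slots */
    return a + b;
}

long toy_negate(long a, long b, long c, long d, long e, long f)
{
    (void)b; (void)c; (void)d; (void)e; (void)f;
    return -a;
}

/* The call number is simply an index into this array. */
const toy_syscall_fn toy_call_table[] = { toy_add, toy_negate };
#define TOY_NR_MAX (sizeof toy_call_table / sizeof toy_call_table[0] - 1)

long toy_dispatch(unsigned long nr,
                  long a1, long a2, long a3, long a4, long a5, long a6)
{
    if (nr > TOY_NR_MAX)
        return -ENOSYS;                   /* unknown call number */
    return toy_call_table[nr](a1, a2, a3, a4, a5, a6);
}
```

Calling toy_dispatch(0, 2, 3, 0, 0, 0, 0) runs toy_add, while any number above TOY_NR_MAX returns -ENOSYS without touching the table.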
Fast?
The instruction is said to be fast. This is because in the old days one had to use a software interrupt such as int 0x80. An interrupt switches to the kernel stack, pushes several registers onto it, and uses the rather slow iret to leave the exception state and return to the address just after the interrupt. This is generally much slower.
With syscall you can avoid most of that overhead. However, for what you're asking, this is not really relevant.
Another instruction used alongside syscall is swapgs. It gives the kernel a way to access its own data and stack. You should look at the Intel/AMD documentation about those instructions for more details.
New Process?
The Linux system has what it calls a task table. Each process and each thread within a process is actually called a task.
When you create a new process, Linux creates a task. For that to work, it runs code that does things such as:
- Make sure the executable exists
- Set up a new task (including parsing the ELF program headers from that executable to create memory mappings in the newly created virtual address space)
- Allocate a stack buffer
- Load the first few blocks of the executable (as an optimization for demand paging), allocating some physical pages for the virtual pages to map to
- Set the start address in the task (the ELF entry point from the executable)
- Mark the task as ready to run (Linux calls this state "running", whether or not the task is currently on a CPU)
This is, of course, super simplified.
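From user space, the steps above are triggered by two separate calls on Linux: fork() creates the new task, and execve() replaces that task's program image with the new executable. The sketch below assumes /bin/sh exists, and the helper name run_in_child is hypothetical. Unlike the snippet in the question, it also passes the argv and envp arguments that execve expects in RSI and RDX.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "/bin/sh -c <command>" in a child task.
 * Returns the child's exit status, or -1 on error. */
int run_in_child(const char *command)
{
    pid_t pid = fork();                 /* kernel creates the new task */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        char *const argv[] = { "sh", "-c", (char *)command, NULL };
        char *const envp[] = { NULL };
        execve("/bin/sh", argv, envp);  /* replaces this task's program image */
        _exit(127);                     /* only reached if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

On success execve never returns to the caller: the kernel rewrites the saved user-space RIP to the new program's entry point, so the "return" from the system call lands in the new program.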
The start address is defined in your ELF binary. The kernel really only needs to determine that one address, save it as the task's current RIP, and "return" to user-space. The normal demand-paging mechanism takes care of the rest: if the code is not yet loaded, the CPU generates a #PF page-fault exception and the kernel loads the necessary code at that point. In most cases, though, the loader will already have loaded some part of the program as an optimization to avoid that initial page fault.
(A #PF on a page that isn't mapped would result in the kernel delivering a SIGSEGV segfault signal to your process, but a "valid" page fault is handled silently by the kernel.)
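To show where that start address lives, here is a minimal sketch that pulls the entry point out of an ELF64 header held in memory. It assumes a little-endian 64-bit ELF file, where the e_entry field is the 8 bytes at offset 24 of the header. The function name elf64_entry is hypothetical, and the sketch does none of the validation a real loader performs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Extract e_entry from a little-endian ELF64 header held in memory.
 * Returns 0 on success, -1 if the buffer is not such a header. */
int elf64_entry(const unsigned char *buf, size_t len, uint64_t *entry)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (len < 64 || memcmp(buf, magic, 4) != 0)
        return -1;                      /* too short or wrong magic */
    if (buf[4] != 2 /* ELFCLASS64 */ || buf[5] != 1 /* ELFDATA2LSB */)
        return -1;                      /* not 64-bit little-endian */

    uint64_t value = 0;
    for (int i = 7; i >= 0; i--)        /* e_entry: 8 bytes at offset 24 */
        value = (value << 8) | buf[24 + i];

    *entry = value;
    return 0;
}
```

Feeding it the first 64 bytes of an executable gives the same value that readelf -h reports as the entry point address.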
All new processes usually get loaded at the same virtual address (ignoring PIE + ASLR). This is possible because we use the MMU (Memory Management Unit). That coprocessor translates memory addresses between virtual address spaces and physical address space.
(Editor's note: the MMU isn't really a coprocessor; in modern CPUs virtual memory logic is tightly integrated into each core, alongside the L1 instruction/data caches. Some ancient CPUs did use an external MMU chip, though.)
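As a sketch of what the translation hardware works with, the code below splits a virtual address into the table indices used by x86-64 paging. It assumes the common 4-level layout with 4 KiB pages (9 index bits per level plus a 12-bit offset), which the text above does not spell out. The struct va_parts type and the split_virtual_address function are hypothetical names.

```c
#include <stdint.h>

/* Index into each paging level, plus the byte offset inside the page. */
struct va_parts {
    unsigned pml4, pdpt, pd, pt, offset;
};

struct va_parts split_virtual_address(uint64_t va)
{
    struct va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);  /* bits 47:39 */
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);  /* bits 38:30 */
    p.pd     = (unsigned)((va >> 21) & 0x1FF);  /* bits 29:21 */
    p.pt     = (unsigned)((va >> 12) & 0x1FF);  /* bits 20:12 */
    p.offset = (unsigned)(va & 0xFFF);          /* bits 11:0  */
    return p;
}
```

For 0x400000 every index is 0 except pd, which is 2. Two processes can both use that address because each has its own set of tables, so the same indices lead to different physical pages.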
Determine the Address?
So, now we understand that all processes start at the same virtual address (0x400000 under Linux is the default chosen by ld). To determine the real physical address we use the MMU. How does the kernel decide on that physical address? Well, it has a memory allocation function. It's that simple.
It calls a "malloc()" type of function which searches for a memory block that is not currently in use and creates (a.k.a. loads) the process at that location. If no memory block is currently available, the kernel tries to swap something out of memory. If that fails, the creation of the process fails.
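The "search for an unused block" idea can be sketched with a toy frame allocator. This is deliberately naive and is not how Linux's page allocator is implemented. TOY_FRAMES, struct toy_allocator, toy_alloc_frame, and toy_free_frame are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

#define TOY_FRAMES 8   /* pretend physical memory has 8 page frames */

struct toy_allocator {
    bool used[TOY_FRAMES];
};

/* Return the index of a free frame and mark it used, or -1 if none is free. */
int toy_alloc_frame(struct toy_allocator *a)
{
    for (int i = 0; i < TOY_FRAMES; i++) {
        if (!a->used[i]) {
            a->used[i] = true;
            return i;
        }
    }
    return -1;   /* a real kernel would try to reclaim or swap out pages here */
}

/* Mark a frame as free again; out-of-range indices are ignored. */
void toy_free_frame(struct toy_allocator *a, int frame)
{
    if (frame >= 0 && frame < TOY_FRAMES)
        a->used[frame] = false;
}
```

The frame number that comes back is what would be written into the page-table entry for the process's virtual page. The process itself never sees it.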
When creating a process, the kernel allocates fairly large blocks of memory to start with. It is not unusual to allocate 1 MB or 2 MB buffers to start a new process. This makes things go a lot faster.
Also, if the process is already running and you start it again, a lot of the memory used by the already running instance can be reused. In that case the kernel does not allocate or load those parts. It uses the MMU to share the pages that can be made common to both instances of the process. In most cases the code part of the process can be shared since it is read-only, and some of the data can be shared when it is also marked read-only. Data that is not marked read-only can still be shared as long as it hasn't been modified yet; in that case it is marked as copy-on-write.
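The visible effect of copy-on-write can be observed from user space. The sketch below uses fork() rather than starting a second instance of a program, so it illustrates only the private-copy behaviour, not the sharing of pages between two separately launched instances. The variable counter and the function child_write_is_private are hypothetical names.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 42;   /* lives in a writable data page */

/* Returns 1 if the child's write stayed private to the child,
 * 0 if the parent saw it, -1 on error. */
int child_write_is_private(void)
{
    pid_t pid = fork();      /* parent and child now map the same pages */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        counter = 99;        /* first write: the child gets its own copy */
        _exit(counter == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return counter == 42;    /* the parent's page was never modified */
}
```

The program cannot tell whether the page was copied eagerly or lazily. It can only observe that the two tasks end up with independent values.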
Source: https://stackoverflow.com/questions/56854297/how-syscall-knows-where-to-jump