Linux - understanding the mount namespace & clone CLONE_NEWNS flag

青春壹個敷衍的年華 submitted on 2019-11-30 03:30:46
Emmet

The “mount namespace” of a process is just the set of mounted filesystems that it sees. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, you must decide what to do when creating a child process with clone().
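To make "the set of mounted filesystems that it sees" concrete, here is a minimal sketch in C that lists the mounts visible to the calling process. It reads /proc/self/mounts with glibc's getmntent(), so it assumes /proc is mounted. The function name print_visible_mounts is made up for illustration.

```c
#include <mntent.h>
#include <stdio.h>

/* Prints every mount visible to the calling process and returns how many
 * there are, or -1 if /proc/self/mounts cannot be opened. */
int print_visible_mounts(void)
{
    /* /proc/self/mounts describes the caller's own mount namespace. */
    FILE *fp = setmntent("/proc/self/mounts", "r");
    if (fp == NULL)
        return -1;

    int count = 0;
    struct mntent *m;
    while ((m = getmntent(fp)) != NULL) {
        printf("%s on %s type %s\n", m->mnt_fsname, m->mnt_dir, m->mnt_type);
        count++;
    }

    endmntent(fp);
    return count;
}
```

Two processes in the same mount namespace should get the same list from this function; a process in a different mount namespace may get a different one.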

Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes: there was one global mount namespace, seen by all processes, and if any change was made (e.g. using the mount command) all processes would immediately see that change irrespective of their relationship to the mount command.

With per-process mount namespaces, a child process can now have a different mount namespace to its parent. The question now arises:

Should changes to the mount namespace made by the child propagate back to the parent?

Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).

Equally clearly, it must also be possible to suppress this propagation. Otherwise we could never create a child process whose mount namespace differs from its parent's, and we would have one global mount namespace again (the filesystem as seen by init).

Thus, when creating a child process with clone(), we must decide between two options. Either the child gets its own copy of the parent's data about mounted filesystems, which it can change without affecting the parent, or it gets a pointer to the same data structures as the parent, so that its changes propagate back (as they must when you launch mount from the shell).

If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise, it gets a pointer to the parent's mount data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).
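The difference can be observed from user space. The following is a rough sketch, not a definitive implementation: it clones a child with or without CLONE_NEWNS and compares the namespace identifier behind /proc/self/ns/mnt in parent and child. The helper names (self_mnt_ns, child_fn, child_shares_mnt_ns) are hypothetical, and the sketch assumes a Linux system with /proc mounted. Passing CLONE_NEWNS normally requires CAP_SYS_ADMIN, so without it the helper just reports an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Store this process's mount-namespace id (e.g. "mnt:[4026531840]") in buf. */
static int self_mnt_ns(char *buf, size_t len)
{
    assert(len > 1);
    ssize_t n = readlink("/proc/self/ns/mnt", buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Child entry point: arg is the parent's namespace id.
 * Exit status 0 = same namespace, 1 = different, 2 = could not tell. */
static int child_fn(void *arg)
{
    char mine[64];
    if (self_mnt_ns(mine, sizeof mine) != 0)
        return 2;
    return strcmp(mine, (const char *)arg) == 0 ? 0 : 1;
}

/* Hypothetical helper: returns 1 if a child cloned with extra_flags shares
 * the caller's mount namespace, 0 if it has its own, -1 on error
 * (for example EPERM when CLONE_NEWNS is used without CAP_SYS_ADMIN). */
int child_shares_mnt_ns(int extra_flags)
{
    char parent_ns[64];
    if (self_mnt_ns(parent_ns, sizeof parent_ns) != 0)
        return -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Pass the top of the buffer: the stack grows downward on most
     * architectures. SIGCHLD lets the parent reap the child with waitpid(). */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, extra_flags | SIGCHLD, parent_ns);
    if (pid < 0) {
        int saved = errno;
        free(stack);
        errno = saved;
        return -1;
    }

    int status = 0;
    pid_t w = waitpid(pid, &status, 0);
    free(stack);
    if (w < 0 || !WIFEXITED(status) || WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Called from a main() with sufficient privileges, child_shares_mnt_ns(0) should return 1 (same namespace as the parent) and child_shares_mnt_ns(CLONE_NEWNS) should return 0 (a separate copy).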

Now if I use clone with CLONE_NEWNS to create a child process, does this mean that the child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree?

Yes. It sees the exact same tree as its parent after the call to clone().

Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?

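Yes. Since you've used CLONE_NEWNS, the child can unmount one device from 5 and mount another device there, and only it (and its children) can see the changes. No other process sees the changes made by the child in this case, provided the mounts involved use private propagation; mounts marked shared still propagate mount events between namespaces.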

If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?

No. If you've used CLONE_NEWNS, the changes made in the child cannot propagate back to the parent.

If you haven't used CLONE_NEWNS, the child would have received a pointer to the same mount namespace data as its parent, and any changes made by the child would be seen by any process that shares those data structures, including the parent. (This is also the case when the new child is created using fork().)
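The answers above can be checked with a small experiment. This is only a sketch under stated assumptions: the helper name parent_sees_child_mount is hypothetical, and instead of clone() it uses fork() followed by unshare(CLONE_NEWNS) in the child, which has the same effect on the child's mount namespace. It needs CAP_SYS_ADMIN (typically root). It also marks every mount in the child's namespace private first, because on systems where the root mount uses shared propagation (common with systemd) a new mount would otherwise still propagate to the parent.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. A child enters its own mount namespace, mounts a
 * tmpfs on dir and creates dir/marker inside it. Returns 1 if the parent
 * can see the marker afterwards, 0 if it cannot, -1 on error (typically
 * EPERM when run without CAP_SYS_ADMIN). dir must be an existing directory. */
int parent_sees_child_mount(const char *dir)
{
    assert(dir != NULL);
    char marker[PATH_MAX];
    int len = snprintf(marker, sizeof marker, "%s/marker", dir);
    if (len < 0 || (size_t)len >= sizeof marker)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Same effect on this process as being cloned with CLONE_NEWNS. */
        if (unshare(CLONE_NEWNS) != 0)
            _exit(1);
        /* Stop mount events from propagating back through shared mounts. */
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
            _exit(1);
        /* This mount exists only in the child's namespace. */
        if (mount("tmpfs", dir, "tmpfs", 0, NULL) != 0)
            _exit(1);
        int fd = open(marker, O_CREAT | O_WRONLY, 0600);
        if (fd < 0)
            _exit(1);
        close(fd);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* The parent looks at the same path in its own namespace. */
    return access(marker, F_OK) == 0;
}
```

Run as root on a directory such as /mnt, this should return 0: the tmpfs and the marker file exist only in the child's namespace. If the unshare() call is removed, the mount lands in the shared namespace and the parent sees the marker.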

I don't have enough reputation points to add a comment, so I am adding this as an answer instead. It's just an add-on to Emmet's answer.

AFAICU, if a process created with the CLONE_NEWNS flag is also running outside the initial user namespace, it can only mount those filesystems that have the FS_USERNS_MOUNT flag set. Almost all disk-based filesystems do not set this flag (for security reasons). In do_new_mount, there is this check:

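        /* Callers outside the initial user namespace may only mount
         * filesystem types that are flagged FS_USERNS_MOUNT. */
        if (user_ns != &init_user_ns) {
            if (!(type->fs_flags & FS_USERNS_MOUNT)) {
                put_filesystem(type);
                return -EPERM;
            }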

Please correct me if I am wrong.
