There is one thing about the unix process model that I have come to like a lot: the idea of a process hierarchy in which child processes inherit aspects of their parents. This eliminates the big complication of passing process "property" parameters when creating a new process. Since a new process is always fork()ed, optionally configured, and then a program exec()ed into it, there is a lot of power inherent in the configuration that can be done between the fork() and the exec(). What is this power that I am talking about? Firstly, the same system calls that change the behavior of a running process can also be used on a freshly created one. Secondly, and more importantly (in terms of flexibility achieved), each process runs in an environment especially tailored for its execution. Take the case of a chrooted process (it executes peacefully with a different directory as its root), or the case of environment variables, where you can govern the execution of several processes by running them under different environments.
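To make the fork(); reconfigure(); exec() pattern concrete, here is a minimal sketch in C. The helper name spawn_configured() and its parameters are made up for illustration; it only uses the standard fork(), chdir(), setenv(), execvp() and waitpid() calls, and it assumes that changing the working directory and one environment variable is all the configuration needed.

```c
#define _POSIX_C_SOURCE 200809L /* for setenv() under strict compiler modes */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with working directory `dir` and with
 * the environment variable `name` set to `value`.
 * Returns the child's exit status, or -1 on failure. */
int spawn_configured(const char *dir, const char *name, const char *value,
                     char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: reconfigure ourselves with ordinary system calls.
         * None of this is visible to the parent. */
        if (chdir(dir) != 0 || setenv(name, value, 1) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    /* Parent: wait for the child and report how it exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent's own working directory and environment are untouched after the call; only the child and whatever it goes on to start see the change.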
A very similar idea applies to per process filesystem namespaces. I think the first time I heard about them was in the context of Plan 9, and I kind of liked the power of the idea. Some useful things that could be done with per process filesystem namespaces would be:
- populating /dev/ with devices of the user's choice.
- userspace filesystem implementations.
- application specific tricks.
However, even though this concept has potential power and utility, at first glance I don't see it implemented in the free unices of today (my favorite is linux, and it seemed to be sadly missing there). Let me take that statement back, though: per process filesystem namespaces are present in Linux in a number of forms! None of these techniques, however, is as powerful as what is described in this document. Some of them are:
- chrooting (can only change the root, and can only be done by a privileged process).
- clone() with the CLONE_NEWNS flag (can only be done by a privileged process).
- suid mounts (different processes must use different mount points to avoid clashes).
In order to call a mechanism good enough to give us per process filesystem namespaces, we must confirm the following properties in the mechanism:
- child inherits parent namespace.
- the namespace should be alterable while the process is running (so that our lovely fork(); reconfigure(); exec() thing works.)
- the namespace altering should have no effect beyond the process subtree concerned.
Let me chalk out a possible implementation path. The linux kernel incorporates a virtual filesystem switch, or vfs layer. It basically keeps track of a set of mount points for the entire system and attaches a physical filesystem (physical in the sense of having a filesystem kernel module for it) to each of them. Any directory in a physical filesystem is a potential mount point. Once something is mounted on it, it is recorded in the list of mount points, and if it is encountered while a path is being parsed, the vfs switches to the appropriate physical filesystem to complete the operation. Since mounting a device on a mount point brings about a system wide change, it can only be done by root. Now consider an implementation in which this vfs layer is pulled up into libc. The mount points are declared in the environment of the process. During the first call to a vfs function, the vfs system is initialized, which basically involves setting up the mount tables. These tables enable libc to do the vfs switching. The actual operation is done on the mounted filesystem by libc (please note the alternate meaning of mounted here, as in a physical filesystem mounted to someplace), and the correct error codes are returned. The environment was chosen for keeping information about mount points because it is generally inherited completely by the child process. However, if parts of libc could be put in copy on write shared memory, this would have been easier. So, the idea is that the physical filesystems are mounted in, say, /fs. Then, using the environment, the /etc/fstab mappings are set up somewhere near the init process.
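The switching step itself can be sketched as a small path translation function. Everything specific here is an assumption made for illustration: the function name resolve_path(), and the mount list format "mountpoint=target:mountpoint=target" (in a real process the list would come from the environment). The longest matching mount point wins, much like a nested mount would.

```c
#include <stdio.h>
#include <string.h>

/* Translate `path` using a mount list of the (assumed) form
 * "/mnt=/fs/disk1:/mnt/usb=/fs/usb0". The longest matching mount point
 * wins. Paths that match no mount point are copied unchanged.
 * Limitation of this sketch: a mount point of "/" is not handled.
 * Returns 0 on success, -1 if the result does not fit in `out`. */
int resolve_path(const char *mounts, const char *path,
                 char *out, size_t outlen)
{
    const char *best_target = NULL;
    size_t best_mp_len = 0, best_target_len = 0;

    for (const char *entry = mounts; entry && *entry; ) {
        const char *end = strchr(entry, ':');
        size_t entry_len = end ? (size_t)(end - entry) : strlen(entry);
        const char *eq = memchr(entry, '=', entry_len);

        if (eq && eq != entry) {
            size_t mp_len = (size_t)(eq - entry);
            /* Match only on a whole path component, so that "/mnt"
             * does not capture "/mntx". */
            int matches = strncmp(path, entry, mp_len) == 0 &&
                          (path[mp_len] == '/' || path[mp_len] == '\0');
            if (matches && mp_len > best_mp_len) {
                best_mp_len = mp_len;
                best_target = eq + 1;
                best_target_len = entry_len - mp_len - 1;
            }
        }
        entry = end ? end + 1 : NULL;
    }

    int n;
    if (best_target)
        n = snprintf(out, outlen, "%.*s%s", (int)best_target_len,
                     best_target, path + best_mp_len);
    else
        n = snprintf(out, outlen, "%s", path);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

With the list "/mnt=/fs/disk1:/mnt/usb=/fs/usb0", the path /mnt/usb/a.txt is translated to /fs/usb0/a.txt, and /mnt/b to /fs/disk1/b. Because the list is just a string, a child can extend or replace it between fork() and exec() without affecting anyone else.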
Using a libc based vfs opens up the possibility of using userland filesystems implemented as shared libraries. The environment can again act as a nice way to provide libc with the name of the shared library to be loaded and the symbols to be linked. Such userland filesystem techniques require no trip to the kernel (well, besides accessing the raw disk for the data 🙂 ). Most userland filesystems have to go through a kernel module that does IPC with a single userland process (thus sacrificing concurrency), which might in turn make more kernel calls to fetch the data needed to service the userland filesystem request. Please note that it is still not possible to implement systemwide filesystems as a shared library, because such filesystems face access restrictions on interacting directly with the raw block device.
I also have a prototype implementation of per process filesystem namespaces for linux. The implementation is basically an interposing library for glibc which uses path name mangling to get the effect of per process filesystem namespaces. The information about the namespace is stored in the process environment in the variable USERFS_MOUNTS. I will probably furnish more documentation as the project proceeds to completion. Please contact me if you want the source code.
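For readers unfamiliar with interposing libraries, here is a rough sketch of the general technique for a single call, unlink(). This is not the prototype's code. The mangling rule is an assumption made for illustration (USERFS_MOUNTS holding a single "prefix=target" pair), and the helper name mangle() is made up. The real function is looked up with dlsym(RTLD_NEXT, ...), which is the usual glibc way to reach the next definition of a symbol.

```c
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <dlfcn.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical mangling rule: USERFS_MOUNTS holds one "prefix=target"
 * pair. Returns the path to really use, or NULL if it does not fit. */
static const char *mangle(const char *path, char *buf, size_t buflen)
{
    const char *map = getenv("USERFS_MOUNTS");
    const char *eq = map ? strchr(map, '=') : NULL;
    if (!eq || eq == map)
        return path; /* no usable mapping: leave the path alone */

    size_t plen = (size_t)(eq - map);
    if (strncmp(path, map, plen) != 0 ||
        (path[plen] != '/' && path[plen] != '\0'))
        return path; /* path is outside the mapped prefix */

    int n = snprintf(buf, buflen, "%s%s", eq + 1, path + plen);
    return (n >= 0 && (size_t)n < buflen) ? buf : NULL;
}

/* Interposed unlink(): rewrite the path, then call the real one. */
int unlink(const char *path)
{
    static int (*real_unlink)(const char *);
    char buf[PATH_MAX];

    if (!real_unlink)
        real_unlink = (int (*)(const char *))dlsym(RTLD_NEXT, "unlink");
    if (!real_unlink) {
        errno = ENOSYS;
        return -1;
    }

    const char *real_path = mangle(path, buf, sizeof buf);
    if (!real_path) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return real_unlink(real_path);
}
```

Built as a shared object (gcc -shared -fPIC, plus -ldl on older glibc versions) and loaded through LD_PRELOAD, a wrapper like this only affects programs that actually call unlink(); every other path-taking libc function needs its own wrapper.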
[Update] I tried putting in some more code (basically more than just the unlink system call) and found that there were just too many calls in the C library which take a file pathname. It proves to be a bit tedious to make an interposing function for *all* of them… I give up… (sigh). However, I still do feel that naming of the filesystem elements in the userspace and layout/access control in the kernel space is a good way to split things up.
[Update 2] I saw a thread on the linux-fsdevel mailing list (related to FUSE) which had slowly digressed into namespaces and the way they should be handled. I posted my approach to solving the problem completely in userspace, and the principal problem pointed out was backward compatibility with suid binaries that deal with user files. Suid binaries would have to use a default namespace, as the namespace of the user cannot be trusted; however, the user namespace would still be needed for naming all the files that the suid binary works on. Things start going out of the boundaries of an elegant solution here.
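The first half of that rule (do not trust the user's namespace in a suid binary) is easy to sketch; it is the second half that gets ugly. A minimal sketch, assuming the namespace lives in USERFS_MOUNTS as above, with a made-up function name trusted_mounts():

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical check: return the user-supplied mount list, or NULL if it
 * must not be trusted and the default namespace should be used instead. */
const char *trusted_mounts(void)
{
    /* Real and effective IDs differ when the program is running
     * set-user-ID or set-group-ID. */
    if (getuid() != geteuid() || getgid() != getegid())
        return NULL;
    return getenv("USERFS_MOUNTS");
}
```

This handles which namespace the binary itself runs in, but it does nothing for file names passed in by the user, which still refer to the user's namespace.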