Let's look at it once more…
There has been large debates over Operating System design principles. While there was a huge hype for microkernel based architectures in the 1990's in the OS research community, till now there has been little convincing evidence of microkernels solving the problems they were initially targetted at. There are quite some documents on the WWW which help one to get into the pros and cons of each approach. IMHO, Microkernel Debate [cliki] is one of the best in disambiguating the technical issues involved. There is also the famous Tanenbaum-Trovalds debate which helps in getting more political issues involved and also interesting in dispelling some of the hype microkernels are entitled to.
This document is not supposed to be a line by line comparison between microkernels and monolithic kernels. This document is an attempt to put down a flow of thoughts I had had in recent past about operating system design. It is still a 'flow' of thoughts as the conclusions that I draw here are not guaranteed to remain the same after sometime. However, being aware of the existence of this 'flow' and pondering and designing for a well predicted future is what I think is also very important. In other words the aim of this document is to hair-split the technical issues involved in OS design and eventually lead to a set of design principles to guide the design of an OS. The final design might not be completely monolithic or micro but should be a well flavored mixture of the two.
Let me start by describing my grapplings with OSes and the timeline of my thoughts as it is generally nice to listen to history than to technical drab straight :-). My first course in operating systems was during my undergraduate studies in which we were taught general OS principles referring to UNIX as a reference. It was made very clear in the course that context switching between processes was hugely expensive and should be avoided as much as possible. More importantly, this issue was used to pull up an argument against microkernels when they were discussed. Microkernels seemed more like a tradeoff than anything else then. Now tradeoffs are bad. They are very effective vehicles of proving validity of a piece of research work when actually some thorough analysis is also very necessary to put things in light. I approached my OS professor and discussed with him an operating system design in which context switch overhead could be reduced. I was basically referring to single address operating systems with safe language based mechanisms for ensuring memory safety. The design not only gave insights into ways the context switch overhead could be reduced but also building very flexible capability and access control into the system. I was very excited about the project and looked frantically on the internet to find similar projects so that I could learn from their experience. I found quite some. Safe language kernels were then also being explored by Cornell University, GMD and University of Washington. There had been work in similar direction in the past too with SPIN-OS being a prominent one that comes to my mind now.
All this was great going… but I suddenly (and somewhat painfully) realized that for some reason, I was not able to sell the idea of 'reducing context switch overhead' very well to people. Stability of the system, its generality (in terms of executing programs written in non-safe languages), depending on a programming language or paradigm (the object paradigm) for the entire OS structure, unfeasibility of having a C based POSIX interface for the OS were some of the arguments against it. I was obviously hitting on performance again and again but for some reason 'context switch overhead (cache and TLB flushes)' was not giving me the force I needed to convince people. So I decided to explore more and very recently came across three independent events which changed my outlook to OS design forever.
Event 1: I believed that system call overhead was large and hence it made a lot of sense to do things in userspace as much as possible. For example, one could gain in computation power by using user level threads over kernel threads while doing compute heavy jobs. I knew that Linux had a system call interface which was based on an software interrupt invocation. This was hugely expensive on pentium 4 which caused Linux kernel developers to adopt the 'sysenter' path for a system call. I realized by a jolt them that processors could internally be very different on performance scales even if we are dealing with processors of the same ISA. I faintly realized the possibility of the hardware designers changing some design somewhere which would basically nullify all performance claims previous OS designs have made.
Event 2: I had come across Scheduler Activations a long time ago in my undergraduate studies and had marveled the approach. Very recently while discussing symmetric multi-threading with somebody, schedular activations reappeared in the discussion. Later in the day, I also remembered the recent effort in improving Linux threading and knew that the NPTL designers would have given scheduler activations a look. A quick search on the internet revealed an interesting discovery. NPTL used 1:1 kernel threads for user space threads! They showed that with the recent advances in the Linux scheduler and efficiency and design of new hardware, it was possible to get unprecedented performance with 1:1 kernel threads. This reminded me of "Event 1" and I become very convinced about hardware of the day dictating OS design.
Event 3: I was browsing through some documentation and papers on L4 when I came across this paper called "Small Spaces: Performace of Address space multiplexing on Pentium". This paper discussed the concept of an ASID (Address Space ID) found in modern hardware like the PowerPC which could avoid a lot of cache flushes. This had the potential of breaking the foothold of the "address space multiplexing is expensive" argument (which as I had earlier mentioned was the cause of me focussing on language based safety in Operating Systems). The paper also gave an overview of a technique using segmentation on Pentium, for doing the same thing because Pentium didn't have ASIDs. This could potentially allow protection based on a hardware level to be used with almost the same efficiency as a single address space OS and simultaneously providing the generality of the programming language/paradigm. I realized the importance of the role of hardware trends in OS performance and also the importance of abstracting out the complexities of the hardware at various levels in the operating system.
In fact, I also realized that OS designers are generally ripped apart between the hardware interface and the software programming interface. With the hardware interface changing so dramatically it becomes more and more difficult to give a consistent and efficient programming interface to the programmers. However, it is important to note that there two interfaces have two different forces driving the dynamics therein. For the hardware interface, it is basically the need to do things faster. However, in the software interface compatibility is the holy grail. Programmers will not mind if there is a programming API of their choice supported in your OS irrespective of how compatible is it with the OS architecture in general. Thus, I see the OS sitting in the middle of the hardware and the programmer interface with multiple programmer interfaces being available to the programmer to code in. What matters is the consistency and the stability of the interface. Of course the well thought implementation of a programming interface on an OS will always be a nice problem to solve. Thus the need for the OS to be more flexible in terms of what's possible on the hardware and least restrictive about the demands of the programming environment abstractions.
Hardware trends have a lot of impact on the general architecture of an OS. Since the microkernel design aspect of a kernel is completely internal to the OS, its performance is a great metric of its choice along with other benefits like modularity or maintainability. However, all such benefits should be clearly evaluated before a design design is made on their basis. In this document we would like to dissect the operation of kernel calls in design choices and evaluate them on the basis of performance, maintainability and modularity.
Let us first touch on modularity. Modularity is the aspect of software which enables the developer or the user to assemble complex working systems using simpler components. For a developer, the issue is more about code modularity. The developer would like to have the OS code in a way which is easily maintainable by him. This would mean that the code for OS should be divided into loosely coupled sections. A microkernel architecture would force independent parts of the OS to different address spaces and would force an IPC protocol for all communication. The advantage of using this approach is that the native code that runs for the particular independent section of the OS now has absolute control over data structure layout, program execution runtime and to some extent also the compiler ABI. Thus, almost any working assembly code can run for the concerned section of the OS. However, this will not be the case for a monolithic kernel architecture. In such an architecture, the execution runtime of the implementation language is common throughout the kernel. Hence, the compiler ABI and data structure layout has to be common for the entire OS (or at least the part which is monolithic). This might sound as a big restriction where actually it is not. The compiler manages the entire runtime very efficiently for a program. The only bottleneck is to understand this runtime so that certain parts of the OS which require fiddling with it (process dispatch, memory bootstrap) are handled with knowledge. On the other hand, using a common execution runtime for the entire OS has huge gains in terms of performance. There is no need for a "context switch" when different parts of an OS are communicating with each other. More importantly, the execution runtime of the language forms the standard communication medium for OS services. The benefit of this communication medium vs any other communication medium (like a hand crafted IPC mechanism, or a shared memory/data structure mechanism) is of performance. A compiler is smart at organizing instructions for the underlying hardware and interacting it with data structures in memory. Also more importantly, it is bewilderingly more capable for handling such detailed but effective optimizations for a huge chuck of code. However, these virtues are not available without a price. The runtime should be really robust at handling programming errors. I would like to point out here that even though 'C' is the language of choice when it comes to OS implementation (generally because of the flexibility of the language and its efficiency) it is not powerful enough. 'C's runtime is not robust to failures. It doesn't have provisions for link time failures, neither is the type checking robust. One might argue that avoiding heavy use of pointers in 'C' can make software written with it more robust. However, without pointers quite some power of the language is lost. What one needs in a language for an OS (or in fact any other software) is a language which is able to guarantee safety along with expressiveness. Another comment about language runtimes in general that I would like to make is that all of them are targeted towards a single application or a program. An execution runtime gives a complete environment to a running program. Such an environment as in the case of an OS should be able to use the hardware of the day effectively for all its tasks. I believe that the runtime of a language should itself be powerful and customizable for target hardware.
Talking of modularity, there is another form of modularity in terms of what the user interface to an OS is. For example, device drivers or services of an OS are said to be modular if these can be dynamically loaded into the running OS. Dynamic incremental feature additions and removal make a system more amiable to change while it is running. Such an arrangement requires more effort than the 'code modularity' we discussed previously. Not only would one require the code to be modular enough (such that a thread of execution doesn't traverse the unloaded code), it would also now require mechanisms to dynamically load code into a running system and link it to the existing code. Again, the microkernel approach gives us a useful insight here. The microkernel abstracts away from the runtime and independently handles loading, execution management, communication, errors and cleanup of the dynamically loaded code. It has the advantage of giving the code writer complete freedom to choose his data structure layout and runtime ABI. However, this comes at the cost of performance. In fact the way an OS performs execution of a program on the demand of the user can be thought of as a complex mixture of dynamically loading code, linking it (to its shared libraries) and starting its execution as another thread/process. This is similar in functionality to dynamically loading code in programs with the exception that an OS handles actual timesharing and scheduling of the processes and also allows protection between code domains. On the other hand, implementation of this sense of modularity in a monolithic kernel is simple. Of course all modules and the "nugget" code require to share the same runtime system.
Let us now try to look at a system call invocation in a microkernel system in detail. Let us assume that we have a filesystem server and a user program is trying to write to a file. In practice it is common to find a disk server which is called by the filesystem server for doing the physical disk write. The system call from the user program will cause a library routine in the user space to be invoked which detects the filesystem server and makes an IPC call to it. Let us assume that the filesystem server is sleeping. In the IPC call, the machine state of the called thread is saved and the calling thread is enqueued into the wait queue. Next, the filesystem server thread is dequeued from the wait queue and it resumes execution after restoring the machine state. Unless there is a sophisticated shared memory mechanism for passing IPC parameters, they are copied by the IPC mechanism to the filesystem user space. The filesystem server uses this information to issue a disk write. Now, the differences between this and a conventional functional call for the specific operation are the following:
- There are two stacks involved and hence the killing of one process has no effect on the other process. The user process will get an error in the IPC call if the filesystem server just crashes. This will be unlike a monolithic system where there is basically one stack to record the history of the entire path of execution. In such a case, a server crash is fatal because it crashes the user program with it. However, we should note that two separate execution threads are never needed for the user program and the filesystem code to run. The user program thread sleeps when the filesystem server is running on its behalf. (Effectively, the server thread also sleeps (as in not carrying out the task for the user program thread) when the user program is running.) The issue of parallelism is completely orthogonal to dividing the task at hand into threads and has more to deal with asynchronous vs synchronous modes of operation. Hence effectively for the complete duration of the system call, there was only one "thing" being done at a time… only a single conceptual thread running. Threads were used as a mechanism for error recovery/protection only. We claim that the overhead of a complete thread/process context switch is not required for this.
- copying of the parameters is completely redundant. Using a shared memory mechanism for sharing the parameter memory usage is as unsafe as a sharing the address space of the two servers. Also, the marshaling and unmarshaling of the parameters is duplicating the effort spent in the compiler for implementing calling conventions.
- Microkernel architecture is claimed to be more robust because of the fact that code running in different processes and communicating using IPC is easier to administer. Killing a process viewed from another perspective is simply removing code from the system with good error recovery from the rest of the code in the system. All processes waiting for a reply from the killed process get an error right in the IPC call they made.
- Processes/threads are also the mechanism to provide parallelism. Thus two threads running together are timesliced by the operating system so that they look as if they are running in parallel. However, when combined with synchronous IPC, they loose this parallelism. Synchronous IPC causes another process to run while the IPC invoking process waits. Asynchronous maintains the true spirit of parallelism in the system. Of course threads of execution wait on critical sections so that they don't corrupt say a shared datastructure. However, process synchronizations mechanism should be as asynchronous as possible. We believe that along with threads, critical sections and asynchronous notifications, we should add extra features to the runtime which would add the missing features to the runtime… basically a substitute for synchronous IPC. A process though a very basic concept in operating systems is central because it is overloaded with features. Lets dissect the functions of a process in an operating system.
- A process is a unit of parallelism.
- A process is an instance of code running with process specific data.
- A process encapsulates the unit for system error containment.
- A process is a unit of access control
- A process is a unit of execution history of the thread of execution.*the above assumes that there is only one thread per process.
We would ofcourse like a more flexible system. One in which the process is not that overloaded. Because the process is so overleaded, trying to do anything flexible gets a performance hit. Almost any degree of flexibility requires creation of more processes increasing the overhead in the system. We would like to keep the process as the unit of parallelism. This is because any two independent execution threads have independent execution histories. This every process has a unique stack for itself. Switching between processes would basically require us switch between the stacks. Our new concept of process has nothing more to it beyond this besides perhaps a process identifier. Then of course we have the concept of a critical section which is basically code enclosed between mutex lock and unlock calls. There is an interesting trick we can use here to get better performance from the system. The context switch code could be optimized so that it is not timer based but voluntarily done by the code of the process itself. This would be more like cooperative scheduling except that he compiler will have a pretty heavy role to play here. By suitably placing the yield points at various locations in the code we can save the register saving overhead and also implement critical sections very efficiently. However, this has the disadvantage of disabling context switching all together whenever we are in any critical section.
The unit of fault containment is yet another unit say called a module. A module can be loaded into the system at runtime and can also be unloaded from the system with full error recovery. However, there is no process/thread associated with it when it is loaded into the system. Of course the system can initiate an execution path into the code and allow (by changing certain global data strcutures in the system or sending notifications to other threads) other threads to enter the code for the newly loaded module. However, the important and interesting part is error recovery. How do we unload a module? You could have more than one processes executing the module's code during the time we called for unloading the module. They key to the problem is to mimic the thing we do in traditional process oriented OSes. Killing a process there would halt that thread and send all waiting processes an error. So in our case, we just have to ensure that the threads that are executing in the module return back to callee of the entry point in the module and return a 'module unloaded' error there. This would be similar to the error returned in an IPC call which says 'the process was unexpectedly terminated'. However, this was easy in the case of processes as there were different execution history stacks maintained for the two code bases. Removing one meant, waking the thread associated with the other stack with some kind of a signal. This thread would be able to resume normally as its execution history beyond its code base was recorded (on the behalf of the other thread) in a separate stack. This is not possible in our case as we have the same stack through out. However, we can still inspect the stack and go to the stack frame which represents the first instance of an entry point into our module. We basically have to discard the entire stack from its top till this point and return from the stack point to the callee with an error. Of course if the start of the thread itself was in the module being unloaded, the thread will have to be removed too. However, we take care of this special case by having all threads go through a special module (called the thread start module) which is never unloaded from the system. This module will take care of thread starting code and it redirection into the relevant places.
What is the advantage of doing the above? Well, now an IPC call into another codebase (module) becomes a simple (or somewhat more complicated if we do some book-keeping for stack inspection) function call. We could have servers which have entry points instead of a pool of threads trying to cater to a large number of clients. Any other thread is free to come into the module, execute some code and return back. This is similar in picture to the client doing an IPC call, the server process delegating the request to a new thread while keeping the initial thread for getting more requests. In our model, the client thread itself serves as the thread of execution of the delegated task. This avoids a lot of overhead. It also helps us think about single threaded applications which have modules in them and require robust loading and unloading. We have till now not given a way for our processes to suspend or halt their execution (except when its waiting on a critical section). A process can call wait(locks, queue) to release some locks and wait on a queue (the operations need to be atomic to avoid the lost wakeup problem). A notify(queue) by another thread can cause the thread waiting in the queue to start executing (after reacquiring the locks). notify(queue) is asynchronous and causes both threads to run after it is called. It is also possible to make this async process synchronization mechanism to be a communication mechanism by sending an argument in notify() which will be received by wait.
What should be the unit of access control? We again, according to the trend, don't overload the existing concepts but solve this orthogonal problem using an orthogonal primitive: capabilities. A complete capability based system is more flexible than a system with thread specific credentials. Using capabilities we would be able to keep capabilities with resident module's data and use it for any thread going through it.
We have been talking about having a common runtime for all code in the system. However, till now a runtime is specifically tied to a programming language. Now having a common runtime for the entire OS would be an overkill because now all programmers programming for that operating system would be forced to use that programming language. However, there is a small caveat in this potential disadvantage. Consider an existing operating system like UNIX. There are a number of things about UNIX which make it very much tied to a particular language (in this case 'C'). Though all the system calls are implemented as either using a software interrupt or a facility like 'sysenter', this implementation detail about the kernel is hidden from the programmers using a library… on UNIX the standard library is libc implemented in C. Thus any other language developer has to resort to C based function calls to do anything with the system. One might say that this argument is weak because it is possible to have a similar standard library for the kernel calls in any language developed in a manner similar to libc. The caveat is that not all programming models support the notions supported by the kernel. For eg. it is questionable how a functional language would develop this library. It might be possible that the operating system model for a purely functional language based system might be different. There is yet another thing peculiar about UNIX. UNIX supports the notion of callbacks into the process as signals. The signaling API uses a C function pointer to register the callback. Does this mean that the UNIX system is optimized for the C programming model? Is it general enough to support a callback into a different language? If this implementation detail is delegated to libc and completely abstracted away from the UNIX kernel then it would be possible to design signal handling mechanisms in different languages without resorting to an intermediary C runtime.
So given the fact that it might be possible to design (or as per the above paragraph even reuse an existing) operating system kernel for an operating system which is independent of a particular language runtime, we now have the possibility of supporting a common language runtime over it for all the languages implemented over it. However this seems like a step back. An operating system kernel independent of the language runtime has its merits of supporting different programming models efficiently (under the constraints of the overall kernel design and abstractions supported). A common language would rigidly enforce a programming model taking away this flexibility from us. This brings us to a similar debate between the programmer convenience in API design (the actual programming language that we need to support in this case) and the performance characteristics of the machine and hence the kinds of abstractions that it can efficiently support (the common runtime mainly aimed at improving efficiency of the various tasks in the OS like process communication, reducing context switch overhead by reducing the number of threads and safe loading and unloading of code). We again solve this in the same way the similar problem is solved. That is, by implementing a generic kernel underneath and implementing the desired programming language API over this generic kernel abstraction. However, what should be our guiding principle in designing these abstractions? Well the same as for our 'similar' problem. The hardware characteristics determining the abstractions determining the things we can support in the common language runtime and the actual language runtime supported over this common runtime as efficiently as possible. As in our previous problem, the hardest part of the problem would be to find out how to implement an efficient language runtime over the common runtime.
That said, the distinction between the common runtime and the OS is actually slightly blurred. What is it now? Well basically it now becomes a standard way of communication between code (written even in different languages) in the operating system. No more clunky inflexible "every thing is a process" model. This also requires a tight integration between the actual kernel and the common runtime. We will take the kernel to be that part of the operating system which runs in kernel mode. Our common runtime would run in the user mode. Ideally we would like to have the modules address space protected. This is not possible because execution of a single thread can cross module boundaries and we have a common stack. This means that we either resort to a single address space architecture, or we use shared memory for modules which interact with each other and put the stack in shared memory too. [Obviously this needs more thinking…]