With the upcoming multi/many-core processors, there is an industry-wide push towards programming for task parallelism. Without it, programs aren’t getting any faster. There are high-level programming constructs such as OpenMP and ConcRT which can be used at the programming-language level to exploit task parallelism. However, one often forgets that every task runs within a context (usually a thread context) and switching between contexts (for synchronization or fair scheduling) is costly. It is generally recommended that one expose as much parallelism in a program as possible and let a runtime schedule it on the cores available on the system. Of course, with that comes context-switching cost, until dividing work into ever finer parallel chunks starts to hurt performance because the context-switch overhead dominates.
At the hardware level, task parallelism allows the processor to schedule instructions from multiple streams together on its functional units. It can do so because the individual tasks are supposed to be independent instruction streams. There is a caveat, however: these streams often carry a large amount of state which the hardware manages independently and which the system software has to save/restore on a task switch. While very fine-grained task parallelism seems like an elegant goal for software in general, beyond a certain granularity of parallelism, task-switch overheads can spoil the entire show. [A task switch in (system) software involves saving and restoring the entire register state, including vector registers, which can be a large amount of task state.]
Fortunately, there has been some work (in computer architecture) to deal with this problem. One of the nicest examples I found was the Tera Computer system. Its architecture is detailed in this paper [PDF]. There are two notable features of the Tera Computer (later the Cray MTA) which enabled low overhead task switching:
- A large number of hardware thread contexts. The Tera Computer processors had a 21-stage pipeline, but a single processor could store the contexts of 128 threads in total. Thus, it was very effective at hiding instruction and memory latencies.
- The Tera Computer supported hardware based thread synchronization primitives. You could make a (hardware) thread wait on the state of a memory word which you could update from another thread, thus “waking” up this thread. There was no need for system software to do any context save/restore as the thread context was always present in hardware.
However, the above technique seems to have a limitation. If my program has more threads than the 128 hardware contexts, then I need to resort to context save/restore. If it has fewer (but enough to keep the 21-stage pipeline busy), I am wasting hardware resources (register storage).
I thought about this for a while and it occurred to me that register-based architectures are perhaps not suitable for MIMD (task-parallel) processors. I know this is a bold claim, but I feel that resorting to a memory-operand architecture (where an instruction’s operands are memory addresses) may eliminate context-switch overhead altogether. Note that the working set of a hardware thread is probably cached in the processor’s L1 cache and is available about as fast as register storage itself*. However, having three 64-bit memory addresses in every instruction doesn’t seem like a good idea: it increases program size and takes up valuable instruction memory bandwidth. What I propose instead is to do something similar to the register stack engine in Itanium or register windows in SPARC. However, instead of using bona fide registers, the instruction operands are small offsets from a per-thread stack pointer. Using offsets allows register-like compressed instruction encoding while not deviating from the memory-operand instruction model. The only context that software needs to save on a task switch is the stack pointer, which is fine as it’s just a single register. Thus task switching simply involves saving/restoring the stack pointer and resuming execution by popping the return address off the stack. With this CPU architecture, it’s possible to do lightweight context switching without any hardware assistance (of the kind the Tera Computer relied on).
But who am I kidding… I am going to see only x64 and ARM for the rest of my life :/.
Update: When I was discussing this with a colleague, an interesting question came up: how do we integrate thread wake-up from software and hardware events with the above scheme? At first I thought just having a trap for the hardware events would be okay… however, hardware events may require much lower service latency and special processing than software events to get good performance (e.g. wake-up after an L1 cache miss). So I finally settled on the following: the architecture keeps separate lists of ready-to-run contexts (stack pointers) for hardware and software events and services the hardware ones preferentially. While notifications for software wake-up events could be done in software, the hardware would need to handle hardware wake-up events itself, waking the corresponding threads and resuming the halted instruction. Thus, it is also important to keep this list in pinned L1 cache. There also needs to be a way for the hardware to let the software know when a previously hardware-interrupted thread is ready to run… a trap could work here as long as we guarantee that it won’t cause any further trap invocations. Another simpler and more performant mechanism could be to just pre-empt the currently running thread with the hardware-woken one and maintain a stack of contexts so we can switch back to the pre-empted ones once the current one arrives at a hardware/software synchronization point.
One may also think of using low-overhead task parallelism to cleanly emulate cache hierarchies, cache coherency and many other kinds of system-specific processor facilities (reminds me of the PALcode in the DEC Alpha).
* The fact that the Itanium architects went with a backing store for their register stack (and not direct memory operands) makes me a little skeptical that my idea can be implemented in a performant way on real hardware. However, given a large number of switchable contexts, an occasional event of high latency to L1 shouldn’t be a real problem.