Threadpools and async IO are all the rage these days for writing highly concurrent servers. As far as I can remember this is a relatively new development; a few years ago, highly concurrent servers were written using async IO (without threadpools… there weren’t so many cores then 🙂 ) or using multi-threading. What changed? Are servers designed using async / threadpools better than multi-threaded ones? Why?
Multi-threading uses OS threads to schedule a bunch of concurrent tasks. The OS saves and restores the entire processor state when a core switches the currently executing thread. That state can be large on 64-bit processors: 16 64-bit general purpose registers plus 16 128-bit SSE registers = 16 × (8 + 16) bytes = 384 bytes, without counting the segment, debug and control registers. There is also the issue of switching to a new stack, which may be at a different location in memory and may be cache cold. So if threads context switch very frequently (say, every few instructions), the context switch overhead can be disproportionately high.
Async / threadpools help in this scenario by keeping the same OS thread on a core (so the top of the stack stays hot in the cache) and just calling into different tasks (essentially userspace functions or methods) pulled from a queue. Switching between tasks doesn’t require saving any processor state, because each function saves only as much register state as it actually uses as it moves through its code path. This favors small tasks: little processor state is saved/restored and the top of the stack always remains hot.
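To make this concrete, here is a minimal sketch of such a scheduler in Python (the `ThreadPool` class and its method names are my own for illustration, not from any particular library): worker OS threads stay parked on their cores and switching between tasks is just an ordinary function call off a shared queue.

```python
import threading
import queue

class ThreadPool:
    """Minimal async/threadpool sketch: a fixed set of OS threads pulls
    tasks (plain callables) off a shared queue. Moving from one task to
    the next is just a function return followed by a function call, so
    no OS context switch and no processor-state save/restore is needed."""

    def __init__(self, num_workers):
        self.tasks = queue.Queue()
        for _ in range(num_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, fn, *args):
        """Enqueue a task; some worker will call it eventually."""
        self.tasks.put((fn, args))

    def _run(self):
        while True:
            fn, args = self.tasks.get()
            fn(*args)               # "task switch" = ordinary call
            self.tasks.task_done()

    def join(self):
        """Block until all queued tasks have finished."""
        self.tasks.join()
```

A caller just submits small callables and never touches threads directly; the number of OS threads stays fixed at `num_workers` no matter how many tasks flow through.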
However, async / threadpools have a significant drawback. If the programmer fails to split the work into tasks of roughly even resource consumption (some tasks taking a disproportionately large number of CPU cycles), the async / threadpool system adds significant latency to small tasks queued up behind large ones. This problem doesn’t arise with threading, because the kernel preempts long running OS threads to give CPU time fairly to the other threads in the system.
The situation can be ameliorated by making the number of threads in the threadpool somewhat larger than the number of cores. This allows some small tasks to be queued (and subsequently executed with only a minor hit in latency) on threads which don’t have large tasks. However, it doesn’t solve the problem completely. You can never be sure how many threads you should have in the threadpool, simply because in most cases you don’t know how many long running tasks the system will see. Furthermore, if you have too many threads in the threadpool, you will end up context switching between them needlessly.
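Both the head-of-line blocking and the extra-threads workaround are easy to demonstrate with Python’s standard `concurrent.futures` pool (the 0.5-second `long_task` here is just a stand-in for a CPU-heavy task):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def long_task():
    time.sleep(0.5)        # stands in for a CPU-heavy task

def small_task():
    return time.monotonic()

# With one worker, the small task queues behind the long one.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(long_task)
    done_at = pool.submit(small_task).result()
blocked_latency = done_at - start   # roughly 0.5 s: head-of-line blocking

# With one extra worker, the small task runs immediately.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(long_task)
    done_at = pool.submit(small_task).result()
free_latency = done_at - start      # roughly 0 s
```

The second pool fixes this particular workload only because there happens to be exactly one long task in flight; with two long tasks the small one is stuck again, which is exactly the sizing problem described above.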
A better solution is to have the programmer break long running tasks into smaller ones and schedule each subsequent task at the completion of the preceding one. This allows for some amount of fairness in the task queue. However, this amounts to asking the programmer to solve the same problem for every task they create, instead of having a single proven-to-work solution such as threading take care of it.
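A sketch of this chunk-and-reschedule pattern (the names `long_job_chunked`, `run_all` and the chunk size are illustrative, not from any library): each slice of the long job re-enqueues a continuation for the rest, so tasks submitted in the meantime get a turn between slices.

```python
import queue

tasks = queue.Queue()
log = []

def run_all():
    """Single-threaded driver: run queued callables until none remain."""
    while not tasks.empty():
        tasks.get()()

def long_job_chunked(start, stop, step=1000):
    """Process items [start, stop) one slice at a time; after each slice,
    re-enqueue a continuation instead of hogging the worker."""
    end = min(start + step, stop)
    for i in range(start, end):
        pass                       # stand-in for real per-item work
    log.append(("chunk", start, end))
    if end < stop:
        tasks.put(lambda: long_job_chunked(end, stop, step))

tasks.put(lambda: long_job_chunked(0, 2500))
tasks.put(lambda: log.append("small task"))
run_all()
# The small task runs after the first 1000-item slice,
# not after all 2500 items.
```

The fairness bound is only as good as the chunk size the programmer picks for every long job, which is precisely the per-task burden complained about above.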
Is there a good solution to this problem? How about the following:
We bring preemptive scheduling up from the kernel into userland. Essentially, the threadpool has a timer on each of its threads / cores; if the timer fires before the current task is finished, the task is suspended (which requires saving processor state) and is enqueued at the end of the thread’s task queue. Windows supports the Get/SetThreadContext() APIs, which can help with this without creating new threads. Otherwise, one can always create new threads to resume the queued up tasks while keeping the current thread (the one with the long task) suspended. Threads freed up once preempted tasks complete can be reused to satisfy later thread creations.
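A portable approximation of the “create new threads” variant can be sketched in Python (true suspension and register capture via Get/SetThreadContext is Windows-specific and not shown; `PreemptivePool` and `TIME_SLICE` are hypothetical names I’ve chosen): a per-task watchdog timer fires if the task overruns its slice and spins up another worker so the queue keeps draining.

```python
import threading
import queue

TIME_SLICE = 0.1   # illustrative userspace "quantum", in seconds

class PreemptivePool:
    """Sketch of userspace preemption via extra threads: if the current
    task overruns TIME_SLICE, a watchdog timer adds another worker to
    drain the queue while the long task keeps running on its old thread.
    (With Get/SetThreadContext one could instead suspend the long task
    and re-enqueue it; that OS-specific path is omitted here.)"""

    def __init__(self):
        self.tasks = queue.Queue()
        self._spawn_worker()

    def submit(self, fn):
        self.tasks.put(fn)

    def _spawn_worker(self):
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn = self.tasks.get()
            done = threading.Event()
            # Watchdog: on overrun, recruit another worker for the queue.
            timer = threading.Timer(
                TIME_SLICE,
                lambda: self._spawn_worker() if not done.is_set() else None)
            timer.start()
            try:
                fn()
            finally:
                done.set()
                timer.cancel()
            self.tasks.task_done()
```

With this in place, a small task queued behind a long one waits at most roughly one time slice rather than the long task’s full duration.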
The above would achieve the best of both worlds: small task switching is done efficiently in userspace, while long running tasks are handled preemptively, thus reducing the latency issues of async / threadpool systems.