Something surprising and amusing happened around Thanksgiving 2005. No, God didn't physically appear to receive my thanks… no, not that. I was working over Thanksgiving 2005 on the GPUSort project (with the project lead, Dr. Naga Kiran Govindaraju). The surprising thing was that we were trying to break the sorting benchmarks hosted by Jim Gray at Microsoft Research using, err… a GPU. GPUs have received a fair amount of hype lately for their compute power. Working intimately on a realistic, practical, compute-heavy project using GPUs gave me plenty of opportunity to learn how a GPU achieves this power where two dual-core Xeons fall short of the task. Let me take this opportunity to share my experiences and insights.
The GPU Sort algorithm
The GPU sort, as is evident from the name, does the sort operation on the GPU. Since the data sizes being handled are larger than the memory present in the system, we sort chunks of data first and then merge all the chunks thus obtained. We call these two phases the "internal sort" phase and the "merge" phase. The merge phase runs entirely on the CPU; the internal sort, however, happens on the GPU and CPU combined. We use a hybrid bitonic/radix sort for the internal sort: the GPU sorts the data on the first 4 bytes of each key (actually 30 bits of them) and returns the ordering, and the CPU does the rest of the ordering. This kind of scheme utilizes both the GPU and the CPU and gives us the highest performance.
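To make the division of labor concrete, here is a minimal sketch (not the authors' production code) of the hybrid idea, assuming records are simple `(key, payload)` pairs and using a 4-byte integer prefix to stand in for the 30-bit keys used on the GPU:

```python
def prefix_key(key: bytes) -> int:
    """Numeric sort key built from the first 4 bytes of the record key
    (a stand-in for the 30-bit keys the GPU actually sorts on)."""
    return int.from_bytes(key[:4].ljust(4, b"\x00"), "big")

def hybrid_internal_sort(records):
    # "GPU" pass: order records by the numeric prefix key only.
    records = sorted(records, key=lambda r: prefix_key(r[0]))
    # "CPU" pass: records with equal prefixes are still unordered, so
    # finish each run of prefix ties with a full-key sort. These runs
    # are typically tiny, so this pass is cheap.
    out, i = [], 0
    while i < len(records):
        j = i
        while j < len(records) and prefix_key(records[j][0]) == prefix_key(records[i][0]):
            j += 1
        out.extend(sorted(records[i:j], key=lambda r: r[0]))
        i = j
    return out
```

The point of the split is that the expensive bulk ordering runs on the GPU, while the CPU only has to clean up short runs of prefix ties.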
The GPU cannot do string compares. Hence, the keys are first converted into floating-point numbers so that they can be operated on inside the GPU. This data-representation change is done on the CPU. The internal sort phase is in fact a five-stage pipeline, and we observed that we could run all of the stages in parallel at the bottleneck I/O rate:
- Read data from disk.
- Generate floating point numbers for the first 30 bits of the key.
- Sort the data using these 30 bit keys on the GPU. Complete the rest of the sort on the CPU.
- Re-order the satellite data given the complete sort order.
- Write the sorted records to the disk.
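The five stages above can be sketched as a thread-per-stage pipeline. This is a minimal sketch of the structure (not the production code), assuming each stage can be modeled as a function from buffer to buffer; bounded queues hand buffers between stages, so the whole pipeline settles at the rate of the slowest (I/O) stage:

```python
import queue
import threading

def stage(fn, inq, outq):
    """One pipeline stage: pull a buffer, process it, pass it on."""
    while True:
        buf = inq.get()
        if buf is None:          # poison pill: pass it on and stop
            outq.put(None)
            return
        outq.put(fn(buf))

def run_pipeline(buffers, fns):
    """Run `buffers` through the stage functions `fns`, overlapped."""
    # Small bounded queues keep stages in lockstep instead of letting a
    # fast stage race ahead and hoard memory.
    chans = [queue.Queue(maxsize=2) for _ in fns] + [queue.Queue()]
    threads = []
    for i, fn in enumerate(fns):
        t = threading.Thread(target=stage, args=(fn, chans[i], chans[i + 1]))
        t.start()
        threads.append(t)
    for b in buffers:
        chans[0].put(b)
    chans[0].put(None)
    results = []
    while (r := chans[-1].get()) is not None:
        results.append(r)
    for t in threads:
        t.join()
    return results
```

In the real sorter the stage functions are disk reads, key generation, the hybrid sort, reordering, and disk writes; here any functions will do to show the overlap.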
As you might imagine, reading from and writing to disk are the bottleneck stages, running at 200 MB/s (we use a 4-SATA-disk RAID for input and a separate 4-SATA-disk RAID for output). Generating floating-point numbers is very fast (1 GB/s). Sorting (GPU and CPU combined) runs at 300 MB/s, and reordering runs at 500 MB/s. However, these rates are observed when the system is running just that one stage. The stages share a lot of hardware resources (CPU, bus, memory), and it was quite a balancing act to get all of them working in lockstep.
The GPU has a restricted compute model. The best way to describe it is to think of a programmer-supplied function which takes the current pixel coordinates as its argument and computes the value of that pixel. The GPU takes care of calling this function (in an unspecified order) and takes on the burden of actually putting those pixel values on screen. Thus, a GPU program doesn't have a lot of freedom in memory writes: the only writes possible are of new pixel values at GPU-determined screen (or texture) locations. A normal CPU sort algorithm (like quicksort) therefore doesn't fit the GPU compute model very well. GPUSort instead uses a parallel bitonic sort, which fits the compute model much better. It is interesting to note that bitonic sort has a worst-case (and also average-case) runtime complexity of O(n log²n), which is asymptotically slower than most CPU sort algorithms (O(n log n)).
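The reason bitonic sort fits the GPU is that its compare-exchange pattern is fixed in advance and never depends on the data, so every "pixel" can compute its value independently. A minimal CPU-side sketch of the network (assuming the input length is a power of two, as bitonic sort requires):

```python
def bitonic_sort(a):
    """In-place iterative bitonic sort; len(a) must be a power of two.

    The sequence of (i, partner) compare-exchange pairs is completely
    data-independent, which is what lets a GPU evaluate each position
    in parallel with no control over write locations.
    """
    n = len(a)
    k = 2
    while k <= n:                      # size of the bitonic runs
        j = k // 2
        while j >= 1:                  # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:        # visit each pair once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

On the GPU, each pass of the inner loop becomes one rendering pass in which every output position reads its partner and keeps either the min or the max.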
The sorting on the CPU is performed using an insertion sort variant. Once the data is sorted on the first 30 bits, the remaining unordered data typically appears in small contiguous chunks, for which insertion sort is very good.
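Why insertion sort here? Its running time is O(n + d), where d is the number of inversions, so on the almost-sorted output of the 30-bit GPU pass it is nearly linear. A plain (non-variant) version for reference:

```python
def insertion_sort(a):
    """Textbook insertion sort. Runs in O(n + d) time, where d is the
    number of inversions, so it is close to linear on nearly-sorted
    input such as data already ordered on a 30-bit key prefix."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        # Shift larger elements right; on nearly-sorted data this inner
        # loop almost never runs more than a step or two.
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a
```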
Disk I/O is also a tough nut to crack. Sequential file I/O (on Windows XP) gives us 50 MB/s from a single SATA disk (Western Digital Caviar). A 4-way software RAID provided us with 200 MB/s of disk I/O. We use asynchronous I/O with a large number of outstanding requests and large buffers to get that bandwidth. Surprisingly, software RAID seems to work very well for our job even though it is widely believed to be inferior to hardware RAID.
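The overlapped-I/O idea can be sketched as follows. This is only an illustration, not the Windows code (which used native asynchronous I/O): a thread pool keeps several large read requests in flight at once, which is what lets the disks stay busy.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BUF_SIZE = 1 << 20   # large buffers amortize per-request overhead
OUTSTANDING = 8      # number of requests kept in flight at once

def read_at(path, offset, size):
    # Open per request so each worker thread has its own file position.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def read_file_overlapped(path):
    """Read a file with several outstanding large requests."""
    size = os.path.getsize(path)
    offsets = range(0, size, BUF_SIZE)
    with ThreadPoolExecutor(max_workers=OUTSTANDING) as pool:
        # map() preserves request order, so the chunks reassemble
        # into the original byte stream.
        chunks = pool.map(lambda off: read_at(path, off, BUF_SIZE), offsets)
        return b"".join(chunks)
```

The constants are illustrative; the real numbers were tuned against the RAID configuration described above.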
The merge phase uses a priority heap to merge the sorted chunks. Even here, direct string comparisons perform poorly; the keys are converted to integers, comparing which proves to be a lot faster. To feed the priority heap with data, we keep two input buffers per sorted chunk, one of which is used for I/O while the other is used for the actual merge. Similarly, two output buffers are used to run the merge and the output to disk in parallel. To sustain a high disk input rate, quite a few buffers are prefetched together; this is made possible because we pre-calculate the entire buffer prefetch order for the merge during the internal sort phase itself.
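A stripped-down sketch of the merge (assuming, for illustration, that each sorted chunk is simply a list of byte-string keys, with the buffering and prefetching left out): a heap holds one candidate per chunk, and keys are packed into integers so every heap comparison is a cheap integer compare rather than a string compare.

```python
import heapq

def key_to_int(key: bytes) -> int:
    """Pack the first 10 key bytes into an integer whose numeric order
    matches the byte order of the key."""
    return int.from_bytes(key[:10].ljust(10, b"\x00"), "big")

def merge_chunks(chunks):
    """k-way merge of sorted chunks via a priority heap."""
    # Heap entries are (int_key, chunk_index, position); the chunk
    # index also breaks ties between equal keys.
    heap = [(key_to_int(c[0]), i, 0) for i, c in enumerate(chunks) if c]
    heapq.heapify(heap)
    out = []
    while heap:
        _, i, j = heapq.heappop(heap)
        out.append(chunks[i][j])
        if j + 1 < len(chunks[i]):
            heapq.heappush(heap, (key_to_int(chunks[i][j + 1]), i, j + 1))
    return out
```

The real merge does the same thing, but each "chunk" is fed from its pair of input buffers while the other buffer is being refilled from disk.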
The benchmark we broke is the Indy PennySort benchmark. It consists of sorting 100-byte records whose first 10 bytes are the key. The order between records is defined by strncasecmp() (strnicmp() in Windows-speak) on the key. It is mandatory to fetch the records from the filesystem and write the sorted records back to the filesystem, which means that high-bandwidth I/O is also required. The goal is to sort as fast as possible on as cheap a system as possible; price/performance is what counts.
The previous benchmark holders have very impressive numbers: they sorted 40 GB (433M records) in 1541 seconds on a $614 Linux/AMD system, which is equivalent to a sorting speed of nearly 25 MB/s. Details about that setup can be found at the page linked above. We have been able to sort 25 GB (255M records) in 300 seconds on a $1600 Windows/Pentium/GeForce 7800 GTX system, which comes to around 85 MB/s. Hence we beat the benchmark by (85 MB/s / $1600) / (25 MB/s / $614) ≈ 1.3 times. The details of the setup and algorithms are present in our papers on the topic, the links to which should be coming soon. Here I would like to explain some of my insights (not present in the papers) about the problems CPUs face in gaining performance on a sorting-like workload.
The Memory Bottleneck
It is widely known that CPUs are capable of far more computation than main memory can feed them with. This is often termed the von Neumann bottleneck: the limited capacity of the path between the processor and memory. Much of the slowdown of programs is generally attributed to it, and it is also why increasing the cache on a system significantly boosts its compute abilities; with more of a program's working set available in the cache, the processor doesn't load the memory bus as much. The surprising thing is that GPUs use memory that is quite similar to the conventional memory in PC systems (higher-clocked DDR2 SDRAM; GDDR3 is also very similar). Thus, there has to be more going on than just the memory bus being a bottleneck in the PC. As in any complicated system, the devil in the memory subsystem is in its details. An excellent overview of the way memory works can be found in the article "RAM – Memory Technology Overview" at AnandTech; I will briefly summarize the details here.

To get to the processor, data from memory has to go over two buses: the front side bus (FSB) and the memory bus. Fetching data from memory involves many steps. The processor requests the memory controller to send commands to the memory modules; these commands activate a "row" of memory, and then a "column" of data from it is requested. The memory module then sends a burst of data (the requested column) back to the memory controller, which passes the data on to the processor (cache, whatever…). This burst happens at the peak memory rate, which is equal to the FSB clock speed times the width of the FSB. For example, a Pentium 4 with an 800 MHz FSB (pretty common these days) has a peak memory transfer rate of 800 MHz × 4 bytes = 3.2 GB/s.
In fact, the actual bus clocks of the FSB and the memory bus are both different from 800 MHz, and they are actually the same as each other (!!). Typically the FSB is clocked at 200 MHz, quad pumped (effective clock rate 800 MHz), with a width of 4 bytes, whereas the compatible memory module is clocked at 200 MHz, dual pumped (DDR), with a width of 8 bytes. As you might imagine, this doesn't mean the processor has 3.2 GB/s of dedicated access to system memory. The various steps involved in fetching data from memory each have their latencies, and they take valuable clock cycles (in terms of the actual clock, not the effective clock). This makes it difficult to even get close to peak memory performance (i.e., the peak transfer rate).
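The peak-rate arithmetic is just base clock × pump factor × bus width. Using the figures above, both buses land on the same 3.2 GB/s peak:

```python
def peak_bandwidth(base_clock_hz, pump_factor, width_bytes):
    """Peak transfer rate = base clock x pump factor x bus width."""
    return base_clock_hz * pump_factor * width_bytes

# Figures from the text: 200 MHz FSB, quad pumped, 4 bytes wide...
fsb_peak = peak_bandwidth(200e6, 4, 4)   # 3.2e9 bytes/s
# ...matched to a 200 MHz DDR module, dual pumped, 8 bytes wide.
mem_peak = peak_bandwidth(200e6, 2, 8)   # 3.2e9 bytes/s
```

This matching is no accident: the module's burst has to be consumable by the FSB at full rate, so the two peaks are engineered to line up.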
40GBps is a lot of memory bandwidth… but wait!
GPUs have very high memory bandwidth, in the range of 40 GB/s, made possible by a combination of higher clocks, a dual-pumped bus, and a wide memory bus. The important lesson to take from them, however, is that it is possible to achieve peak memory performance there: GPUSort achieves peak memory performance for all the operations it does on the GPU (and hence is, again 😐, memory-bus limited on the GPU). Where did all the memory command latencies go? Now, I don't know the GPU's internal hardware design (does anybody besides the GPU guys?), but you can draw some logical conclusions from this. There is obviously some memory pipelining going on; in other words, the memory fetches and commands are timed and placed so well that the latencies get hidden behind memory transfers. How can the GPU do that? This is precisely where the GPU compute model works wonders. A GPU generally has a number of "pipelines" through which computation is done; the NVIDIA 7800 GT, for example, has 20 pixel pipelines. Now, as I am told by GPU hackers, each compute pipeline calculates the value of a definite pixel on the screen. The programmer does not decide which pixel value it is calculating; the GPU decides that. The GPU can thus effectively schedule the operations of all its pipelines through its hardware resources, including the memory. The job is made a lot simpler because the same program (called a fragment program) runs on all the pipelines.
Most PC systems have a single memory controller, and a single memory controller can only do so much. Some high-end motherboards do have dual-channel memory controllers, but I have not dug into the details of how effectively they help pipeline memory requests. The conclusion we draw here is that the speed of the memory bus is not the problem for compute jobs on PCs; the ineffective pipelining of memory requests and transfer bursts is more of the problem. I don't think I was ever able to achieve a rate of 3.2 GB/s of memory access. The highest I got in my experiments was about 1 GB/s (a 500 MB/s transfer of data from one buffer to another, counting both the read and the write). I am sure I would need much more careful experiments to figure out the exact numbers.
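For the curious, here is a rough sketch of how such a measurement can be made, in the spirit of the buffer-to-buffer experiment just mentioned (a crude illustration, not the original test harness): time a large copy and count both the read and the write as bus traffic.

```python
import time

def copy_bandwidth(n_bytes=64 * 1024 * 1024, reps=5):
    """Crudely estimate memory-copy bandwidth in bytes/s of bus
    traffic. A buffer-to-buffer copy reads n_bytes and writes n_bytes,
    so the traffic is twice the buffer size."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in reps * [None]:
        t0 = time.perf_counter()
        dst = bytes(src)              # one read + one write of n_bytes
        best = min(best, time.perf_counter() - t0)
    return 2 * n_bytes / best         # best run approximates the peak
```

Real measurements need far more care (warmed caches vs. cold, page faults, allocator behavior), which is exactly the caveat above.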
Where is all the compute? And why all this heat?
I thought a little before writing this section, as I might be completely off on this one. CPU design is complicated, and it is made more complicated by the fabrication process. However, I would like to present some insights from working on parts of GPUSort which made me aware of another CPU bottleneck (one which, not surprisingly, I do find being addressed in upcoming CPU designs). We all know that a CPU has quite a number of functional units inside. As instructions come in, the CPU sends parts of each instruction to various functional units, where they get executed as soon as all the data they need is available. A scheduling engine like the Tomasulo engine is generally used for this. Doing so provides instruction-level parallelism, which is vital for getting performance out of a processor. However, it also means a whole lot of bookkeeping. The processor has to figure out the dependencies between instructions before sending them to the functional units, and then it has to re-order the sequence in which results are played back to the register file or main memory so that program semantics are not changed. All this is a fair amount of logic. It is not so widely known that this process doesn't scale: the complexity of resolving the interdependence of n instructions is proportional to n². This means we have to expend more and more computation (and give off an equivalent amount of heat) to resolve the interdependence of instructions. The CPU designers seem to have come to some kind of conclusion that "it ain't scale no more".
The way I observed this was by profiling a particular segment of my program for cache misses and branch mispredictions to find the reasons for its slowdown. Intel VTune told me that there weren't that many cache misses, and there weren't that many branch mispredictions either. It then struck me that the instructions I was trying to execute were highly dependent on each other. I really started leaning towards the "instruction dispatch bottleneck" theory when I observed the following. In a particular program segment, I had some computation and a 100-byte memory copy repeated a number of times. In order to bunch the memory copies together, I deferred all of them with some bookkeeping and, after a whole bunch of compute was done, finally did the whole bunch of memory copies, where previously I had been finely interleaving the compute and memory operations. The result: performance dropped by a factor of two. Then I realized that the memory copy was not dependent on any of the compute being done, and I arrived at the conclusion that bunching the copies at the end had destroyed the parallelism between the memory copies and the compute, causing the significant slowdown.
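A structural sketch of the two schedules (not a benchmark, and `compute` and `copy_record` are hypothetical stand-ins for the real work): both produce identical results, but on the hardware described above the interleaved schedule ran about twice as fast, because each independent copy could overlap with the neighboring compute.

```python
def interleaved(items, compute, copy_record):
    """Original schedule: compute and independent copies alternate,
    so the out-of-order engine can overlap them."""
    results = []
    for item in items:
        results.append(compute(item))   # compute step...
        copy_record(item)               # ...with an independent copy
    return results                      # right next to it

def batched(items, compute, copy_record):
    """Deferred schedule: all compute first, then all copies; same
    results, but no compute left to overlap the copies with."""
    results = [compute(item) for item in items]
    for item in items:
        copy_record(item)
    return results
```

The lesson is the same one the profiler taught: the scheduler can only extract parallelism from independent work that is actually in its instruction window.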
The way upcoming CPUs address this is by giving the programmer multiple channels through which to feed independent instructions, either via Simultaneous Multi-Threading, or SMT (Intel's Hyper-Threading), or by putting multiple execution cores on the same processor. This shifts the burden of separating dependent and independent instructions onto the programmer, who now has to multiprogram his application. Intel has also taken a radical and refreshing approach to this problem by proposing Explicitly Parallel Instruction Computing, or EPIC, for its Itanium line of processors. Instead of having multiple execution cores, these CPUs offload all the instruction scheduling to the compiler and the application programmer and focus all the silicon on real compute. If done properly, this might be a good (and perhaps the only) avenue for raising the single-threaded performance of processors.