Even an average computer user has noticed an uncomfortable upheaval in the processor scene. First, all of a sudden, MHz was no longer the king of all metrics for defining a processor’s power. Then came the crazy processor naming/numbering conventions. And then came the hyper-threading / multicore phenomenon. Let me take you on a whirlwind tour of these changes and what I make of them.
Moore’s law predicts that processors will continue to show a relentless increase in speed and processing capability over time. Though this has held true to an extent, chip-making companies (Intel, AMD, IBM, Sun, etc.) actually put in a lot of effort to stay on Moore’s curve of progress. Some experts (yup, first-hand information here!) have even said that it’s the chip makers who made Moore’s law that famous and relevant. Let us first look at Intel.
Intel jumped onto the bandwagon of aggressive “Out-of-Order” execution with the Pentium Pro (or was it the Pentium II?). Out-of-Order is a family of techniques that schedules your instructions in an order different from the one in which the processor received them, to gain in performance (often a very big gain). Processors with Out-of-Order execution engines typically have an “Instruction Window” whose width (in number of instructions) tells you how many instructions the processor can consider for concurrent execution. So far things were good. Things changed a little with the introduction of the Netburst architecture, which formed the Pentium 4 line of processors. Netburst was designed so that every instruction could be divided into a very large number of pipeline stages. This would reduce the time taken by each stage and would help increase the clock speed of the processor to ridiculous levels (3 GHz is a really fast clock). I believe the hope was that such deep pipelining, combined with a larger instruction window, would help parallelize instructions across their respective stages using a relatively small number of functional units. In other words, all the processor’s functional units would be kept occupied. However, Intel hit two walls with this architecture.
- Ramping up the clock speed meant increasing the core voltage of the processor, which in turn increased its power consumption and heat dissipation. Pentium 4s were notorious for being hot and power hungry.
- While the strategy of finely subdividing instructions was good, the bottleneck turned out to be instruction dispatch itself. It became impossible to grow the instruction window to larger sizes. Firstly, it is difficult to do dependency checking between concurrently issued instructions at high clock rates and low cycle times. Moreover, there just isn’t that much parallelism available in compiled serial C code, which leaves a lot of pipeline “holes” while running an instruction stream through the processor. Also, the dependency-checking and instruction-dispatch logic on processors was becoming larger and more complex.
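To make the dependency-checking problem concrete, here is a toy sketch (Python, with a made-up three-address instruction format; nothing here resembles real dispatch hardware) that groups an instruction stream by its RAW/WAR/WAW hazards. Note how little of a typical serial snippet can actually issue together:

```python
# Toy dependency check over a short instruction stream.
# Each instruction is (dest, src1, src2); registers are plain strings.
# Two instructions conflict if one writes a register the other reads
# or writes (RAW, WAR, WAW hazards).

def conflicts(a, b):
    dest_a, *srcs_a = a
    dest_b, *srcs_b = b
    return (dest_a in srcs_b or      # RAW: b reads what a writes
            dest_b in srcs_a or      # WAR: b writes what a reads
            dest_a == dest_b)        # WAW: both write the same register

def issue_groups(stream):
    """Greedily group, in program order, instructions that can issue together."""
    groups = []
    for insn in stream:
        if groups and all(not conflicts(prev, insn) for prev in groups[-1]):
            groups[-1].append(insn)
        else:
            groups.append([insn])
    return groups

# A typical serial snippet: r1 = r2+r3; r4 = r1+r5; r6 = r7+r8; r1 = r6+r4
stream = [("r1", "r2", "r3"),
          ("r4", "r1", "r5"),
          ("r6", "r7", "r8"),
          ("r1", "r6", "r4")]
print(issue_groups(stream))  # three groups: only the middle pair can pair up
```

Even this tiny stream of four instructions yields an average of only 1.33 instructions per cycle of issue, and real hardware must do this comparison across the whole window every cycle.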
Everybody knows how AMD swooped in with their Athlons and Opterons and took a large portion of market/mindshare from Intel. Athlons never reached the ridiculous clock speeds of Intel processors but consistently outperformed Pentium 4s.
I believe Intel had a strategy to counter the problems with Netburst quite some time ago. There was also the need to shift from a 32-bit architecture to a 64-bit one, and I believe Intel thought that this transition was the right opportunity to change the underlying architecture altogether. I know I may raise a lot of eyebrows for this, but I think the Itanium was Intel’s solution. However, two things happened that caused a big change in plans.
- The introduction of x86-64 by AMD, an extension of the existing x86 instruction set that led to a high-performance 64-bit desktop/laptop processor. AMD was already doing better than Intel performance-wise, which led to the easy adoption of x86-64 not only on the desktop but also in the low- to mid-range server markets.
- The Itanium didn’t do very well. God knows what happened, but somehow Intel screwed up real bad. The Itanium 2 is trying to make up for some of the damage. No wonder Intel moved to a completely new architecture (Core) for its x86 line.
The end result is that Intel is still holding aggressive Out-of-Order execution close to its chest while keeping the Itanium a generation behind in fabrication process, giving it lots of cache, and marketing it for high-end servers.
This doesn’t mean that the problems with Out-of-Order execution are over. The Itanium was designed specifically with them in mind: it doesn’t carry lots of instruction-dispatch logic, relying instead on the compiler to supply instructions already arranged the appropriate way (look up EPIC for more details). The compiler has a lot more time to find parallelism in the instruction stream than the processor does. This also opens up an option for careful hand optimization that isn’t present in x86. Of course, all this would have been good if the Itanium enjoyed the benefits of the fabrication process and engineering that Core et al. enjoy. The strength of shifting instruction dispatch to the compiler lies in using the die space for more functional units, or reducing the die size altogether to lower costs. This argument also makes the Itanium’s shift to high-end servers look odd, since the die area saved by leaving instruction dispatch out is negligible compared to the die-size increase from larger caches and the like. This strengthens my belief that the Itanium was not targeted at the server markets initially.
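The EPIC idea can be caricatured in a few lines. Here is a hypothetical “compiler pass” (toy instruction format and made-up names, not real IA-64 encoding) that packs independent instructions into fixed-width bundles, padding with NOPs, so the hardware needs no dispatch logic at all:

```python
# Toy EPIC-style bundler: the "compiler" packs independent instructions
# into fixed-width bundles (3 slots here, loosely echoing IA-64's
# three-instruction bundles), padding with NOPs. The hardware then just
# executes each bundle's slots in parallel -- no runtime dependency check.

NOP = ("nop",)

def writes(insn):   # first field is the destination register (toy format)
    return insn[0]

def reads(insn):
    return insn[1:]

def independent(a, b):
    return (writes(a) not in reads(b) and
            writes(b) not in reads(a) and
            writes(a) != writes(b))

def bundle(stream, width=3):
    bundles, cur = [], []
    for insn in stream:
        if len(cur) < width and all(independent(p, insn) for p in cur):
            cur.append(insn)
        else:
            bundles.append(cur + [NOP] * (width - len(cur)))
            cur = [insn]
    if cur:
        bundles.append(cur + [NOP] * (width - len(cur)))
    return bundles

# r1 = r2+r3; r4 = r1+r5; r6 = r7+r8; r9 = r6+r4
stream = [("r1", "r2", "r3"), ("r4", "r1", "r5"),
          ("r6", "r7", "r8"), ("r9", "r6", "r4")]
for b in bundle(stream):
    print(b)
```

The point of the sketch: all the hazard reasoning happened at compile time, and the NOP padding makes visible exactly how much parallelism the compiler did or didn’t find.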
With improvements in fabrication process technology, one can put even more transistors on a processor. However, we just discussed that increasing the number of functional units in a processor wasn’t doing any good: there wasn’t enough parallelism in typical instruction streams, and doing the dispatch on chip was getting more and more complex. The solution from the industry is multicore. Essentially, multicore chips replicate the entire processor core on the same die so that two (or more) simultaneous threads of execution can run on one chip. It’s much easier to do than, say, doubling the instruction window and adding twice the functional units. The performance in the specs comes close to doubling and closely tracks future fabrication process improvements. This keeps chip makers happy. Almost all chip-making companies have announced multicore chips, present or future, including Intel, AMD, IBM, and Sun. However, programming multicores is much more painful than programming in sequential C/C++. Desktop application developers now have to multithread their programs (a hard programming paradigm) to take advantage of the numerous cores present in a processor. Servers, on the contrary, benefit from multicores, as most applications running on them are multithreaded by nature. The biggest hit is probably being taken by the game developer community, who now have to program for chips (Xenon on the Xbox 360 and Cell on the PS3) that have less single-threaded performance per core (again by throwing out the out-of-order dispatch logic) but more absolute computation power overall, achieved by adding more cores (3 for Xenon and 9 for Cell). Though multicore architectures are in, the software industry in general has little idea how to tackle them, especially in desktop applications.
Phew! the industry would say… as it escapes a dead end of instruction dispatch (or Instruction-Level Parallelism, ILP). In my opinion, however, it’s too early to rest. I believe yet another scalability problem lies before us as we step into this multicore era: cache coherence. Multicore processors address the same memory, and hence at some level of the memory hierarchy (L1 cache, L2 cache, L3 cache, or the memory interface) they have to resolve conflicting accesses to memory. As the number of cores increases, multicores will suffer from increasing latencies to the shared cache or bus. Without going into details, let me cite the split of the L2 data and instruction caches on the new Itanium 2 (Montecito) as a first instance of such a problem (check out the last figure). It’s no surprise that the UltraSPARC has very high cache latencies compared to the Intel Core, given its eight cores.
Now what? Unfortunately, I believe the chip giants will give in here too and push the problem to software tools and programmers. In fact, there is already a chip that seems to have traded programming convenience for performance on this front: the Cell. The Cell is definitely multicore, with its in-order PowerPC core and 8 separate SPEs (which seem to be a VLIW/SIMD hybrid). In a way, this is better than having all the cores look alike, as you can delegate OS and device-level work to the one unique core and devote the silicon of the remaining cores to pure compute. However, I kinda favor the Itanium’s or Alpha’s PAL concept for doing OS support in a processor (shifting this complexity into software again). The distinguishing feature of the Cell is that its SPEs have dedicated local stores, and it uses a DMA-like technique to transfer data to and from them: no cache-coherency circuitry to fetch data automatically, but explicitly stated DMA transfers instead. Of course, the interconnect between the SPEs has very high bandwidth, and you could draw a one-to-one correspondence between the Cell and a cluster of 8 computers for HPC. Scaling the Cell in the future is dead simple… increase the number of cores and/or the storage per core. Programming for it is tough: most compilers will need drastic changes to target the SPEs, but the compiler also has the advantage of being dead-on in its predictions of the latencies of various operations.
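A rough sketch of the SPE model (all names here — `dma_get`, `dma_put`, `LOCAL_STORE_SIZE` — are made up for illustration, not the real Cell SDK API): data is pulled into a small private store, processed entirely there, and pushed back, with no hardware coherence in between:

```python
# Sketch of the Cell SPE programming model: each core has a small private
# local store, and data moves in and out only via explicit DMA-like
# copies -- no cache, no hardware coherence. The programmer (or compiler)
# tiles the data to fit.

LOCAL_STORE_SIZE = 256  # words, for this toy; real SPE local stores are 256 KB

def dma_get(main_memory, addr, n):
    """Explicitly pull n words from main memory into a local buffer."""
    assert n <= LOCAL_STORE_SIZE
    return main_memory[addr:addr + n]

def dma_put(main_memory, addr, local):
    """Explicitly push the local buffer's contents back to main memory."""
    main_memory[addr:addr + len(local)] = local

def spe_scale(main_memory, addr, n, factor):
    # Process a large array in local-store-sized tiles.
    for base in range(addr, addr + n, LOCAL_STORE_SIZE):
        size = min(LOCAL_STORE_SIZE, addr + n - base)
        tile = dma_get(main_memory, base, size)      # explicit transfer in
        tile = [x * factor for x in tile]            # compute in local store
        dma_put(main_memory, base, tile)             # explicit transfer out

mem = list(range(1000))
spe_scale(mem, 0, 1000, 2)
```

Because the transfers are explicit, a real SPE program can issue the next tile’s DMA while computing on the current one, which is exactly the latency-hiding the coherence hardware used to attempt for you.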
There is yet another pseudo-approach to the cache-coherency problem. It’s actually pretty subtle, but it has gotten a lot of attention recently: it’s the way GPUs (which have a large number of cores, or “pipeline stages”) do memory. The GPU has a very fixed compute paradigm. Loosely speaking, it’s like a simple function that takes read-only arguments and computes an output. However, the GPU runs this same function over arrays of input values and writes to an array of output values. Very importantly, the GPU decides exactly how to schedule these function computations (which means the function’s output has to be independent of the way it is scheduled) so that it can effectively pipeline the memory accesses of the various compute cores actually doing the computation. I mentioned in my GPUTeraSort article that a memory bandwidth of close to 40 GBps (which is the theoretical peak) could be achieved on the GPU while doing bitonic sort. The approach here is not to somehow eliminate memory access latencies but to hide them effectively during computation. CPUs do have memory-prefetching instructions that do something similar, but GPUs show just how effective this can get when you know some details of the exact latencies of memory/cache operations and combine that with their compute model.
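In miniature, the paradigm looks like this (a toy Python sketch, not real GPU code): a side-effect-free kernel applied over an array, whose output is the same no matter what order the scheduler picks, which is precisely the freedom the hardware exploits to reorder and pipeline memory accesses:

```python
# The GPU compute paradigm in miniature: a pure "kernel" is applied to
# every element of an input array. Because the kernel has no side effects
# and each output slot depends only on its own inputs, the scheduler may
# run the elements in any order (or interleave them to hide memory
# latency) without changing the result.

import random

def kernel(x):
    # Read-only input, one output, no side effects.
    return x * x + 1

def run_kernel(kernel, inputs, schedule):
    out = [None] * len(inputs)
    for i in schedule:          # the hardware picks this order, not you
        out[i] = kernel(inputs[i])
    return out

inputs = list(range(8))
in_order = run_kernel(kernel, inputs, range(8))
shuffled = run_kernel(kernel, inputs, random.sample(range(8), 8))
assert in_order == shuffled     # output is independent of the schedule
```

The contract cuts both ways: the programmer gives up control over ordering (and any kernel-to-kernel communication), and in exchange the hardware gets total freedom to keep its memory pipeline full.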
Exciting times ahead!