Defective Compass

Recent trends in Processor Architecture

Posted in Computing by defectivecompass on July 11, 2006

Even an average computer user knows about an uncomfortable upheaval in the processor scene. First, all of a sudden MHz no longer remained the king of all metrics to define a processor’s power. Then came the crazy processor naming/numbering conventions. And then came the hyper-threading / multicore phenomenon. Let me take you through a whirlwind tour of these and what I feel about the same.

Moore’s law, as it is popularly read, predicts that processors will continue to show a relentless increase in speed and processing capability with time. Though true to an extent, chip making companies (Intel, AMD, IBM, Sun etc.) actually put in a lot of effort to stay on the Moore curve of progress. Some experts (yup, first hand information here!) have even said that it’s the chip makers who made Moore’s law that famous and relevant. Let us first look at Intel.

Intel jumped onto the bandwagon of aggressive “Out-of-Order” execution with the Pentium Pro (the P6 core that the Pentium II later reused). Out of Order is a family of techniques which schedule your instructions in an order different from the order in which the processor received them, to gain performance (often a very big gain). Processors with Out of Order execution engines typically have an “Instruction Window” whose size (in number of instructions) tells you how many instructions the processor can consider for concurrent execution. So far things were good. Things changed a little with the introduction of the Netburst architecture which formed the Pentium 4 line of processors. Netburst was designed so that the execution of every instruction was divided into a very large number of pipeline stages. This reduces the time taken for each stage and helps increase the clock speed of the processor to ridiculous amounts (3GHz is a really fast clock). I believe the hope was that such deep pipelining, combined with a larger instruction window, would keep many instructions in flight in their respective stages using a relatively small number of functional units. In other words, all the processor functional units would be kept occupied. However, Intel hit two walls with this architecture.

  • Ramping up the clock speed meant increasing the core voltage of the processor, which in turn increased its power consumption and heat dissipation. Pentium 4s were notorious for being hot and power hungry.
  • While the strategy of finely subdividing the instructions was good, the bottleneck turned out to be instruction dispatch itself. It became impossible to grow the instruction window much further. Firstly, it is difficult to do dependency checking between instructions issued in parallel at high clock rates and short cycle times. Moreover, there just isn’t that much parallelism available in serial compiled C code. This leaves a lot of pipeline “holes” while running an instruction stream through the processor. Also, the dependency checking and instruction dispatch logic on processors was becoming larger and more complex.
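
The second point can be illustrated with a toy model. This is only a sketch under loud assumptions of my own: one-cycle instructions, unlimited functional units, and a window that simply covers the oldest unfinished instructions. The function name and the three sample programs are made up for illustration and do not model any real processor.

```python
def cycles_to_run(deps, window):
    """Cycles needed to run a program under a toy out-of-order model.

    deps[i] is the set of earlier instruction indices that instruction i
    depends on. Each cycle the core looks at the `window` oldest unfinished
    instructions and executes every one whose inputs are already complete.
    Assumes unlimited functional units and one-cycle latency (hypothetical).
    """
    done = set()
    pending = list(range(len(deps)))
    cycles = 0
    while pending:
        visible = pending[:window]
        # An instruction is ready only if all of its inputs finished in an
        # earlier cycle; results produced this cycle are not visible yet.
        ready = [i for i in visible if deps[i] <= done]
        done.update(ready)
        pending = [i for i in pending if i not in done]
        cycles += 1
    return cycles


# Three 8-instruction toy programs (hypothetical, for illustration only).
chain = [set()] + [{i - 1} for i in range(1, 8)]               # each needs the previous one
independent = [set() for _ in range(8)]                        # no dependencies at all
two_chains = [set(), set()] + [{i - 2} for i in range(2, 8)]   # two interleaved chains

for name, prog in [("chain", chain), ("independent", independent), ("two_chains", two_chains)]:
    print(name, [cycles_to_run(prog, w) for w in (1, 2, 4, 8)])
```

For window sizes 1, 2, 4 and 8 the fully dependent chain always takes 8 cycles, the independent program takes 8, 4, 2 and 1, and the two interleaved chains take 8, 4, 4 and 4. Once the window is wider than the parallelism actually present in the code, making it wider buys nothing.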

Everybody knows how AMD swooped in with their Athlons and Opterons and took a large portion of market share and mindshare from Intel. Athlons never had the ridiculous clock speeds of Intel processors but consistently outperformed Pentium 4s.

I believe Intel had a strategy to counter the problems they were having with Netburst quite some time ago. Also, there was the need to shift from a 32bit architecture to a 64bit architecture. I believe Intel thought that the move from 32bit to 64bit was the right opportunity to change the underlying architecture altogether. I know I may raise a lot of eyebrows with this but I think the Itanium was Intel’s solution. However, two things happened here which caused a big change in plans.

  1. Introduction of x86-64 by AMD, which was an extension to the existing x86 instruction set, leading to a high performance 64 bit desktop/laptop processor. AMD was already doing better than Intel performance-wise, which led to the easy adoption of x86-64 not only on the desktop but also in the low to midrange server markets.
  2. Itanium didn’t do very well. God knows what happened but somehow Intel screwed up real bad. Itanium 2 is trying to cover up for some of the damage. No wonder Intel moved to a completely new architecture (Core) for its x86 line.

The end result is that Intel is still holding aggressive Out of Order execution close to its chest for x86, while keeping the Itanium a generation behind in fabrication process, giving it lots of cache and marketing it for high end servers.

This doesn’t mean that the problems with Out of Order execution are over. Itanium was designed specifically keeping this in mind. Itanium doesn’t have lots of logic for instruction dispatch. It relies on the compiler to supply it instructions in the appropriate way. You should look at EPIC for more details. The compiler has a lot more time to find parallelism in the instruction stream than the processor. Also, this opens an option for careful hand optimization which is not present in x86. Of course all this would have been good if Itanium were using the benefits of the fabrication process and engineering that Core et al enjoy. The strength of shifting instruction dispatch to the compiler comes from using the die space for more functional units or reducing the die size altogether to lower costs. This argument also makes the Itanium’s shift to high end servers look odd, as the die size saved by leaving instruction dispatch out is negligible compared to the die size increase due to the larger cache etc. This strengthens my belief that Itanium was not targeted at the server markets initially.

With improvements in fabrication process technology, one can put an even larger number of transistors in a processor. However, we just discussed that increasing the number of functional units in a processor wasn’t doing any good, as there wasn’t enough parallelism in typical instruction streams and doing the dispatch on chip was getting more and more complex. The solution from the industry is multicore. Essentially, multicores replicate the entire processor core on the same chip so that two (or more) simultaneous threads of execution can run on one chip. It’s much easier to do than, say, doubling the instruction window and doubling the functional units. The performance in the specs becomes close to double and follows future fabrication process improvements very closely. This keeps chip makers happy. Almost all chip making companies, including Intel, AMD, IBM and Sun, have announced multicore chips for the present or the future. However, programming multicores is much more painful than programming in sequential C/C++. Now, desktop application developers have to multithread their programs (which is a hard programming paradigm) to take advantage of the numerous cores present in a processor. Servers, on the contrary, benefit from multicores as most applications running on them are multithreaded by nature. The biggest hit is probably being taken by the game developer community, who now have to program for chips (Xenon on the Xbox 360 and Cell on the PS3) which have lower single threaded performance per core (by again throwing out the out of order dispatch logic) but more absolute computation power overall thanks to many more cores (3 for Xenon and 9 for Cell). Though multicore architectures are in, the software industry, in general, has little idea how to tackle them, especially in desktop applications.
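
To give a feel for what “multithreading a program” means in the simplest possible case, here is a minimal sketch that splits a sum of squares into one chunk per worker. The function names and the chunking scheme are my own assumptions. Note also that in CPython the global interpreter lock keeps CPU-bound threads from really running in parallel, so this shows the shape of the decomposition rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(values):
    """The sequential work done on one chunk."""
    return sum(v * v for v in values)


def parallel_sum_of_squares(values, workers=4):
    """Split the input into contiguous chunks, one per worker, and combine."""
    chunk = max(1, (len(values) + workers - 1) // workers)
    pieces = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each piece is independent, so no locking is needed here.
        partials = list(pool.map(sum_of_squares, pieces))
    return sum(partials)


data = list(range(1000))
print(parallel_sum_of_squares(data) == sum_of_squares(data))
```

This case is easy only because the chunks share nothing. The pain starts when threads have to update shared state, which is where most desktop applications live.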

“Phew!” the industry would say… as they escape the dead end of instruction dispatch (or Instruction Level Parallelism, ILP). However, in my opinion, it’s too early to rest. I believe there is yet another scalability problem coming before us as we step into this multicore era. This second problem is cache coherence. Multicore processors address the same memory and hence at some level in the memory hierarchy (L1 cache, L2 cache, L3 cache or the memory interface) they have to resolve conflicting accesses to memory. As the number of cores increases, multicores will suffer from increasing latencies to the shared cache or bus. Without going into details, let me cite the split of L2 data and instruction cache on the new Itanium 2 (Montecito) as the first instance of such a problem (check out the last figure). It’s no surprise that the UltraSPARC has very high cache latencies compared to Intel Core, given its eight cores.
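
A rough way to see why coherence traffic grows with the core count is to count invalidations for a single shared cache line. This is a toy write-invalidate model of my own, not any real protocol (real ones such as MESI track more states), and the access pattern is a made-up example.

```python
def count_invalidations(accesses):
    """Count invalidation messages for one cache line.

    accesses is a sequence of (core, op) pairs with op "r" or "w".
    Toy write-invalidate rule: before a core writes, every other core
    holding a valid copy of the line must be told to drop it.
    """
    holders = set()          # cores that currently hold a valid copy
    invalidations = 0
    for core, op in accesses:
        if op == "w":
            invalidations += len(holders - {core})
            holders = {core}
        else:
            holders.add(core)
    return invalidations


def shared_line_trace(num_cores, rounds):
    """Hypothetical pattern: every core reads the line, then core 0 writes it."""
    trace = []
    for _ in range(rounds):
        trace += [(c, "r") for c in range(num_cores)]
        trace.append((0, "w"))
    return trace


for cores in (2, 4, 8):
    print(cores, count_invalidations(shared_line_trace(cores, rounds=10)))
```

With ten rounds this gives 10, 30 and 70 invalidations for 2, 4 and 8 cores: the same program produces more coherence traffic simply because more cores are watching the line.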

Now what? Unfortunately, I believe the chip giants will give in here too and push the problem to software tools and programmers. In fact, there is already a chip which seems to have traded programming convenience for performance on this front: the Cell. Cell is definitely multicore with its in-order PowerPC core and 8 separate SPEs (which seem to be a VLIW SIMD hybrid). In a way this is better than having all the cores look alike, as you might delegate OS and device level stuff to the one unique core and devote the silicon of the rest of the cores to pure compute. However, I kinda favor the Itanium’s or Alpha’s PAL concept for doing OS support in a processor (shifting this complexity into software again). The distinguishing feature of the Cell is that its SPEs have dedicated local stores and it uses a DMA like technique to transfer data to and from these local stores. There is no cache coherency circuitry to automatically fetch the data, only explicitly stated DMA transfers. Of course the interconnect between the SPEs is very high bandwidth and you could draw a one to one correspondence between the Cell and a cluster of 8 computers for HPC. Scaling the Cell in the future is dead simple… increase the number of cores and/or storage per core. Programming for it is tough. Most compilers will need to be changed drastically to target the SPEs, but the compiler also has the advantage of being dead-on in its predictions of the latencies of various operations.
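
The local store idea can be sketched as a loop of explicit “get, compute, put” steps. Everything here is hypothetical: the tiny local store size and the names dma_get, dma_put and spe_scale are mine and are not the Cell SDK API. It only shows that the compute step never touches main memory directly.

```python
LOCAL_STORE_WORDS = 4   # hypothetical, absurdly small local store


def dma_get(main_memory, start, count):
    """Explicitly copy a block from main memory into a local buffer."""
    return main_memory[start:start + count]


def dma_put(main_memory, start, local_buffer):
    """Explicitly copy a local buffer back out to main memory."""
    main_memory[start:start + len(local_buffer)] = local_buffer


def spe_scale(main_memory, factor):
    """Scale every element in place, one local-store-sized block at a time."""
    for start in range(0, len(main_memory), LOCAL_STORE_WORDS):
        local = dma_get(main_memory, start, LOCAL_STORE_WORDS)
        local = [x * factor for x in local]   # compute sees only the local store
        dma_put(main_memory, start, local)


memory = list(range(10))
spe_scale(memory, 3)
print(memory)
```

Since every transfer is spelled out, the programmer (or compiler) knows exactly when data moves. A real program would likely issue the next transfer while computing on the current block, which this sketch does not attempt.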

There is yet another pseudo-approach to the cache coherency problem. It’s actually pretty subtle but it has got a lot of attention recently. It’s the way GPUs (which have a large number of cores or “pipeline stages”) do memory. The GPU has a very fixed compute paradigm. Loosely speaking, it’s like a simple function which takes in read only arguments and computes an output. However, the GPU runs this same function on arrays of input values and writes to an array of output values. Very importantly, the GPU decides exactly how to schedule these function computations (which means that the function’s output has to be independent of the way it is scheduled) so that it can effectively pipeline memory accesses by the various compute cores actually doing the computation. I mentioned in my GPUTeraSort article that a peak memory bandwidth of close to 40 GB/s (which is the theoretical peak) could be achieved on the GPU while doing bitonic sort. The approach here is not to somehow eliminate the memory access latencies but to hide them effectively during computation. CPUs do have memory prefetching instructions which do something similar, but GPUs show just how effective it can get when you know some details of the exact latencies of memory/cache operations and combine them with the compute model.
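
Loosely, the compute model above looks like the following sketch. The kernel and the two schedules are made-up examples. The only point is that because the kernel reads its inputs, writes exactly one output slot and touches nothing else, whoever runs it is free to choose the order.

```python
import random


def kernel(a, b):
    """A pure function of read only inputs, like a simple fragment program."""
    return a * b + 1


def run_kernel(inputs_a, inputs_b, schedule):
    """Apply the kernel to every index, in whatever order `schedule` says."""
    output = [None] * len(inputs_a)
    for i in schedule:                 # the "GPU" picks the order, not the programmer
        output[i] = kernel(inputs_a[i], inputs_b[i])
    return output


xs = list(range(16))
ys = [2 * x for x in xs]

in_order = list(range(16))
shuffled = in_order[:]
random.Random(0).shuffle(shuffled)

print(run_kernel(xs, ys, in_order) == run_kernel(xs, ys, shuffled))
```

Both schedules give the same output array, which is exactly the freedom the hardware exploits to line up memory accesses.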

Exciting times ahead!

On Semantic Storage

Posted in Computing by defectivecompass on July 10, 2006

I came across the following project GLScube on “structured semantic storage” which I found very interesting. GLS is a storage system for Linux which does away with traditional filesystem concepts and introduces tags and hierarchically organized “virtual collections” (of tag based search predicates) for all filesystem data organization. While this is a great idea, one should not be carried away with buzzwords and throw out the established filesystems. It is easy to see how this can be “emulated” using filesystems. It is as simple as using a “tags” folder somewhere in the filesystem, with one subfolder per tag, and hard linking the “tagged” files into those folders. Virtual collections as system directories might require some kernel magic (like Linux FUSE) but essentially they remain a hierarchy of tag based search predicates. This could be implemented in user space using a simple text file which stores the names of these virtual collections and their tag based predicates in some format. A tool (commandline or graphical) could be used to show these virtual collections and list the files in the respective tag directories.
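
Here is a minimal sketch of that emulation, assuming a layout of my own choosing (a “tags” directory with one subdirectory per tag) and treating a virtual collection as just “files carrying all of these tags”. The function names are hypothetical. Matching on file names is a simplification, and hard links only work within a single filesystem.

```python
import os
import tempfile


def tag_file(root, path, tag):
    """Hard link `path` into root/tags/<tag>/ and return the link path."""
    tag_dir = os.path.join(root, "tags", tag)
    os.makedirs(tag_dir, exist_ok=True)
    link = os.path.join(tag_dir, os.path.basename(path))
    if not os.path.exists(link):
        os.link(path, link)
    return link


def files_with_tags(root, *tags):
    """A toy 'virtual collection': names that carry every one of the tags."""
    names = None
    for tag in tags:
        tag_dir = os.path.join(root, "tags", tag)
        found = set(os.listdir(tag_dir)) if os.path.isdir(tag_dir) else set()
        names = found if names is None else names & found
    return sorted(names or [])


# Small demonstration inside a throwaway directory.
root = tempfile.mkdtemp()
docs = os.path.join(root, "docs")
os.makedirs(docs)
for name in ("report.txt", "notes.txt"):
    with open(os.path.join(docs, name), "w") as handle:
        handle.write(name)

tag_file(root, os.path.join(docs, "report.txt"), "work")
tag_file(root, os.path.join(docs, "report.txt"), "urgent")
tag_file(root, os.path.join(docs, "notes.txt"), "work")

print(files_with_tags(root, "work"))
print(files_with_tags(root, "work", "urgent"))
```

The file still lives in its ordinary place in the hierarchy. The tag directories are just extra names for the same data.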

It should be noted that keeping the files in the respective tag directories also helps cluster the files under the same tag around the same portion of the disk for faster access. GLS might be able to pull off some more advanced clustering schemes with the overlap of files between tags. However, note that a filesystem with hard links doesn’t lack any information for similar clustering opportunities.

Content based search is a totally different beast. Spotlight (Mac OS X) and Beagle (Linux) use kernel driven filesystem notifications to find recently updated files and “crawl” over them to update a database with metadata for later search and retrieval. There are several engineering issues with maintaining the “crawl” as a strictly background operation and keeping the prefix based search on the database fast (for search-as-you-type applications).
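
For the search-as-you-type part, one simple way to keep prefix lookups fast is a sorted list of terms plus binary search. This is my own illustration and not how Spotlight or Beagle actually store their indexes. The function name and the sample terms are hypothetical.

```python
import bisect


def prefix_search(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms starting with `prefix`.

    sorted_terms must be sorted. All terms sharing a prefix sit next to
    each other, so one binary search finds where the run begins.
    """
    start = bisect.bisect_left(sorted_terms, prefix)
    results = []
    for term in sorted_terms[start:start + limit]:
        if not term.startswith(prefix):
            break
        results.append(term)
    return results


terms = sorted(["beagle", "beacon", "bear", "search", "spot", "spotlight"])
print(prefix_search(terms, "bea"))
print(prefix_search(terms, "spo"))
```

Each keystroke then costs one binary search plus a short scan, which is why the hard engineering tends to be in keeping the index up to date in the background rather than in the lookup itself.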

Personally I find the UNIX hierarchical filesystem perfect. Tagging is like restricting a perfectly good idea. Perhaps the idea of keeping the user’s freedom to determine the filesystem layout ought to be given more serious thought. A content based search engine over the UNIX filesystem should be more than adequate for the needs of an average to power computer user.

An interesting dimension to the storage problem is “typing” the document (with say XML Schemas). Besides helping with “crawling” over the content of the document, I would like to mention another interesting thing one might do with it. In fact I had a post on the idea (XVM) some time ago.

The adoption of IPv6

Posted in Internet by defectivecompass on July 2, 2006

It has been more than a decade since IPv6 first hit the news. It was created to address the small address space of IPv4 (2^32 addresses). IPv6 supports 2^128 addresses, which could enable every device on the planet to have its own unique IPv6 address.
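
To put the two numbers side by side, here is a small sketch using Python’s standard ipaddress module. The figures are just the sizes of the two address spaces and say nothing about how addresses are actually allocated.

```python
import ipaddress

ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

print(ipv4_total)                 # 4294967296
print(ipv6_total)                 # 340282366920938463463374607431768211456
print(ipv6_total // ipv4_total == 2 ** 96)
```

In other words, there is room for 2^96 complete copies of the IPv4 address space inside IPv6.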

I was going through the book “IPv6: The New Internet Protocol” by Christian Huitema some time ago, which gave me a little more perspective on this beast. IPv6 is not just about the expansion of the address space to 128 bits but a whole lot more. It also encompasses network autoconfiguration, security and multicast in a much cleaner design. There are features in IPv6 which help in much faster routing (like dropping per hop checksum validation). However, even these have not been enough to lure ISPs and other networks to adopt it. The most common reasons are the inertia of the present Internet and the added expense of gateways and support personnel.

Going through this book I found yet another technical reason which might hinder the adoption of IPv6. However, I never came across anything similar anywhere so I thought I would write this down. The current biggest nightmare in routing IPv4 on the Internet is the size of the routing tables. This happens mainly because the owner of the IPv4 address is the firm which has the end systems for them. When the firm moves somewhere else, or if the allocation is given to a different firm, the routers in the Internet must be configured so that packets for those addresses are now delivered to the new location.

An IP packet is forwarded depending on the ‘prefix’ of its destination address. A shorter prefix means a larger set of addresses is forwarded in the same direction (on the same link) and vice versa. Moving the location of a small set of addresses causes a long prefix to be added as an exception to the list of short prefixes in the routing table. If a large number of such moves happen then the routing table will be filled with a large number (tens of thousands) of such exceptions. Also, if these address moves span large geographic distances then the routing tables of many routers will be affected. A large routing table hinders fast search for the appropriate destination of a packet and increases the cost of the router because of the need for faster processors and faster memory.
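
The “exception” mechanism is longest prefix matching. Below is a minimal sketch using Python’s ipaddress module. The table entries and link names are made up, and a real router would use a trie or dedicated hardware rather than a linear scan.

```python
import ipaddress


def next_hop(routing_table, destination):
    """Return the link of the longest prefix that covers `destination`."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, link in routing_table:
        network = ipaddress.ip_network(prefix)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, link)
    return best[1] if best else None


# Hypothetical table: one short prefix, plus a long-prefix exception for a
# firm that moved and took its /24 along to a different link.
table = [
    ("10.0.0.0/8", "link-A"),
    ("10.1.2.0/24", "link-B"),
]

print(next_hop(table, "10.1.2.7"))    # link-B (the exception wins)
print(next_hop(table, "10.1.3.7"))    # link-A
print(next_hop(table, "192.0.2.1"))   # None (no route)
```

Every such move adds one more long-prefix row that must be checked, which is how tables end up with tens of thousands of entries.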

IPv6 was designed to overcome this problem too. The basic idea is that the IPv6 addresses are no longer the property of the firm having the end systems. The addresses belong to the ISP, whose geographical location (or its relative location in the ISP map) is well defined. Thus, addresses are allocated to these ISPs in such a way that no prefix exceptions are required while forwarding packets. This has two important side effects which are, I think, additional reasons hindering the adoption of IPv6.

  • The fact that firms with end systems don’t own their addresses, together with the dynamic nature of IP addresses, requires a change in the DNS configuration of the firm whenever the firm moves geographically. This might not seem like a big hassle but with IPv4 it was not present at all.
  • Whenever a firm leases lines with two or more ISPs it may have two or more addresses per end host system. Packets having one of these destination addresses will never go through the other ISP (they won’t forward packets on another prefix). Thus to take advantage of multiple ISPs leasing lines to a firm, one must deal with each of the address sets given to the firm by every ISP. IPv6 supports having multiple IP addresses for a given network interface. It is also possible to configure DNS such that a name on the Internet resolves to numerous IP addresses. However, this again puts more administrative stress on the DNS.

Firms often lease lines from multiple ISPs to improve the reliability and performance of their access to the Internet. However, given the separation in the IP addresses being forwarded by the ISPs, we lose this reliability and performance advantage. It is like the network shedding another piece of its “maintenance intelligence” and hoping that an end-to-end mechanism exists to deal with it (very much like the end to end TCP congestion control we have on the Internet now). Fortunately such an end to end technology is already present. SCTP, or Stream Control Transmission Protocol, is a transport protocol for the Internet like TCP. In addition to the features present in TCP it also supports transparent multihoming, which is a fancy term for “multiple IP addresses”. Thus, an SCTP stream could connect to all the IPv6 addresses of the remote IPv6 capable system (which has two or more ISPs providing it Internet access) and have transparent failover from one set of IP addresses to another. Hopefully, because the abstraction for connecting to a site in both the IPv4/TCP and IPv6/SCTP cases remains the same up to the DNS resolution [addr_abstraction=gethostbyname(site_name); connect(addr_abstraction)], migration to this new paradigm will not be difficult.
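
To make the “resolve, then connect” abstraction concrete, here is a hedged sketch that resolves a name to all of its addresses and tries them in turn. Note what it is not: it uses plain TCP sockets and only fails over at connect time, in the application, whereas SCTP multihoming does this inside the transport and for an already established association. The function names are my own. The resolver and connector are passed in as parameters so the logic can be tried out without a network.

```python
import socket


def resolve_all(host, port):
    """Return (family, sockaddr) for every address the resolver knows."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(family, sockaddr) for family, _type, _proto, _canon, sockaddr in infos]


def open_tcp(family, sockaddr, timeout=3.0):
    """Open a TCP connection to one specific address."""
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return sock


def connect_with_failover(host, port, resolve=resolve_all, connect=open_tcp):
    """Try each resolved address in order and return the first that works."""
    last_error = None
    for family, sockaddr in resolve(host, port):
        try:
            return connect(family, sockaddr)
        except OSError as err:
            last_error = err          # remember why, then try the next address
    raise last_error or OSError("no addresses for %s" % host)
```

A caller still writes something like connect_with_failover(site_name, 80), which is the same shape as the gethostbyname/connect pair above, with the list of addresses hidden behind the name.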