Case against the filesystem

I think that as computing trends change with multiple devices, personal storage, backup and professional workflows, the filesystem is starting to look like an odd technology. Maybe because of that, most tablets and smartphones do away with the filesystem idiom completely and deliver the experience directly through apps. It is well known that the filesystem is one of the most intimidating concepts for a beginner computer user to grasp, and that it takes them a while to figure out the nuances around files (as pure data stripped of any functionality), folders, file permissions and the locality of that data. Given that, I think it is a good thing that the iOS and Android platforms on recent smartphones and tablets did away with these concepts.

However, something good about the filesystem was lost too, and I think this is a loss for power users and for a portion of professional users (those who like dealing with data directly, or whose job requires it). Consider the use of the filesystem in Unix, where small Unix tools interact with each other using pipes and files to get some serious data processing done. Or consider a professional photographer who uses the filesystem to dump results from one photo manipulation program and load them into another for the second part of his workflow. Notwithstanding the business aspirations of cloud storage providers, who would like to sell ‘access’ to data to applications instead of just storage capacity… it looks like something was lost here.

Storing data persistently has been seen as the key problem a filesystem solves. I would argue that filesystems (try to) solve another important problem which is very relevant to power users: getting multiple programs to work together. Data persistence is a pure side effect. With the recent explosion in storage technologies behind many applications, data persistence can now be thought of as orthogonal to filesystems. For instance, Spotify playlists or Outlook email / calendaring obviously require storing data persistently, but we don’t interact with the filesystem directly to access that data. The fact that Outlook accesses the local filesystem and that Spotify targets cloud storage is inconsequential under the above interpretation of the filesystem; those choices were made purely to get data persistence easily.

I think the core use case of the filesystem is still very relevant today, esp. for power users: a technology to store data in application-independent formats to exchange data between applications, and to store it persistently while you are at it. Unix used text as the application-independent format, and it helped Unix folks literally pull off insane-looking tricks using a small set of tools like sed, awk, cut, paste, less etc. However, when Unix was created, almost all applications were text based, so it was easy to just standardize on text as the independent format. Nowadays photos, video, audio, multi-lingual text and vector graphics dominate applications. It’s hard to use the Unix filesystem model and its set of tools to do similar multi-program activities with these new forms of digital content. At the same time, the task is further confused because filesystems expose complex constructs such as folders, symlinks and access restrictions which have nothing to do with the basic interaction between programs.

We should change this sad state of affairs… esp. for power users, and esp. for newer media types than pure text. At the same time, we should try to fix only what’s broken, leaving folders (high level data organization) and access permissions as orthogonal problems. Maybe in the longer run it is a good idea to keep the folder and access permission problems separated out anyway, so that they can be taken further independently.

Hopefully the _problem_ being attacked is well conveyed by now, so let’s take a look at the proposed solution. I will take us through four takes on it, each one building on the previous to help pro users do better with the technology.

Bigger problems require (somewhat) complex solutions

The bigger problem here is really more complex applications which deal with various forms of digital data. The data may not be only text, and may have semantic meaning to it (e.g. an email message has Sender, Subject and Message sections which apps need to interpret as more than pure text). The interactions between our mini programs may not be just sending text streams back and forth. They may be synchronized interactions between applications… maybe a program needs an update when some data changes.

The first step is to think of filesystem data not as residing in binary files of arbitrary length, but as persistent objects. These objects are usually very small, since each describes something very basic (like the Subject of an email, the image data of a photo, or one of the tags on that photo). These objects often link to other objects to form complex object structures. They are locally resident on the computer and provide programs very high performance access to the data (much like Unix filesystems). Very importantly, these objects can be inspected in human-readable terms by standard tools. The object formats used are standard across a wide range of tools, allowing the user to build complex workflows out of those tools.

While this might look like a tall order, it is not that hard to implement on a modern Linux system. We will use the filesystem to get data persistence, along with folders for high level organization and access control. Inside a folder we will have files named <fileId>.<size>, where each file stores objects of size <size>. These files are of fixed lengths; for any given object size, once more objects need to be created, new files of exponentially increasing lengths are created. If objects are deleted, their space is marked as unallocated so that the next allocation for that object size can be done there. Using a fixed object size per file makes such allocations and de-allocations easy, as sketched below. The underlying filesystem takes care of the larger and more variable block size allocations. Note that it is quite possible for us to accumulate a lot of garbage here, but hold on to that thought for a while.
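
To make this concrete, here is a minimal C sketch of allocation inside one such <fileId>.<size> file, assuming a free-slot bitmap kept at the front of the mapping; all the names (obj_file, alloc_slot, free_slot, slot_offset) are illustrative, not from an existing implementation.

#include <stdint.h>

typedef struct {
    uint32_t file_id;       /* the <fileId> part of the file name */
    uint32_t obj_size;      /* the <size> part: every object in this file is this big */
    uint32_t num_slots;     /* fixed number of object slots in this file */
    uint8_t *base;          /* mmapped start of the file */
} obj_file;

/* The first (num_slots+7)/8 bytes of the mapping are a bitmap: bit set = slot in use. */
static int64_t alloc_slot(obj_file *f)
{
    uint8_t *bitmap = f->base;
    for (uint32_t i = 0; i < f->num_slots; i++) {
        if (!(bitmap[i / 8] & (1u << (i % 8)))) {
            bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
            return i;                              /* a slot index, not a pointer */
        }
    }
    return -1;              /* file is full: the caller creates the next, larger file */
}

static void free_slot(obj_file *f, uint32_t slot)
{
    uint8_t *bitmap = f->base;
    bitmap[slot / 8] &= (uint8_t)~(1u << (slot % 8));   /* mark the space unallocated */
}

/* Byte offset of a slot's object data, past the bitmap region. */
static uint64_t slot_offset(const obj_file *f, uint32_t slot)
{
    uint64_t bitmap_bytes = (f->num_slots + 7) / 8;
    return bitmap_bytes + (uint64_t)slot * f->obj_size;
}
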
Programs allocate objects in these files. The files are mmapped into each program’s address space so that programs have very high performance access to the data. Different programs essentially interact over shared memory, with synchronization primitives exposed as low level objects in those same files. Note that links between objects cannot simply be raw pointers, because there is no guarantee that the files will be mapped at the same address in every invocation of every process. Thus, a link is broken into two parts: a fileId and an offset. Every process maintains a map from fileId to the starting address of the mapping, to which the offset is added to get the actual location of the object in that process’ address space.
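
A rough C sketch of what such a link and its resolution could look like, assuming a simple per-process array indexed by fileId; the names are mine and purely illustrative.

#include <stddef.h>
#include <stdint.h>

#define MAX_FILES 1024

typedef struct {
    uint32_t file_id;        /* which <fileId>.<size> file the target object lives in */
    uint32_t offset;         /* byte offset of the target object inside that file */
} obj_link;

/* Per-process: filled in as files get mmapped; indexed by fileId. */
static void *file_base[MAX_FILES];

/* Turn a stored link into a usable pointer in this process. */
static void *resolve(obj_link link)
{
    if (link.file_id >= MAX_FILES || file_base[link.file_id] == NULL)
        return NULL;                          /* that file isn't mapped (yet) */
    return (uint8_t *)file_base[link.file_id] + link.offset;
}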

This alone would take us very far. If the tools really agree on standard object structures then we could be looking at a very rich toolset suitable for a wide variety of workflows with present day structured text and digital media.

Adding Safety and Garbage Collection via types

The above scheme works well as long as the programs which allocate objects don’t leave a lot of garbage in the object store. If they do, we have no way of compacting those files, because we don’t have a standardized way of knowing, across all objects, where links to other objects are embedded inside an object (so that they can be corrected upon compaction).

The other problem is type safety. Different programs are going to interact with the same object store, so having a mechanism which helps them not overstep on each other’s private object structures would be good. Unix allowed overwriting files without regard to safety. Here, we will at least try to match the object type description embedded in the program against the type information in the object store before we interchange data.

A simple typing scheme can be built on top of the typeless scheme described above. Essentially, we keep a set of singleton type description objects in the object runtime, and we include a link to the type description as the first member of every object stored in the object store. Accesses to objects are then checked against the data structure type information present in the program.
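
A hypothetical C layout for this scheme, assuming type descriptors also record where links live inside instances (which is exactly what compaction below needs); none of these names come from a real system.

#include <stdint.h>
#include <string.h>

typedef struct { uint32_t file_id, offset; } obj_link;

typedef struct {
    char     name[32];            /* e.g. "EmailMessage" */
    uint32_t size;                /* size of an instance, in bytes */
    uint32_t num_links;           /* how many obj_link fields an instance carries */
    uint32_t link_offsets[8];     /* byte offsets of those links inside an instance */
} type_desc;

typedef struct {
    obj_link type;                /* first member: link to the singleton type_desc */
    /* the typed fields of the object follow */
} obj_header;

/* Before handing an object to program code, check that the program's idea of
 * the type matches what the store recorded (here only by name and size). */
static int type_matches(const type_desc *in_store, const type_desc *in_program)
{
    return in_store->size == in_program->size &&
           strcmp(in_store->name, in_program->name) == 0;
}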

Adding types also means that we now know where links are embedded inside objects, so offline compaction is definitely possible. We can also have a set of generic tools which pretty print / inspect the actual data present in an object store, which could be useful.

True language independence

While most primitive types are represented in standard ways across all languages (integers, floats, links), some types are not. For example, booleans may be represented differently in different languages. Struct layouts and array layouts could also differ. This creates a problem because we want to use mmapped shared memory across different processes (in different languages) to interact with this data. We could embed our own types in each language which conform to these basic object types. However, it would be much better if we could reuse the existing bool, struct and array types in those languages directly, for ease of programming.

Now this makes things hard. We can no longer do shared memory accesses for these objects. There would need to be a language independent representation of the objects for storing them in the object store, and language implementations would make a _copy_ of each object in their own representation. However, these copies can now get out of sync with each other as different processes (in different languages) update and read the data. In fact, we would have to build a set of coherence rules for the various copies and how they interact with locking, so that we could guarantee that objects are coherent across programs once locks are acquired / released. This is not very different from the challenges seen in shared memory processing with locks, where after locks are released the CPU-cached copies of memory need to be brought in sync. It’s typically enforced by memory barriers (invoked inside lock acquire / release) which essentially invalidate local caches and provide a way to order reads and writes. This could definitely be done for our object runtime as well: lock acquire and release cause updates of the object store from the local copies and invalidation of shared copies. Coherence would be undefined (or unusable) if locks are not used (similar to shared memory).

IMHO, such memory or cache barriers are very heavy. Upon a lock acquire or release, the runtime has no idea which of the shared objects actually need to be cache invalidated; so it invalidates all shared objects. With shared memory, the CPUs have come up with all kinds of tricks (MOESI cache coherence for instance) where CPUs track the state of shared cache lines and know when the different copies are out of sync. The algorithm there essentially tries to identify which cache lines are not really shared between different CPUs and avoids invalidating them upon memory barriers. However, if anything is shared, it will be invalidated upon memory barriers.
In software land, there may be better ways of doing data synchronization. One interesting way is Hoare’s CSP, or Rob Pike’s channels / goroutines way of handling synchronization. In this mechanism, instead of acquiring locks, data is passed around between threads using blocking send / recv primitives. Shared access to data between different threads is left undefined unless that sharing was coordinated with sends / recvs. This makes managing coherency of shared data much easier, because now you have data context around synchronization events (blocking sends / recvs). No data other than what was sent using send() and received using recv() needs to be made coherent, making coherency much easier. The cleanliness of this abstraction is appealing enough to make us wonder why this is not done for CPU cache coherency / synchronization. But let’s not go there today :).
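
For illustration, here is a minimal C analogue of that send / recv idea using pthreads: a one-slot blocking channel carrying object links, where only the data handed over through the channel is treated as coherent. This is my own sketch of the CSP style, not Go’s actual channel implementation.

#include <pthread.h>
#include <stdint.h>

typedef struct { uint32_t file_id, offset; } obj_link;

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    obj_link        slot;
    int             full;         /* 1 while slot holds an unreceived link */
} channel;                        /* init with PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER */

static void chan_send(channel *c, obj_link v)
{
    pthread_mutex_lock(&c->mu);
    while (c->full)               /* block until the receiver has taken the last one */
        pthread_cond_wait(&c->cv, &c->mu);
    c->slot = v;
    c->full = 1;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->mu); /* the unlock doubles as the memory barrier */
}

static obj_link chan_recv(channel *c)
{
    pthread_mutex_lock(&c->mu);
    while (!c->full)              /* block until something has been sent */
        pthread_cond_wait(&c->cv, &c->mu);
    obj_link v = c->slot;
    c->full = 0;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->mu);
    return v;
}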

So, if we wanted true language independent object sharing, then we would need to make copies, and we would need to employ a synchronization mechanism which can handle coherence between copies. Not surprisingly, this is expected to add a lot of complexity to the object runtime.

Unix synchronization primitives at the shell or command line level never dabbled in shared memory directly. They were based on send / recv over pipes and files. So in a way, the best practices in Unix around composing programs together lean towards the send / recv model.

Distributed Objects

It’s hard to ignore sharing and synchronizing data across networks. Some network filesystems (NFS, CIFS) are good in terms of correctness and performance, but in general their behavior around network failures is not. Recently we have been seeing ‘syncing’ services as the alternative and more failure-tolerant way of handling distributed data. However, syncing services lack any synchronization primitive between programs executing on the different nodes, so they are only useful with regular filesystem-using apps when the user manually ensures that only one app is using the data being synced.

The question is, could we take our shared memory or send / recv based object sharing to distributed systems? Probably yes, but it is not expected to work very well, and it would have the same shortcomings as network filesystems. Synchronization latency is low when done locally on a single computer (compared to the latency of storing data, for example). However, synchronization latency is high on local area networks and even higher and unpredictable on the Internet. When programming locally using shared memory or send / recv, we never have to think about failures in lock acquisition or failures in send / recv. We would have to consider those failures with distributed objects, which makes the programming model cumbersome. More importantly, I think there needs to be a larger model shift with distributed objects, because it is not possible to guarantee synchronization in a timely manner in distributed systems. We will need to give up synchronization, or at least make its role less critical, when dealing with distributed systems. Let me motivate this via an example.

Let’s say I have an email program. In fashionable style, I have constructed a view script which inspects the object store to find a list of emails and emits a bunch of vector graphic descriptions into the object store. A separate ‘fetch email’ program regularly updates the list of emails, which causes my view script to update the graphic descriptions, which another program displays on my screen or embeds somewhere else. Locally, given that I can do synchronization in a timely manner, I _depend_ on it. I _expect_ that a fetch of email will update my display (there is no other way of knowing whether I really fetched something). However, my mental model of the email program changes when I am fetching email remotely. I _don’t expect_ that my local display will change in a timely manner when I have email available on some remote email server.

I believe we can use this expectation of ours to keep the local programming model simple (without distributed failure assumptions). The question really becomes: what should we adopt as a good means of sharing / synchronizing data in distributed systems?

With distributed systems, I think we should allow for multiple versions of the same object structure to be present on different nodes. The programming model should encourage working with non-coherent copies of data and using synchronization lazily to arrive at results. It should encourage parallel data fetches, with synchronization used to resolve the fetched data, instead of synchronizing on which data needs to be fetched first and then fetching it (and then fetching the next, etc.). Synchronization in distributed systems is done using a combination of polling / pub-sub, which will be riddled with delays and failures. The idea is to move most of the programming model away from synchronization so that we don’t have to deal with failures there.

I think the runtime to do this would involve immutable objects with global ids and a mechanism to automatically fetch those objects from other nodes in the system. Synchronization primitives use polling / pub-sub underneath to find the most recent version of the data, but don’t expose an update locally unless all the objects related to that pub-sub event have also been fetched without failures. The use of immutable objects lets us represent updates to large and complex object structures by just adding a few objects, not to mention allowing aggressive caching of those objects across the network. This hides network errors behind the runtime, but at the same time gives us a mechanism for doing distributed data synchronization.
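
A very rough C sketch of what the local side of such a runtime might look like, assuming 128-bit global ids and a stubbed-out network fetch; everything here is an assumption for illustration, not an existing API.

#include <stdint.h>

typedef struct {
    uint64_t hi, lo;                     /* 128-bit global id, e.g. a content hash */
} global_id;

typedef struct cached_obj {
    global_id id;
    uint32_t  size;
    uint8_t  *data;                      /* immutable once stored */
    struct cached_obj *next;
} cached_obj;

static cached_obj *cache_head;           /* local cache of fetched objects */

/* Stub: in a real runtime this would ask other nodes for the object.
 * A NULL return just means "not available yet". */
static cached_obj *fetch_from_network(global_id id)
{
    (void)id;
    return NULL;
}

/* Return the local copy if present, otherwise try to fetch it. The caller
 * keeps showing whatever version it already has until this succeeds. */
static cached_obj *get_object(global_id id)
{
    for (cached_obj *c = cache_head; c != NULL; c = c->next)
        if (c->id.hi == id.hi && c->id.lo == id.lo)
            return c;
    cached_obj *fetched = fetch_from_network(id);
    if (fetched != NULL) {
        fetched->next = cache_head;      /* immutability makes caching forever safe */
        cache_head = fetched;
    }
    return fetched;
}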

This can definitely be built on top of our local object store. IMHO, keeping its model separate from the local programming model will help us get the best of both worlds.


Touch UI and Tablet devices for power and professional users

Is it only me who feels that the recent introduction of touch based UIs has dumbed down interfaces to a point where something as powerful as a touch interface isn’t attractive for pro applications? The idioms used for touch interaction on smartphones and tablets are built for very simple interactions, mostly for consumers. However, I think touch shouldn’t be limited to just that.

If we think of some professional activities before computing took over those spaces, we find that most of them depended on (and were enriched by) human-object interactions. For example, engineering drawing / drafting was done using rulers that professionals could pick up, place and manipulate, and pencil sketching gave very satisfying tactile feedback. While computers make it easier for professionals to work through these engineering tasks, the question really is: can a touch interface make it better? We might have to take a step back from the regular swipe / poke gestures and think a bit more generally about the possibilities of a high precision touch interface.

In fact, I would argue that we should take a look at professional computing setups from a fresh perspective given the advent of touch interfaces and multiple devices. Maybe we should bring the regular professional desk back, instead of trying to keep everything on a digital desktop spread across multiple vertical screens. Here is what I have in mind:

A professional desk is pretty much like a regular horizontal or slightly inclined desk. It has notepads, pens, clips, scissors, and then maybe a few tiny storage devices, tablets, keyboards, mice etc. Most professionals can keep up with some complexity, so trying to unify everything on one computing device or screen might be worse than letting different computing devices do what they are good at and interact with each other to let the user achieve what he wishes. Different computing devices would have different weights and dimensions for the user to move around or manipulate. They may have different apps suiting the user’s workflow, and some of them may be newer than others if the pro user wishes to integrate a better interface into his workflow.

Working would involve interacting via touch, or the keyboard / mice, with multiple devices. For example, the slider for a photo color saturation control could be on a touch device while the keyboard is used to quickly search for photos to touch. Or maybe the photo appears on a larger touch tablet at the center of the desk, allowing the user to select photos and zoom into / select a portion of a photo, while a smaller tablet on the side is used for the slider controls for color manipulation. Maybe this side device is not a tablet at all but really a set of sliders with a small LCD panel on top showing some key interface elements.

I don’t think the keyboard will go away in the above scenario, but it may take various forms. Tactile feedback is important and so is the speed of entering text using the keyboard.

However, it would take some serious platform software to pull this off. Different devices would need to work with each other. An app would need to spread its reach beyond just one device for the same user and the same invocation of the app. The platform should let the user manage apps and context across all the devices easily. The different devices might be running different versions of the platform software, so an app would be interacting with different versions of the platform software at the same time. The platform software on one device may have no idea what other devices in the personal network are capable of. Storage and compute resources are disaggregated, so app builders would use the platform for storage and compute heavy workloads, and the platform would need to manage that well.

However, there are quite a few pleasant upsides. The professional user wouldn’t feel trapped inside a single computer. He would not have to upgrade the whole computer to get a better experience, which probably means he is going to try out many interconnecting devices to find out what works better. He wouldn’t have to shut down his workflow to upgrade devices; he has a few of them, so he can swap in an upgraded one without interrupting his workflow. Upgrading local storage is easy, as is local compute capacity, as is upgrading interfaces.

P.S. I am not sure if USB or the personal area network (PAN) were supposed to address the scenarios outlined above. I do think that the above is different from one central computer controlling other peripheral devices. Having a central computer has disadvantages around upgradability to newer devices and workflow interruption on breakage / upgrade.



Getting to personal car like speeds with Public Transport

I had this idea sometime back while I was commuting to work by bus. I used to commute by bus daily but then broke the trend and started driving. Driving to work for me is substantially faster (20 minutes vs. 55 minutes) though the drawback is that, well, I need to drive instead of leisure reading which I enjoy doing on the bus.

Recently, one day I had to commute by bus again, and it got me comparing the two modes of transport. The bus takes a faster route to work than I do by car (because it can use the carpool lane on a highway which is very crowded otherwise). The main reason why the bus takes so much time is because it stops very frequently. The many stops add up fast. The same logic holds for trains as well.

OTOH, we kind of already know how to get around it. We have express routes. These are routes which have very few stops. The problem is that there are few of them.

I think there may be a way to solve this problem. The crux of the idea is to classify public commuters according to their destination once they have boarded. We also need to come up with a mechanism to not stop commuters at stops which are not their destination.

One way to achieve this is to re-imagine public transport as a subway or a rail based system with multiple speed tiers. Every speed tier has its own set of rails. There is the 0-tier at which stops are made. The rails for the 0-tier are closest to the platform to let passengers in/out. The subway compartments in higher tiers don’t stop at all. The different tiers, however, allow subway compartments to transfer between them. This may involve transferring a compartment between sets of rails.

Thus, a subway compartment may start in the 0-tier at a stop, but it will transfer to a higher tier once the compartment has started travelling and has achieved some speed. The transfer eventually will cause the compartment to be inserted in the middle of other compartments in a running train in the higher tier. Once the transfer is complete, the doors at the end of the compartment are opened and the passengers are requested to walk to the compartment representing their destination. Once the destination is about to arrive, the compartment for that destination transfers to the 0-tier (without stopping the rest of the compartments) and ultimately stops. It does so by first separating itself from the rest of the train and moving to the rails of the 0-tier. Once it is completely independent, it slows down and stops to let passengers in/out. This allows the compartments in the higher tiers to never stop, thus allowing passengers to get to their destination without having to stop anywhere.

The key piece is the transfer mechanism between tiers. It doesn’t look undoable, but it requires some clever engineering to get done. The other difficult piece is scheduling compartments between the different speed tiers. Other than that, the additional cost is in building the 0-tier rails at each stop. These 0-tier rails don’t need to extend very far (just enough to get the 0-tier compartments up to the speed of the next tier).

Doing this for road transport is tougher because buses tend to shake a lot on roads which will make it difficult for passengers to cross over to the destination compartments. Also, building this over rails will allow much higher speeds than those achievable on roads which would incentivize public transport.


Internet Browsers and Privacy

I came across some articles around the web on the rise of user tracking on the Internet and how some web companies achieve it. While I could see a whole bunch of addons for Firefox / Chrome around privacy, the arsenal of the web companies is growing as well. They now use very long term cookies, flash cookies, DOM storage and all kinds of tricks with Javascript to track users. For example, when one website (say a news website) embeds content from another website (say a social networking website) in an iframe, the social network iframe receives the user’s context information from the social website’s cookie in the same context as the news website’s page that the user is currently viewing. Essentially, the social network website gets to know which news pages the user is actually reading. Such embedding is fairly commonplace (the Facebook Like button is an example), and current browsers fail to give the user a choice about the extent to which they want to be tracked. Firefox sends a Do Not Track hint to websites, but it is only up to the sites to honor it.

To make matters worse, most of these websites are providing us with services we really want; we just don’t want them to track what else we do on the internet, or what we have been doing on the internet over long periods of time. So blocking cookies altogether is really not an option. What we really need is a way to isolate different web apps from one another, and from their own usage at large timescales (weeks / months). If we are able to achieve good isolation, we really don’t care how bonkers these websites go with cookies (at short timescales).

I tried to find a reasonable workaround for web tracking and have come to the following, which works reasonably well. It doesn’t use any special browser addons:

Most browsers support multiple user profiles. Besides flash cookies (which need to be dealt with separately), the profile carries the “cookie jar” and the persistent DOM storage, which contain all the tracking information. Thus, we should be able to create multiple profiles and run independent instances of the browser for most websites we visit. Very importantly, we can run multiple instances of the browser simultaneously, each with its own cookie jar and DOM storage, so that there is no chance of cross site cookie sniffing.

So, the idea is to keep a master copy of a profile with all your preferred browser settings and, using a script, copy it over to a temporary profile before the browser is launched (with the temporary profile). Once the browser is closed, the temporary profile is deleted. If multiple browser instances are launched, each is launched in its own temporary profile, so that you can have as many isolated independent instances as you need.

So far, I haven’t been able to figure out a good UI convention for the user to communicate this isolation. The best I have now is that a browser window is a unit of isolation and all tabs in the window are in the same temporary profile. The profile-based solution above doesn’t prevent you from opening multiple windows, though; I personally manage that using multiple virtual desktops. However, this is still an open problem, and once I have a solution I will probably go ahead and try implementing it for a browser. If you have a suggestion for a UI convention for browser session isolation, please leave a comment.

I implemented the above idea for Chrome (Chromium actually) and Firefox. Chrome supports profiles using a user-data directory which can be specified on the command line. Firefox supports command line options to create a new profile. Thus, my Chromium launching script in bash looks like:

# generate a random name for the temporary profile
tmpProfileName=`date | md5sum | cut -d " " -f 1`
# copy the master profile into the temporary one
cp -a <location of master profile> <prefix of temporary profiles>/$tmpProfileName
# run the browser against the temporary profile (blocks until the browser exits)
chromium --user-data-dir=<prefix of temporary profiles>/$tmpProfileName
# throw the temporary profile away
rm -r <prefix of temporary profiles>/$tmpProfileName

and my Firefox launching script looks like:

# generate a random name for the temporary profile
tmpProfileName=`date | md5sum | cut -d " " -f 1`
# register the temporary profile with firefox
firefox -no-remote -CreateProfile $tmpProfileName
# copy the master profile's contents into the temporary profile folder
cp -a $HOME/.mozilla/firefox/<master profile folder>/* $HOME/.mozilla/firefox/*.$tmpProfileName/
# run the browser against the temporary profile (blocks until the browser exits)
firefox -no-remote -P $tmpProfileName
# throw the temporary profile away
rm -r $HOME/.mozilla/firefox/*.$tmpProfileName

Opera should be similar to Chrome, as it also uses a user-data directory for profiles.

This will also help you log in with multiple accounts for the same website in different browser windows simultaneously. Essentially, it’s like having many simultaneously active but independent instances of the "incognito" or "private browsing" modes.

Note that, if you do need persistent cookies for some reason, you can always launch the browser without the above script. It will store the cookies and other data in the default profile (separate from the master profile which is used to copy over settings to temporary profiles). However, I don't think you will ever need that... if you do, just keep the browser window open indefinitely.

Update: The Windows cmd script to achieve the same with Firefox is below.

rem location of the Firefox profiles folder
set profileDir=C:\Users\<username>\AppData\Roaming\Mozilla\Firefox\Profiles\
rem random id for the temporary profile
set profileId=%random%
rem register the temporary profile with Firefox
"C:\Program Files (x86)\Mozilla Firefox\firefox.exe" -no-remote -createprofile %profileId%
rem locate the master profile folder and the freshly created temporary profile folder
for /D %%i in (%profileDir%\*.master) do set masterProfileDir=%%i
for /D %%i in (%profileDir%\*.%profileId%) do set newProfileDir=%%i
rem copy the master profile's contents into the temporary profile
xcopy /Q /E /Y %masterProfileDir% %newProfileDir%
rem run Firefox against the temporary profile and wait for it to exit
start /wait "Starting Firefox..." "C:\Program Files (x86)\Mozilla Firefox\firefox.exe" -no-remote -P %profileId%
rem throw the temporary profile away
rmdir /S /Q %newProfileDir%

The above includes some rather ugly hacks to determine the profile dirs... but it works. Please use the correct path to firefox.exe in the above script. The location of firefox.exe can be determined by (Right Click on Firefox icon)->Properties.


Multi-log structured storage layer

I recently had to build an SSD garbage collector, which proved to be an interesting and fun exercise. SSDs, or Solid State Drives, are storage devices that promise low latency operations compared to mechanical hard drives, which are prone to seek / rotational latency. While reads from SSDs are low latency, writes are a different story. Flash media (which is the basis of most SSDs) requires “erasing” a block (typically 128K) of flash before anything can be written to it, and block erasure takes a lot of time (compared to actually reading / writing). Most SSDs have firmware which handles pre-erasing a bunch of blocks so that write latency can be minimized. Also, given that blocks cannot be partially written, the valid content on partially-garbage blocks needs to be re-written elsewhere to make the free space on those blocks available for newer content. SSD firmware is, hence, also responsible for this “garbage collection” process. A good starting point for further reading on SSD garbage collection is Wikipedia: Write Amplification.

Note that garbage collection can be a processor heavy operation given the data movement involved. Recently, I designed and implemented a multi-log structured storage layer which I ran on top of an SSD to relieve it of the garbage collection task (which it was performing poorly). However, such a multi-log storage layer is very versatile and can have a number of other applications.

A multi-log storage layer is essentially a storage layer in which content is laid out in multiple contiguous logs. While I kept the metadata for the content on the SSD in RAM, it is possible to tweak the implementation to keep the metadata in the logs as well. The primary reason to divide the content into multiple logs is to keep garbage collection efficient (not re-writing too much data) when the storage space is close to full. However, similar principles could also be used for garbage collection in RAM (in language runtimes with garbage collection, such as Java). A multi-log storage layout can also be used over hard drives to get good sequential write performance, though the reads would need to be supported by a layer of caching to avoid too many seeks.

Multi-log Storage Layer Design

First, the storage layer (whether backed by a file, a device, or even RAM) needs to be divided into multiple fixed size contiguous logs. These logs should be large enough that IO writes of that size can be done at near peak write bandwidth. At the same time, the log size should be small enough that we see a _large_ variance in the amount of garbage across logs; this variance is what allows us to select the logs with more garbage for garbage collection. It is advisable to use the smallest log size at which we can achieve near peak read/write bandwidth. The storage layer mostly writes full logs at any time. If the durability requirements of the data are relaxed (such as in a cache), then one can do away with forced flushes and flush logs _only_ when a full log’s worth of content is available. Reads may require random access across logs, but that should be okay because random access has no cost on SSD and we shouldn’t be iop bound because of the sizing of the log.

The Write Queues

An incoming write is kept in a RAM write queue. If the RAM write queue is built from blocks of RAM of the same log size, it allows us to read the write-queue RAM logs in the same way as the regular logs when reading content. Otherwise, reading content from the write queue would need to be coded separately.

It should be possible to enable concurrent writes into the write queue by keeping the allocation of content blocks separate from the actual write (memory copy). The allocation of content blocks requires synchronized access to (only) a current_RAM_log and a current_RAM_log_offset. Once the allocation has been performed, multiple writes across multiple RAM logs at different offsets can happen concurrently. Once a write completes, it updates a bytes_committed variable present in every RAM log (under a lock or using CAS). Once bytes_committed equals the log size, the RAM log is ready to be committed to media.
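
Here is a hedged C sketch of that allocation path: only the space reservation happens under the lock, the copy happens outside it, and bytes_committed is bumped atomically afterwards. For simplicity this sketch starts a fresh RAM log when a block doesn’t fit, instead of splitting it across two logs as the block handles below allow; all the names are illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LOG_SIZE (1u << 20)                  /* assumed 1 MiB logs */

typedef struct {
    uint8_t    *mem;                         /* LOG_SIZE bytes of RAM */
    atomic_uint bytes_committed;             /* == LOG_SIZE -> ready to commit to media */
} ram_log;

static pthread_mutex_t alloc_mu = PTHREAD_MUTEX_INITIALIZER;
static ram_log *current_RAM_log;
static uint32_t current_RAM_log_offset = LOG_SIZE;   /* forces a fresh log on the first write */

/* Stand-in: real code would hand the old log to the to-be-committed list and
 * pull a pre-allocated empty one; here we just make a new empty log. */
static ram_log *switch_to_fresh_ram_log(void)
{
    ram_log *l = malloc(sizeof *l);
    l->mem = malloc(LOG_SIZE);
    atomic_init(&l->bytes_committed, 0);
    return l;
}

/* Reserve len bytes under the lock, copy outside it, then account for the copy. */
static void write_block(const void *buf, uint32_t len)
{
    pthread_mutex_lock(&alloc_mu);
    if (current_RAM_log_offset + len > LOG_SIZE) {
        current_RAM_log = switch_to_fresh_ram_log();
        current_RAM_log_offset = 0;
    }
    ram_log *log = current_RAM_log;
    uint32_t off = current_RAM_log_offset;
    current_RAM_log_offset += len;
    pthread_mutex_unlock(&alloc_mu);

    memcpy(log->mem + off, buf, len);                 /* copies can run concurrently */
    atomic_fetch_add(&log->bytes_committed, len);     /* commit the log once this reaches LOG_SIZE */
}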

A (synchronized) list of to-be-committed RAM logs is maintained, and a task is spawned as soon as there are to-be-committed RAM logs, to commit them to media. Care should be taken to not have more than one log write outstanding on any device at any point in time. This avoids randomizing the device firmware with multiple simultaneous random writes.

Content Block handles

A content block allocation also involves creating a block handle which is given back to the higher software layers. This block handle is used by the higher layers to read the block (possibly multiple times) and then delete it. Writes to the same block shouldn’t be supported, as this would randomize the media firmware and break the one-log-sized-write-at-a-time rule. Instead, writes should always go through new block creation.

A block handle encapsulates an ordered pair of (log, offset [into the log], size [after the offset]) tuples. In most cases, only the first tuple is needed for an allocation request; when the current_RAM_log in the write queue doesn’t have enough space for the allocation request, the second tuple denotes the rest of the allocation. This also means that an allocation request cannot be larger than the log size, though an aggregate data structure on top of this tuple pair could be built to address that. The tuple pair also comes in handy during garbage collection, when the content block is re-written to the media. Using a tuple pair puts a cap on the maximum size of data movement at any given point during garbage collection. Thus, large blocks (built using an aggregate data structure at a higher layer) are moved partially upon garbage collection, improving garbage collection efficiency. The other advantage is the simplicity of using just a tuple pair, avoiding code complexity and the allocation/manipulation of list data structures.
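
In C terms, the handle could look like the following sketch: a pair of extents, the second used only when the block wraps into a second log; log_read() is a stand-in for the actual media / RAM read path, and all names are mine.

#include <stdint.h>

struct log;                              /* a RAM or SSD log */

/* Stand-in for the actual read path from a RAM mapping or the SSD. */
extern void log_read(const struct log *l, uint32_t off, void *dst, uint32_t len);

typedef struct {
    struct log *log;
    uint32_t    offset;
    uint32_t    size;                    /* size == 0 means this extent is unused */
} extent;

typedef struct {
    extent first;
    extent second;                       /* only used when the block spans two logs */
} block_handle;

/* Read the whole block into dst, which must hold first.size + second.size bytes. */
static void block_read(const block_handle *h, void *dst)
{
    log_read(h->first.log, h->first.offset, dst, h->first.size);
    if (h->second.size != 0)
        log_read(h->second.log, h->second.offset,
                 (uint8_t *)dst + h->first.size, h->second.size);
}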

Content Block Deletion – Garbage Accounting/Collection

When a content block is deleted, a (protected) garbage_size variable on the log is updated to reflect the new, larger garbage size. Note that a content block deletion can cause up to two garbage size updates, on the up to two logs its handle points to. A max-heap (ordered w.r.t. garbage size) of the logs is maintained and updated upon block deletion. The top of the heap (max garbage size) is the most eligible log for garbage collection. Note that this won’t give us SSD wear-leveling, but 1) we can depend on the device firmware to do that, and 2) it can be done at the max-heap by designing a suitable metric combining the number of writes with the garbage size.
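
A small C sketch of that accounting, assuming the heap is an array of log pointers and each log remembers its heap index; since garbage only grows on deletion, a sift-up is enough. The names are illustrative.

#include <stdint.h>

typedef struct log {
    uint64_t garbage_size;
    uint32_t heap_index;                 /* this log's position in the heap array */
} log_t;

#define MAX_LOGS 4096
static log_t *heap[MAX_LOGS];            /* max-heap of logs, keyed on garbage_size */

static void swap_heap(uint32_t a, uint32_t b)
{
    log_t *t = heap[a]; heap[a] = heap[b]; heap[b] = t;
    heap[a]->heap_index = a;
    heap[b]->heap_index = b;
}

/* Garbage only grows on delete, so the log can only move toward the root. */
static void sift_up(uint32_t i)
{
    while (i > 0 && heap[(i - 1) / 2]->garbage_size < heap[i]->garbage_size) {
        swap_heap(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

/* Called (under the storage-layer mutex) for each extent of a deleted block. */
static void account_garbage(log_t *l, uint64_t freed_bytes)
{
    l->garbage_size += freed_bytes;
    sift_up(l->heap_index);
}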

Garbage collection is triggered whenever a (small) reserved pool of empty SSD logs falls below its threshold. The RAM logs flush their content into this pool of SSD logs, which is what triggers garbage collection. The garbage collection task picks a log from the max-heap outlined above and starts re-writing the valid blocks present in it back to the storage layer. It can use the same read and write APIs which clients use to access the storage layer. Once all the valid blocks in a log have been re-written, the log is given back to the reserved pool of empty SSD logs.

Note that client writes depend on space in the RAM-log-based write queue, which in turn depends on space in the reserved pool of empty SSD logs, which in turn depends on garbage collection. If garbage collection depends on the same write APIs as the client, then it completes a circular dependency back to the RAM-log-based write queue. To prevent deadlock, the write API reserves a few RAM logs for GC-induced writes and doesn’t use those logs for client writes. This breaks the cycle.

Data structures and Locking

Here’s an outline of the data structures described above:

Storage Layer:

    • Storage Layer mutex (protects all the data structures below in the Storage Layer)
    • Write queue of RAM logs, current_RAM_log, current_RAM_log_offset
    • To-be-committed list of RAM logs
    • Reserved pool of SSD logs
    • Max heap of logs w.r.t garbage size

Log:

    • Log RW lock
    • Unordered Set of valid content blocks
    • Garbage Size
    • (For RAM logs only)
      • Pointer to memory
      • bytes_committed (used in write API for supporting concurrent writes)
    • (For SSD logs only)
      • SSD device / file
      • SSD offset into device / file

Content Block:

    • pair of (log, offset, size) tuples

The Storage Layer mutex is taken to protect the various free lists, the write queues, and the allocation log and offset. The log’s RW lock protects the log’s metadata (garbage size, set of content blocks). Note that the actual content doesn’t need any locking, as once committed it is immutable. During a read, we read-lock the locks on the pair of logs for the content block. These read locks protect against garbage collecting the logs (which would take write locks on them to change their garbage size and/or set of valid content blocks). Note that the locks taken while reading should honor a lock ordering to avoid deadlocks; a simple scheme is to use the log data structure’s address as the lock order. Technically, we could allow for more concurrency by allowing append modifications to the set of valid content blocks in RAM logs, or by allowing removal of content blocks and increase of garbage size when the log is not selected for garbage collection. However, the above locking scheme works well in practice and I could see no surprising bottlenecks caused by lock contention.
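
As an illustration of the read-path locking and the address-based lock ordering rule, here is a small C sketch using pthread rwlocks; the structure contents and names are placeholders.

#include <pthread.h>
#include <stddef.h>

typedef struct log {
    pthread_rwlock_t rw;
    /* garbage size, set of valid content blocks, media location ... */
} log_t;

/* Read-lock both logs of a content block, in address order, so that readers
 * taking two locks cannot deadlock against queued writers (the GC). */
static void lock_pair_for_read(log_t *a, log_t *b)
{
    if (b == NULL || b == a) {            /* block lives entirely in one log */
        pthread_rwlock_rdlock(&a->rw);
        return;
    }
    log_t *lo = (a < b) ? a : b;          /* lock order = address order */
    log_t *hi = (a < b) ? b : a;
    pthread_rwlock_rdlock(&lo->rw);
    pthread_rwlock_rdlock(&hi->rw);
}

static void unlock_pair(log_t *a, log_t *b)
{
    pthread_rwlock_unlock(&a->rw);
    if (b != NULL && b != a)
        pthread_rwlock_unlock(&b->rw);
}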


Threadpool/Async vs Multithreading

Threadpools and async IO are all the rage these days for writing highly concurrent servers. As far as I can remember, this is a relatively new development; a few years ago, highly concurrent servers were written using async IO (without threadpools… there weren’t so many cores then :) ) or using multi-threading. What changed? Are servers designed using async / threadpools better than multi-threaded ones? Why?

Multi-threading involves using OS threads to schedule a bunch of concurrent tasks. The OS saves and restores the entire processor state when a processor core switches the currently executing thread. Processor state can be large for 64 bit processors (16 64-bit general purpose registers and 16 128-bit SSE registers = 16 * (8 + 16) bytes = 384 bytes, without counting the segment, debug and control registers). There is also the issue of switching to a new stack, which may be at a different location in memory and may be cache cold. So if there are threads which context switch very frequently (say every few instructions), then the context switch overhead can be disproportionately high.

Async / threadpool designs help in this scenario by keeping the OS thread on a core the same (so the top of the stack stays hot in the cache) and just calling into different tasks (essentially userspace functions or methods) enqueued in a queue. Switching between tasks doesn’t require saving any processor state, because each function saves only as much register state as it uses as it moves through its code path. This favors small tasks, because little processor state is saved/restored and the top of the stack always remains hot.

However, async / threadpools have a significant drawback. If the programmer fails to split the work evenly in terms of resource consumption (some tasks taking a disproportionately large number of CPU cycles), then the async / threadpool system adds significant latency to small tasks queued up behind large ones. This problem doesn’t arise with threading, because long running OS threads are preempted by the kernel to give CPU time fairly to other threads in the system.

The situation can be ameliorated by making the number of threads in the threadpool somewhat larger than the number of cores. This allows some small tasks to be queued (and subsequently executed with a minor hit in latency) on threads which don’t have large tasks. However, this fails to solve the problem completely. You can never be sure how many threads you should have in the threadpool, simply because in most cases you don’t know how many long running tasks you will have in the system. Furthermore, if you have too many threads in the threadpool, then you will end up context switching between them needlessly.

A better solution is to have the programmer break long running tasks into smaller tasks and schedule each subsequent task at the completion of the preceding one. This allows for some amount of fairness in the task queue. However, this amounts to asking the programmer to solve the same problem for every task he creates, instead of having a single proven-to-work solution, such as threading, take care of the problem.

Is there a good solution to this problem? How about the following:
We bring preemptive scheduling up from the kernel into userland. Essentially, the threadpool has a timer on each of its threads / cores, and if the timer fires before the current task is finished, then the task is suspended (which requires saving processor state) and is enqueued at the end of the thread’s task queue. Windows supports Get/SetThreadContext() APIs which can help with this without creating new threads. Otherwise, one can always create new threads to resume the queued up tasks while keeping the current thread (the one with the long task) suspended. The threads given up after preempted tasks complete can be reused to satisfy further thread creations.
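
As a rough illustration of the second option, here is a simplified POSIX sketch (my own, with invented names): a watchdog notices that the single worker has been inside one task longer than its slice and spawns a helper thread to drain the queue, rather than actually suspending the long task. Real code would track per-worker state, cap the pool size and reuse idle helpers; the enqueue side is omitted.

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define SLICE_MS 20                         /* assumed time slice */

typedef struct task {                       /* tasks are malloc'd by the (omitted) enqueue side */
    void (*fn)(void *);
    void *arg;
    struct task *next;
} task;

static pthread_mutex_t q_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv = PTHREAD_COND_INITIALIZER;
static task *q_head;
static atomic_long current_task_start_ms;   /* 0 = the worker is idle */

static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_mu);
        while (q_head == NULL)
            pthread_cond_wait(&q_cv, &q_mu);
        task *t = q_head;
        q_head = t->next;
        pthread_mutex_unlock(&q_mu);

        atomic_store(&current_task_start_ms, now_ms());
        t->fn(t->arg);                      /* possibly a long-running task */
        atomic_store(&current_task_start_ms, 0);
        free(t);
    }
    return NULL;
}

/* Watchdog: if the (single, for simplicity) worker has been inside one task
 * longer than SLICE_MS and work is still queued, add a helper worker. */
static void *watchdog(void *unused)
{
    (void)unused;
    for (;;) {
        usleep(SLICE_MS * 1000);
        long started = atomic_load(&current_task_start_ms);
        pthread_mutex_lock(&q_mu);
        int backlog = (q_head != NULL);
        pthread_mutex_unlock(&q_mu);
        if (started != 0 && now_ms() - started > SLICE_MS && backlog) {
            pthread_t helper;
            pthread_create(&helper, NULL, worker, NULL);
            pthread_detach(helper);
        }
    }
    return NULL;
}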

The above achieves the best of both worlds. It allows small task switching to be done efficiently in userspace, and at the same time allows long running tasks to be handled preemptively, thus reducing the latency issues with async / threadpool systems.


Cloud computing APIs and Timeouts

Cloud computing APIs, like other system or application APIs, have their success and failure modes. However, cloud computing APIs have another failure mode which most system or application APIs don’t have: timeouts. One problem with timeouts is that they take time :), which implies that whichever higher level operation is being performed, it generally has much higher latency before a response can be given to the user (which most probably is going to be an error).

There is another problem with timeouts. Timeouts are required in distributed systems because of the impossibility of guaranteed consensus [ http://en.wikipedia.org/wiki/Consensus_(computer_science) ], known as the FLP impossibility result. Cloud computing APIs require data consistency (consensus on data) all the time, but they cannot fundamentally achieve it, so they shift the problem to consensus in mutual time. A conservative timeout is an engineering approximation of a consensus that the operation has completed with an error.

Now, computer clocks rarely run at exactly the same rate, and the maximum resulting difference between their readings is called clock skew. This clock skew error must be added to the timeout of every cloud computing API one is using, so that we are really sure that the cloud computing operation beneath the API has completed with an error.

Normally, this is not a big problem. However, with the proliferation of cloud APIs, and the increased composition and layering of these APIs, the clock skew error gets added to the timeouts at every layer: if a bottom-level operation times out at, say, 30 seconds and each of a handful of layers above it adds its own skew margin plus some slack, the end user can easily end up waiting a minute or more just to be told the operation failed. This results in highly inflated timeout values at the end user.

So what can we do? IMHO, the first thing is to reduce timeouts at their origins. Most timeouts originate with heartbeats between components, and it is important to have shorter heartbeat timeouts. Heartbeats cannot be made very frequent, because they use up network iops, but their frequency is definitely something that should be tuned. Also, the timeout of an operation is often the maximum of all the timeouts we could experience in its sub-operations; thus, sub-operations should be chosen such that their maximum timeout is lower than the alternatives.
