Case against the filesystem

I think as we are seeing computing trends change with multi-devices, personal storage, backup and professional workflows, filesystems seem to be one odd technology. May be because of that, most tablets and smartphones do away with the filesystem idiom completely and allow the app experience to come through apps directly. It is known well that the filesystem is one of the most intimidating concepts for a beginner computer user to grasp and that it takes them a while to figure out the nuances around files (as pure data stripped out of any functionality), folders, file permissions and locality of that data. Given that, I think it is good thing that iOS and Android platforms on recent smartphone and tablets did away with these concepts.

However, there was something good about the filesystem that was lost too and I think this is a loss for power users and a portion of professional users (who like dealing with data directly, or whose job required dealing with data). Consider the use case of a filesystem in Unix, where small unix tools would interact with each other using pipes and files to get some serious data processing done. Or, consider the use case of a professional photographer who used the filesystem to dump results from one photo manipulation program to load into another for the second part of his workflow. Notwithstanding business aspirations of cloud storage providers who would like to sell ‘access’ to data to applications, instead of just storage capacity… looks like something was lost here.

Storing data persistently has been seen as the key problem a filesystem solves. I would argue that filesystems (try to) solve another important problem which is very relevant to power users: getting multiple programs to work together. Data persistence is a pure side effect. With the recent explosion in storage technologies behind many applications, data persistence can now be thought of as orthogonal to filesystems. For instance, spotify playlists our outlook email / calendaring obviously require storing data persistently, but we don’t interact with the filesystem directly to access that. The fact that outlook accesses the local filesystem and that spotify targets cloud storage are inconsequential from the above interpretation of the filesystem and were done purely to enable data persistence easily.

I think that the core use case of the filesystem is still very relevant today, esp. for power users: A technology to store data in application independent formats to exchange data between applications, and store it persistently while you are at it. Unix used text as the application independent format and it helped unix folks literally pull off insane looking tricks using a small set of tools like sed, awk, cut, paste, less etc. However, when Unix was created, almost all applications where text based, so it was easy to just standardize on text as the independent format. Nowadays, photos, video, audio, multi-lingual text and vector graphics dominate applications. Its hard to use the Unix filesystem model and its set of tools to do similar multi-program activities with the new forms of digital content. At the same time, the task is further confused because the filesystems expose complex constructs such as folders, symlinks and access restrictions which have nothing to do with the basic interaction between programs.

We should change this sad state of affairs… esp. for power users, and esp. for newer media types than just pure text. At the same time, we should try to fix only what’s broken leaving the folder (high level data organization) and access permissions as orthogonal problems. May be in the longer run, it might be a good idea keeping the folder and access permissions problems separated out anyway so that they could be taken further independently.

Hopefully the _problem_ being attacked is well conveyed by now, so let’s take a look at the proposed solution. I will take us through 4 takes on it, each one building up on the previous one to help pro users do better with the technology.

Bigger problems require (somewhat) complex solutions

The bigger problem here is really more complex applications which deal with various forms of digital data. The data may not be only text, but may have semantic meaning to it (for e.g. an email message would have Sender, Subject, Message… sections which apps need to interpret more than as pure text). The interactions between our mini programs may not be just sending text streams to and fro. They may be synchronized interactions between applications… may be the program needs an update when some data changes.
The first step is to think of filesystem data as not residing in binary files of arbitrary length, but think of them as persistent objects. These objects are usually very small in size as they describe something very basic (like Subject of an email, the image data of a photo, or one of the tags in our photo). These objects often link to other objects to form complex object structures. They are locally resident on the computer and provide very high performance access (much like unix filesystems) to the data to the programs accessing it. Very importantly, these objects can be inspected in human readable terms by standard tools. The object formats used are standard across a wide range of tools allowing the user to build complex workflows using tools.

While this might look like a tall order, it is not that hard to implement on a modern Linux system. We will use the filesystem to help us get data persistence along with using folders for high level organization and access control. Inside a folder we will have files with name <fileId>.<size> which implies that the file stores objects of size <size>. These files are of fixed lengths and for any given object size, once more objects are created, new files of exponentially increasing lengths are created. If objects are deleted then that space is marked as unallocated so that the next allocation for that object size can be done there.  Using fixed object size per file makes such allocations and de-allocations easy. The underlying filesystem takes care of larger and more variable block size allocations. Note that it is quite possible for us to accumulate a lot of garbage here, but hold on to that thought for a while.
Programs allocate objects in these files. These files are mmapped in the program’s address space so that they have very high performance access to the data. Different programs interact essentially using shared memory using synchronization primitives exposed as low level objects on those same files. Note that links between objects cannot directly be raw pointers because there is no guarantee that the files will be mapped at the same address in all invocations of the process. Thus, the link is broken into two: a fileId and an offset. Every process maintains a map from fileId to starting address to which the offset is added to get to the actual location of the object in the process’ address space.

This alone would take us very far. If the tools really agree on standard object structures then we could be looking at a very rich toolset suitable for a wide variety of workflows with present day structured text and digital media.

Adding Safety and Garbage Collection via types

The above scheme works well if our programs which allocate objects don’t leave a lot of garbage in the object store. If they do, we have no way of knowing how to compact those files because we don’t have a standardized way of knowing, across all objects, where inside the object are links to other objects embedded (to correct them upon compaction).

The other problem is type safety. Different programs are going to interact with the same object store. So, having a mechanism which helps them not overstep on each other private object structures would be good. Unix allowed overwriting files without regards to safety. Here, we will try to atleast match the object type description embedded in the program to the type information in the object store before we interchange data.

A simple typing scheme can be built above the typeless scheme described above. Essentially, we have a set of singleton type description objects in the object runtime and we include a link to the type description as the first member in every object stored in the object store. Accesses to objects are now checked against the data structure type information present in the program.

Adding types also means that we now know where the links are embedded inside objects. So, its definitely possible to do offline compaction. We can also have a set of generic tools which can pretty print / inspect the actual data present in an object store which could be useful.

True language independence

While most primitive types are represented in standard ways across all languages (integers, floats, links) some types are different. For eg. boolean may be represented differently in different languages. Struct layouts and array layouts could also be different. This creates a problem because we want to use mmapped shared memory across different processes (with different languages) to interact with this data. We could embed our own types in different languages which conform to these basic object types. However, it would be much better if we could reuse the existing bool, struct and array types in those languages directly for ease of programming.

Now this makes things hard. We can no longer do shared memory accesses for these objects. There would need to be a language independent representation of the objects for storing objects in the object store. Language implementations would make a _copy_ of the object in their own representation. However, now these copies can get out of sync with each other as different processes (in different languages) update and read this data. Infact, we would have to build a set of coherence rules for the various copies and how they interact with locking so that we could guarantee that objects would be coherent across programs once locks were acquired / released. This is not very different compared to the challenges seen with shared memory processing with locks, where after locks are released, the CPU cached copies of memory would need to be in sync. Its typically enforced by memory barriers (invoked inside lock acquire / release) which essentially invalidates local caches and provides a way to order reads and writes. This could definitely be done for our object runtime as well. Lock acquire and release cause updates of the object store from their local copies and invalidation of shared copies. Coherence would be undefined (or unusable) if locks are not used (similar to shared memory).

IMHO, such memory or cache barriers are very heavy. Upon a lock acquire or release, the runtime has no idea which of the shared objects actually need to be cache invalidated; so it invalidates all shared objects. With shared memory, the CPUs have come up with all kinds of tricks (MOESI cache coherence for instance) where CPUs track the state of shared cache lines and know when the different copies are out of sync. The algorithm there essentially tries to identify which cache lines are not really shared between different CPUs and avoids invalidating them upon memory barriers. However, if anything is shared, it will be invalidated upon memory barriers.
In software land, there may be better ways of doing data synchronization. One interesting way is Hoare’s CSP or Rob Pike’s channels / goroutines way of handling synchronization. In this mechanism, instead of acquiring locks, data is passed around different threads using blocking send / recv primitives. Shared access to some data between different thread is left undefined unless such sharing was co-ordinated with sends / recvs. This makes managing coherency of shared data much easier as now you have data context around synchronization events (blocking sends / recvs). Thus, no data other than what was sent using send() and received using recv() needs to be made coherent, making coherency much easier. The cleanliness of this abstraction is appealing enough to cause us to think why this is not done for CPU cache coherency / synchronization. But let’s not go there today :).

So, if we wanted to do true language independent object sharing, then we would need to do copies and we would need to employ an synchronization mechanism which can handle coherence between copies. Not surprisingly this is expected to add a lot of complexity to the object runtime.

Unix synchronization primitives at the shell or command line level never dabbled in shared memory directly. They were based on synchronizing over send / recv over pipes and files. So in a way, the best practices in Unix around composing programs together lean towards the send / recv model.

Distributed Objects

Its hard to neglect sharing data / synchronizing across networks. Some network filesystem (NFS, CIFS) are good in terms of correctness and performance, but in general their behavior around network failures is not good. Recently are we seeing ‘syncing’ services as the alternative and more failure proof way of handling distributed data. However, syncing services lack any synchronization primitive between program execution on the different nodes. So they are only useful with regular filesystem interacting apps when the user manually maintains that only one app is using the data being synced.

The question is, could we take our shared memory or send / recv based object sharing to distributed systems? Probably yes, but it is not expected to work very well and will have the same shortcomings as network filesystems. Synchronization latency is low when done locally on a single computer (compared to the latency of storing data for eg.). However, synchronization latency is high on local area networks and even higher and unpredictable on the Internet. When programming locally using shared memory or send / recv, we never have to think about failures in lock acquisition, or failures in send / recv. We will have to consider those facts with distributed objects which makes the programming model cumbersome. I think, more importantly, there needs to be a larger model shift with distributed objects because it is not possible to guarantee synchronization in a timely manner in distributed systems. We will need to give up synchronization or atleast make its role less critical when dealing with distributed systems. Let me motivate this via an example.

Let’s say I have an email program. In fashionable style, I have constructed a view script which inspects the object store to find a list of emails and emits a bunch of vector graphic descriptions in the object store. A separate ‘fetch email’ program regularly updates the list of emails which causes my view script to update the graphic descriptions, which I use another program to display on my screen or embed somewhere else. Locally, given that I can do synchronization in a timely manner, I _depend_ on it. I _expect_ that a fetch of email should update my display (there is no other way of knowing whether I really fetched something). However, my mental model with an email program changes when I am fetching email remotely. I _don’t expect_ that my local display will change in a timely manner if I have email remotely available on some email server.
I believe we can use this expectation of ours to keep the local programming model simple (without distributed failure assumptions). The question really becomes, what should we adopt as a good means of sharing / synchronizing data in distributed systems?

With distributed systems, I think we should allow for multiple versions of the same object structure to be present in different nodes. The programming model should encourage working with non-coherent copies of data and use synchronization lazily to arrive at results. The programming model should encourage parallel data fetches with synchronization to resolve the fetched data instead of synchronizing on which data needs to be fetched first and then fetching it (and then fetching the next etc.). Synchronization in distributed systems is done using a combination of polling / pub-sub which will be riddled with delays and failures. The idea is to extract out most of the programming model away from synchronization so that we don’t have to deal with failures there.

I think the runtime to do this would involve immutable objects with global ids and a mechanism to automatically fetch those objects from other nodes in the system. Synchronization primitives use polling / pub-sub underneath to find out which is the most recent version of data, but don’t expose an update locally unless all the related objects with that pub-sub event have also been fetched without failures. The use of immutable objects helps us keep track of updates to large and complex object structures by just adding a few objects to it, not to mention allowing aggressive caching of those objects across the network. This hides network errors behind the runtime, but at the same time gives us a mechanism of doing distributed data synchronization.

This can definitely be built on top of our local object store. IMHO, keeping its model separate from the local programming model will help us get the best of both worlds.

This entry was posted in Computing, Internet. Bookmark the permalink.

2 Responses to Case against the filesystem

  1. amazing post you got here, thanks for making it available!

  2. Hi Camilo. The plugin has not been actively maintained, I’m afraid, due to lack of time. It was clearly never tested with scheduled posts. Click

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s