I came across the GLScube project on “structured semantic storage”, which I found very interesting. GLS is a storage system for Linux that does away with traditional filesystem concepts and instead organizes all data with tags and hierarchically organized “virtual collections” (tag-based search predicates). While this is a great idea, one should not get carried away by buzzwords and throw out established filesystems.
It is easy to see how this can be “emulated” on an ordinary filesystem: keep a “tags” folder somewhere, with one directory per tag, and hard link the “tagged” files into those directories. Exposing virtual collections as system directories might require some kernel support (such as FUSE on Linux), but a virtual collection is essentially just a node in a hierarchy of tag-based search predicates. That hierarchy could be implemented in user space with a simple text file that stores the names of the virtual collections and their tag-based predicates in some format. A tool (command-line or graphical) could then show these virtual collections and list the matching files from the respective tag directories.
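To make the emulation concrete, here is a minimal sketch in Python. Everything in it is my own assumption rather than anything GLS does: the function names, the one-directory-per-tag layout, and a text format of `name: tag1 tag2` per line, where a collection matches files carrying all of the listed tags. Files are compared by device and inode number, so a file hard linked under several tags is recognized as the same file.

```python
import os
from pathlib import Path


def tag_file(tags_root: Path, path: Path, tag: str) -> Path:
    """Hard link `path` into tags_root/<tag>/ and return the link."""
    tag_dir = tags_root / tag
    tag_dir.mkdir(parents=True, exist_ok=True)
    link = tag_dir / path.name
    if link.exists():
        # Same basename already tagged: fine if it is the same file,
        # otherwise this simple layout cannot represent both.
        if not os.path.samefile(link, path):
            raise FileExistsError(f"{link} is a different file")
        return link
    os.link(path, link)  # new directory entry, same inode, no data copied
    return link


def files_with_tag(tags_root: Path, tag: str) -> dict:
    """Map (device, inode) -> path for every file linked under `tag`."""
    tag_dir = tags_root / tag
    if not tag_dir.is_dir():
        return {}
    found = {}
    for entry in tag_dir.iterdir():
        st = entry.stat()
        found[(st.st_dev, st.st_ino)] = entry
    return found


def load_collections(text: str) -> dict:
    """Parse 'name: tag1 tag2' lines (hypothetical format, AND of tags)."""
    collections = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, tags = line.partition(":")
        collections[name.strip()] = tags.split()
    return collections


def resolve(tags_root: Path, tags: list) -> list:
    """Return the sorted names of files that carry every tag in `tags`."""
    if not tags:
        return []
    per_tag = [files_with_tag(tags_root, t) for t in tags]
    common = set(per_tag[0]).intersection(*per_tag[1:])
    return sorted(per_tag[0][key].name for key in common)
```

With a collections file containing `work-2008: work 2008`, calling `resolve(tags_root, load_collections(text)["work-2008"])` lists the files tagged with both “work” and “2008”. A hierarchy could be expressed by allowing slashes in collection names; predicates richer than a plain AND would need a small parser, which I have left out.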
It should be noted that keeping the files in their respective tag directories may also help cluster files with the same tag around the same portion of the disk for faster access. GLS might be able to pull off more advanced clustering schemes that exploit the overlap of files between tags. Note, however, that a filesystem with hard links has all the information needed to do the same kind of clustering.
Content-based search is a totally different beast. Spotlight (Mac OS X) and Beagle (Linux) use kernel-driven filesystem notifications to find recently updated files and “crawl” over them, updating a database with metadata for later search and retrieval. There are several engineering issues here: keeping the “crawl” strictly a background operation, and keeping prefix-based search on the database fast (for search-as-you-type applications).
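As an illustration of the prefix-search part only, here is a small Python sketch. It is not how Spotlight or Beagle work internally; the `PrefixIndex` class and its methods are names I made up. It keeps the indexed terms in a sorted list, so all terms sharing a prefix sit next to each other and a binary search finds the first one.

```python
import bisect


class PrefixIndex:
    """Sorted in-memory term list supporting fast prefix lookups."""

    def __init__(self, terms):
        # Lowercase and deduplicate so lookups are case-insensitive.
        self._terms = sorted({t.lower() for t in terms})

    def search(self, prefix, limit=10):
        """Return up to `limit` terms that start with `prefix`, in order."""
        prefix = prefix.lower()
        # First position whose term is >= prefix; matches follow contiguously.
        start = bisect.bisect_left(self._terms, prefix)
        matches = []
        for term in self._terms[start:start + limit]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        return matches
```

Each keystroke then costs one binary search plus at most `limit` comparisons, which is what makes search-as-you-type feasible. A real index would live on disk and be updated incrementally by the crawler; a sorted list like this one is expensive to update, so treat it purely as a demonstration of the lookup.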
Personally, I find the UNIX hierarchical filesystem perfect. Tagging is like restricting a perfectly good idea. Perhaps the idea of preserving the user’s freedom to determine the filesystem layout ought to be given more serious thought.
A content-based search engine over the UNIX filesystem should be more than adequate for the needs of anyone from an average to a power computer user.
An interesting dimension to the storage problem is “typing” the document (with, say, XML Schemas). Besides helping with “crawling” over the content of the document, typing opens up another interesting possibility that I would like to mention. In fact, I had a post on the idea (XVM) some time ago.