Friday, March 30, 2007

Unification Filesystems for Disaster Recovery

For critical services that require high availability, there is nothing worse than a corrupted filesystem!
Filesystem corruption happens even with the most robust journaling filesystem.
Checking and repairing the filesystem with standard tools is not feasible in most cases: it takes hours, if not days, to finish, and data availability is critical for our business, so our disaster recovery plans never considered offline repair as an option. FreeBSD supports background fsck, which is very good, although our experiments showed an unacceptable performance penalty. We also experienced kernel panics when stressing the filesystem while background fsck was running, and the FreeBSD mailing list archives show that we are not alone in this. That was over six months ago. Hopefully those bugs are fixed by now, but neither FFS nor UFS2 satisfies our needs.

Checking filesystems on LVM snapshots is a very nice way to find out how badly a filesystem is corrupted, but LVM does not help in repairing a live filesystem (mounted r/w). It would be very useful, and relatively simple, to make modern filesystems, like XFS, cooperate with volume managers, like LVM, to support efficient background fsck for r/w mounted filesystems. I wish I had the time to work on that. Any kernel hackers out there?!
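As a rough sketch of what such a snapshot check looks like (the volume group "vg0", the logical volume "data", and the 2G snapshot size are made up for the example):

    # create a small copy-on-write snapshot of the live volume
    lvcreate --snapshot --size 2G --name data-snap /dev/vg0/data

    # run a read-only check against the snapshot; the live filesystem
    # keeps serving requests while we find out how bad the damage is
    fsck -n /dev/vg0/data-snap

    # drop the snapshot once we have the report
    lvremove -f /dev/vg0/data-snap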
Now, the most practical solution seems to be unification filesystems. Unification filesystems are simply fanout filesystems like unionfs, aufs, and union mounts. With minimal impact on performance, we can overlay a healthy filesystem on top of a read-only mounted, possibly corrupted, filesystem, and all writes will go to the healthy filesystem. Since unification filesystems perform snapshotting at the filesystem level (in contrast to the block device level), manipulating the snapshots is safer. Most unification filesystems support whiteouts for deleted files and copy-on-write for modified files.
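As a rough illustration with aufs (the branch paths and mount point below are invented for the example), the overlay looks roughly like this:

    # /mnt/healthy: an empty, known-good filesystem (read-write branch)
    # /mnt/damaged: the possibly corrupted filesystem, remounted read-only
    mount -o remount,ro /mnt/damaged
    mount -t aufs -o br=/mnt/healthy=rw:/mnt/damaged=ro none /srv/data

    # from now on, all writes land in /mnt/healthy; deleted files become
    # whiteouts, and modified files are copied up before being changed

The service keeps running against /srv/data while the merge described next happens in the background.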
A very simple shell script can merge the snapshots in the background, and it can easily control its pace to balance time-to-merge against the impact on performance.
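Something along these lines, for instance (a sketch only: the paths, the pacing interval, and the use of rsync are my assumptions, and whiteout handling is ignored entirely):

    #!/bin/sh
    # slowly merge the read-write overlay back into the lower filesystem,
    # pausing between files to limit the I/O impact on the live service
    UPPER=/mnt/healthy      # the rw branch of the union
    LOWER=/mnt/damaged      # the (repaired) lower filesystem
    PAUSE=0.2               # seconds to sleep between files

    cd "$UPPER" || exit 1
    find . -type f | while read -r f; do
        rsync -a --relative "$f" "$LOWER/"   # copy one file, keep its path
        sleep "$PAUSE"
    done

Raising or lowering PAUSE trades merge time against the load the merge puts on the disks.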
Aufs is a very promising project. It currently has some serious limitations, but its author, Junjiro Okajima, is very active. Also, aufs was designed mainly for "live CD-ROM" applications, to give a read/write experience to users booting from a read-only CD-ROM. That workload is very different from the typical workload of Internet servers, but fortunately only a few design decisions were affected by the difference, and I believe Junjiro might be interested in supporting other types of workloads.

Saturday, March 24, 2007

Distributed Service Platform (The big picture)

Internet service applications, among other types of scalability-demanding applications, usually suffer from limitations in the design of current operating systems. In recent years, Internet service applications have received considerable attention from researchers studying the challenges Internet services face in supporting large numbers of concurrent requests. A common conclusion of those studies is that existing operating system designs are ill-suited to the needs of Internet server applications, and consequently various operating-system-level mechanisms have been proposed to address those limitations. These mechanisms include scalable event delivery (the FreeBSD kqueue and Linux epoll interfaces), zero-copy sockets, direct I/O, and many others. These solutions greatly enhance the ability of the operating system to cope with the type of load experienced by busy Internet servers, although they are limited to a single instance of an operating system.

On the other hand, parallel and distributed systems have been studied extensively by computer scientists in the contexts of high-performance computing, high availability, fault tolerance, and resource sharing. Those studies showed that distributed systems can address scalability and performance problems very effectively. This makes distributed architectures very attractive for Internet server applications, although, to my knowledge, no attempts have been made to formally study the system-level mechanisms (operating system kernel and system software) that would enable Internet server applications to utilize distributed resources to meet their massive concurrency and scalability needs. The main barriers to utilizing distributed architectures in Internet server applications are 1) the complexity of writing distributed applications, 2) the diminishing benefit of adding distributed resources due to existing operating system mechanisms that may limit overall system scalability, and 3) the unpredictability of resource utilization due to the application's lack of control over system-level mechanisms.

In my opinion, there is great practical potential for a service platform that precisely addresses the needs of complex Internet server applications and scales beyond the limits of a single physical machine, while still providing a simple and convenient programming model that fits nicely into existing software development processes.

Sunday, March 18, 2007

About Me !

I am currently a PhD student at Virginia Tech. My research lies in the overlap of three areas: distributed operating systems, software engineering, and pervasive computing. My PhD is about resource engineering in highly distributed platforms like sensor-actuator networks. I am basically building a distributed middleware that addresses the complexities of querying and managing highly distributed resources and provides system-level support for mobility and new interaction models for future applications.
I am also a Linux/open-source fan. I have been using Linux since 1996 for fun and profit, and I have done most of my software development work on Linux and FreeBSD. I am mainly interested in using Linux to build systems that achieve high availability, manageability, scalability, fault tolerance, and security. My master's thesis was about network-based disk transaction mirroring for Linux. That was my first experience with Linux kernel programming (real fun) and filesystem development. I am also interested in the Arabization of X11 and various toolkits, and in Linux-based IP telephony and instant messaging systems. My BSc graduation project (Penguin Phone) was actually a presence-based softphone/IM for Linux built with GTK.
Apart from academia, I am the co-founder and chief technology officer of Gawab.com. Gawab is currently the largest email service provider in the Middle East. I started developing Gawab in 1999 with the objective of building a highly scalable email hosting infrastructure capable of serving millions of users and thousands of domains, in addition to a feature-rich webmail and a web-based management console. In 2000, Gawab.com was officially launched as a free email provider for Internet users; a few months later, Gawab.net was launched as an email outsourcing provider targeting Internet portals and businesses. Today, Gawab serves over 4.5 million email users and hosts the email service for over 25,000 Internet domains.
When launched, Gawab's free service attracted Internet users at a rate that exceeded our imagination. Our main challenge at that time was to keep the cost of operating the service at a minimum in order to continue providing a high-quality service free of charge. Thanks to Linux, open-source libraries and tools, distributed systems research, and cheap commodity hardware, Gawab was able to do that and overcome all obstacles to growth. Gawab was the first email provider on the Internet to offer free 2GB mailboxes. It was also one of the first email providers to build an SPA (single-page application) webmail based on AJAX. As CTO, my role at Gawab is to direct technical development and mentor developers working on both the system and application levels. Gawab has been (and still is) the main inspiration for my research. The open research projects at Gawab include:
- ASML: Advanced Storage Management Layer. This project aims to study the design challenges of distributed persistent storage management. The objective is to design a modern block-level distributed storage manager on top of which distributed filesystems can be built.
- SMFS: Simple Mailbox File System. SMFS is a distributed, fault-tolerant filesystem designed to store user mailboxes efficiently. It is built on top of ASML.
- Clusterboost: a distributed shared memory (DSM) and RPC system designed to efficiently utilize the memory and CPUs of clustered machines.
- DCACHED: a distributed memory cache for objects, built on top of Clusterboost, providing an OODBMS interface with a simple query language.
- GABS: an anti-abuse/anti-spam system that aims to slow down (or completely stop) email traffic coming from spam zombies without overloading front-end servers. GABS stops zombie-based DDoS (distributed denial of service) attacks without affecting legitimate email traffic or overloading servers.
In 2004, I was invited by the Indiana Center for Database Systems at Purdue University to join the database systems research team for a semester. That was my first visit to the US. I worked on two research projects:
1) SP-GiST: Space-Partitioned Generalized Search Trees. I architected, designed, and implemented a new index access method for the popular PostgreSQL database to allow non-traditional indexes based on the SP-GiST framework.
2) Nile: Nile is one of the early attempts to build an advanced data-stream management system. I contributed to the design of the scheduler responsible for coordinating the different stream operators executing a given query plan.