Friday, March 30, 2007

Unification Filesystems for Disaster Recovery

For critical services that require high availability, there is nothing worse than a corrupted filesystem!
Filesystem corruption happens even with the most robust journaling filesystem.
Checking and repairing the filesystem with standard tools is not feasible in most cases: it takes hours, if not days, to finish, and data availability is critical for our business, so our disaster recovery plans never considered offline repair as an option.
FreeBSD supports background fsck, which is very good, although our experiments showed an unacceptable performance penalty. We also experienced kernel panics when stressing the filesystem while background fsck was running, and the FreeBSD mailing list archives show that we are not alone in this. That was over 6 months ago. Hopefully the bugs are fixed by now, but still neither FFS nor UFS2 satisfies our needs.
Checking filesystems on LVM snapshots is a very nice way to find out how badly a filesystem is corrupted, but LVM does not help in repairing a live filesystem (mounted r/w). It might be very useful, and relatively simple, to make modern filesystems, like XFS, co-operate with volume managers, like LVM, to support efficient background fsck for r/w mounted filesystems. I wish I had the time to work on that. Any kernel hackers out there?
Now, the most practical solution seems to be unification filesystems. Unification filesystems are simply fanout filesystems like unionfs, aufs, and union mounts. With minimal impact on performance, we can overlay a healthy filesystem on top of a possibly corrupted filesystem that is mounted read-only. All writes go to the healthy filesystem. Since unification filesystems perform snapshotting at the filesystem level (in contrast to the block device level), the manipulation of snapshots is safer. Most unification filesystems support whiteouts for deleted files, and copy-on-write for modified files.
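To make the overlay semantics concrete, here is a minimal Python sketch of how a lookup could be resolved across two branches. It is only a model of the idea, not how unionfs or aufs is implemented. The function name resolve and the branch arguments are hypothetical, and the ".wh." whiteout prefix is an assumption based on the naming convention used by unionfs and aufs; opaque directories and other details are ignored.

```python
import os

# Assumption: whiteouts are marker files named ".wh.<name>" in the writable branch.
WHITEOUT_PREFIX = ".wh."


def resolve(rel_path, upper, lower):
    """Return the real path a union of (upper over lower) would serve, or None.

    rel_path is a "/"-separated path relative to the union root.
    """
    parts = rel_path.split("/")

    # 1. The healthy, writable branch wins whenever it has its own copy
    #    (this is where copy-on-write places modified files).
    upper_path = os.path.join(upper, *parts)
    if os.path.lexists(upper_path):
        return upper_path

    # 2. A whiteout on the file or on any parent directory hides the lower entry.
    for i in range(len(parts)):
        whiteout = os.path.join(upper, *parts[:i], WHITEOUT_PREFIX + parts[i])
        if os.path.lexists(whiteout):
            return None

    # 3. Otherwise fall through to the read-only, possibly corrupted branch.
    lower_path = os.path.join(lower, *parts)
    return lower_path if os.path.lexists(lower_path) else None
```

With this model, a file that was modified after the overlay was set up resolves to the healthy branch, a deleted file resolves to nothing, and an untouched file is still read from the old filesystem.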
A very simple shell script can merge the snapshots in the background. The script can easily control the pace of the merge, to balance the time it takes against its impact on performance.
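As a rough illustration of such a paced merge, here is a sketch in Python rather than shell. It assumes that "merging" means copying every file that exists only in the read-only branch into the healthy branch, skipping anything already modified or deleted there. The function name merge_lower_into_upper, the pause_seconds parameter, and the ".wh." whiteout prefix are all assumptions made for this example. A real script would also need to handle opaque directories, hard links, and whatever the unification filesystem requires when a branch is modified underneath it.

```python
import os
import shutil
import time

# Assumption: whiteouts are marker files named ".wh.<name>" in the writable branch.
WHITEOUT_PREFIX = ".wh."


def merge_lower_into_upper(lower, upper, pause_seconds=0.0):
    """Copy files that exist only in `lower` into `upper`.

    Sleeps `pause_seconds` after each copy to limit the impact on live traffic.
    Returns the list of relative paths that were copied.
    """
    copied = []
    for dirpath, dirnames, filenames in os.walk(lower):
        rel_dir = os.path.relpath(dirpath, lower)
        upper_dir = os.path.normpath(os.path.join(upper, rel_dir))

        # Do not descend into directories that were deleted through the union.
        dirnames[:] = sorted(
            d for d in dirnames
            if not os.path.lexists(os.path.join(upper_dir, WHITEOUT_PREFIX + d))
        )

        for name in sorted(filenames):
            target = os.path.join(upper_dir, name)
            whiteout = os.path.join(upper_dir, WHITEOUT_PREFIX + name)
            # Newer copy or whiteout in the healthy branch: leave it alone.
            if os.path.lexists(target) or os.path.lexists(whiteout):
                continue
            os.makedirs(upper_dir, exist_ok=True)
            shutil.copy2(os.path.join(dirpath, name), target)
            copied.append(os.path.normpath(os.path.join(rel_dir, name)))
            time.sleep(pause_seconds)  # throttle: larger pause, gentler merge
    return copied
```

Raising pause_seconds stretches the merge over a longer period but leaves more I/O bandwidth for the service; setting it to zero finishes as fast as the disks allow.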
Aufs is a very promising project. There are currently some serious limitations in aufs, but its author, Junjiro Okajima, is very active. Also, aufs was designed mainly for "live cdrom" applications, to give a read/write experience to users booting from a read-only CD-ROM. This type of workload is very different from the typical workload of internet servers, but fortunately only a few design decisions were affected by this difference, and I believe that Junjiro might be interested in supporting other types of workloads.

4 comments:

Anonymous said...

In your post you say "Filesystem corruption happens even with the most robust journaling filesystem."! Well what I would like to know is WHY?

It would seem that the best place to tackle this problem would be the file system itself. How hard can it be to create an "incorruptible" FS? Even if it is non-trivial, the result would be well worth it!

BTW, I believe I have the honor to post your first comment?

Anonymous said...

I am unsure if I understood all of your post, but don't blame me, you are the 4th result on Google.

I am here to talk off-topic.
Have you heard about Phantom OS (please don't look for it on Wikipedia)? Here is its FAQ, and here an article about it on The Register.
The most promising thing about it is its filesystem, or actually, its lack of a file system.
It has no files, only objects of data and OS state. All objects, and everything that resides in userland memory, are mapped to disk and snapshotted frequently, so even after a power failure there should be no loss of data or of the state of the OS and applications.
I only wonder how this behaviour affects the platform's performance compared to a usual OS on the same hardware. Another point is how this weird filesystem can handle large databases.

BTW, I have the honor to post the second, after a delay of 1.25 years :)

Anonymous said...

And oh, I forgot to say, congratulations on the PhD :)

Anonymous said...

Your blog keeps getting better and better! Your older articles are not as good as the newer ones; you have a lot more creativity and originality now. Keep it up!