Sunday, May 3, 2009

Real world goodness with ZFS

Not long ago I was working on a customer environment where they wanted a NetBackup hardware refresh. To save time, I won't go into details, but the end result was a Solaris (SPARC) media server powered by a T5220 (32GB RAM, 8 cores) and a full complement of dual-port 4Gb HBAs. The storage was an ST6540 with about 200TB of 1TB SATA drives, configured in RAID5. The media server was given (60) 1.6TB LUNs. The initial configuration was three ZFS pools, each striped across 20 LUNs with no ZFS-level redundancy, i.e. three 32TB RAID0 stripes. Not the most optimal in resiliency, true; but this is a backup server and we understood the risk level: we wanted capacity and were willing to lose a pool should RAID5 fail us.
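For reference, a layout like that takes nothing more than one `zpool create` per pool. The pool and device names below are hypothetical placeholders; real LUNs would show up under /dev/dsk as c#t#d# (or long WWN-based names with MPxIO):

```shell
# Listing bare devices (no "mirror" or "raidz" keyword) stripes them,
# so losing any one LUN loses the whole pool -- the tradeoff we accepted.
zpool create bkpool1 \
    c2t0d0  c2t1d0  c2t2d0  c2t3d0  c2t4d0 \
    c2t5d0  c2t6d0  c2t7d0  c2t8d0  c2t9d0 \
    c2t10d0 c2t11d0 c2t12d0 c2t13d0 c2t14d0 \
    c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0

# Repeat for bkpool2 and bkpool3 with the remaining 40 LUNs, then:
zpool list    # should show three ~32TB pools
```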

We fired off a series of backup jobs and were immediately able to run about 5 to 6x the number of jobs that had been possible on the old hardware. The T5xx0 series systems absolutely SCREAM as NetBackup media servers: between the I/O capabilities of the hardware and the threading ability of the CPU, they can outperform an M5000 at a fraction of the cost. But on day two or three, CRASH! The media server went down hard and required a power cycle. Upon rebooting (which took about 3 minutes) we found that our paths to storage had gone down and ZFS did what it was supposed to do: PANIC the box rather than keep writing to broken storage. This went on for a week or so, and the customer decided to try Veritas Storage Foundation to see if that resolved it, on the theory that this was a ZFS problem because "ZFS is causing the PANIC". Fast forward a couple of days running on Veritas, and CRASH! It happened again. This time, however, it took about 6 hours to FSCK the 100TB of storage before we could get the host back online and restart backups. It got even more fun: the crashes became more frequent, and after a week we were spending about 18 hours a day rebooting and running FSCK, and getting practically NO backups done.
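As an aside, on releases recent enough to have it (zpool version 10 and later, e.g. Solaris 10 10/08), the panic-on-storage-loss behavior is tunable per pool via the failmode property. The pool name below is a placeholder:

```shell
# Show what the pool will do on catastrophic I/O failure:
zpool get failmode bkpool1

# "wait"     - block I/O until the device comes back (the newer default)
# "continue" - return EIO to new writes, keep retrying reads
# "panic"    - panic the host, as described above
zpool set failmode=wait bkpool1
```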

Turns out the ST6540 had controller issues, which were resolved after a couple of swap-outs and some parity-group rebuilds (the controller swaps had issues of their own). But the takeaway I got from this is that in an unstable environment the true power of ZFS shines even if you don't take advantage of RAIDZ or any of its other plentiful features: no matter how hard the box went down, we were back in business minutes after a reboot, with no FSCK required. Why would you run anything else, especially for a non-clustered file system requirement? Really, I'm asking! Maybe I'm missing something. It's free, it's OPEN, and if it's the default for the new Mac OS, it can't be bad (this coming from a Solaris/Linux/Windows-at-home guy, too).
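The reason no FSCK is ever needed is that ZFS is copy-on-write, so the on-disk state is always consistent and a crashed pool simply imports at boot. If you want an integrity check anyway, the ZFS analogue is a scrub (pool name again a placeholder):

```shell
# Walk every allocated block in the pool and verify its checksum,
# in the background, while the pool stays online and serving I/O:
zpool scrub bkpool1

# Watch progress and see any checksum/read/write errors per device:
zpool status -v bkpool1
```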