Monday, July 26, 2010

RAID Group Options and Considerations

When discussing storage solutions, one of the most frequently asked questions is when to use which RAID type. First, the caveats: some array manufacturers use a modified version of the standard RAID types or have an architecture that lends itself to one type or another. In general, though, these guidelines are applicable.

The most common RAID configurations in use today are RAID 10, 5 and 6. Three factors need to be considered when determining which RAID type is most appropriate for your needs: performance, availability and price.

RAID 10 – Mirroring and striping. Provides the best IOPS performance, which is why database vendors typically recommend it for OLTP databases. From an availability standpoint, RAID 10 can survive multiple drive failures as long as both disks of the same RAID 1 pair are not lost. RAID 10 is the most expensive RAID configuration, with 50% of the capacity dedicated to protection.

RAID 5 – Block Level Striping with distributed parity. RAID 5 is the workhorse of most storage environments. Many storage vendors have made enhancements to their solutions such as dedicating a processor to perform parity calculations and using intelligent caching algorithms that perform full stripe writes whenever possible - minimizing the performance impact associated with parity calculations. RAID 5 configurations can withstand a single drive failure. As disk drive capacities increase there is more and more debate about the risk of data loss associated with RAID 5, in particular with slower SATA devices.
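The cost of those parity calculations is easy to put rough numbers on. A classic RAID 5 small random write turns into four disk I/Os (read old data, read old parity, write new data, write new parity), and RAID 6 into six. A back-of-the-envelope sketch; the 180 IOPS per drive is an assumed ballpark for a 15K spindle, not a measured figure:

```shell
#!/bin/sh
# Effective small-random-write IOPS for a parity RAID group:
#   drives * per-drive IOPS / write penalty
# 180 IOPS is an assumed ballpark for a single 15K drive.
write_iops() {
    awk -v n="$1" -v iops="$2" -v penalty="$3" \
        'BEGIN { printf "%d\n", n * iops / penalty }'
}
write_iops 9 180 4   # 9 drives, RAID 5 penalty of 4 -> 405
write_iops 9 180 6   # same 9 drives, RAID 6 penalty of 6 -> 270
```

The naive model puts RAID 6 about a third behind RAID 5 on small writes, in the same neighborhood as the 25-30% penalty quoted below; full-stripe writes and write caching are exactly what vendors use to close that gap.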

More information on this can be found in this article by Adam Leventhal.

From a cost perspective RAID 5 is very attractive since only one drive of capacity is used for protection. On HDS storage systems 7+1 RAID 5 groups are standard on their enterprise systems and 8+1 RAID 5 groups are most common on the midrange solutions.

RAID 6 – Block Level Striping with double distributed parity. RAID 6 incurs the biggest performance penalty from parity calculations. For write operations the performance impact is frequently between 25% and 30% compared to RAID 5, which may rule RAID 6 out for environments requiring high levels of performance. A RAID 6 array group can withstand two simultaneous disk drive failures. The cost of RAID 6 depends on the size of the array group. Some maintain that since you can increase the number of data drives in the array group, RAID 5 and RAID 6 cost the same - for example, a RAID 5 8+1 array group has the same relative cost as a RAID 6 16+2 array group. This is of course manufacturer dependent, and you need to understand how your particular system actually works.
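That relative-cost argument is just arithmetic on the usable-capacity fraction, data / (data + parity). A quick sketch:

```shell
#!/bin/sh
# Usable-capacity fraction of a parity RAID group: data / (data + parity).
usable() {
    awk -v d="$1" -v p="$2" \
        'BEGIN { printf "%d+%d: %.1f%% usable\n", d, p, 100 * d / (d + p) }'
}
usable 8 1    # RAID 5, 8+1
usable 16 2   # RAID 6, 16+2 - same fraction, hence "same relative cost"
usable 8 2    # RAID 6 at the same group width is noticeably costlier
```

8+1 and 16+2 both come out at 88.9% usable, while 8+2 drops to 80.0% - which is why the argument only holds if your array actually supports the wider group.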

RAID 6 is highly recommended for SATA drives due to their large capacities, slower speeds and lower MTBF ratings.

So in short, here is what we recommend:
• RAID 6 for SATA drives
• RAID 10 for applications with the highest IOPS requirements
• RAID 5 for SAS/FC drives to support most workloads; as drive sizes increase, RAID 6 may become the preferred option

A thorough overview of RAID technologies is provided on Wikipedia.


Friday, July 23, 2010

Migrating volumes with Linux LVM

Client has existing data on a LUN provisioned from an old array that is to be decommissioned. The data must be migrated online to a new LUN. However, the old LUN must be retained intact so as to facilitate a rollback in case the new array doesn't work right.

This situation isn't complex at all, except for that second requirement: keeping the original LUN intact and consistent rules out a simple pvmove operation. So let's see what else we can do.

The original situation:

growler / # df -h /mnt/prod
/dev/mapper/prodVG-prodLV 2.0G 73M 1.9G 4% /mnt/prod

growler / # ls /mnt/prod/
rtv10n1.pdf rtv2n3.pdf rtv4n2.pdf rtv6n1.pdf rtv8n21.pdf
rtv1n1.pdf rtv2n4.pdf rtv4n3.pdf rtv6n2.pdf rtv9n1_HR.pdf
rtv1n2.pdf rtv3n1.pdf rtv4n4.pdf rtv6n4.pdf rtv9n2 LR-web.pdf
rtv1n3.pdf rtv3n2.pdf rtv5n1.pdf rtv7n2.pdf RTVOL6N3.pdf
rtv1n4.pdf rtv3n3.pdf rtv5n2.pdf tv7n3v10.pdf
rtv2n1.pdf rtv3n4.pdf rtv5n3.pdf rtv7n4.pdf
rtv2n2.pdf rtv4n1.pdf rtv5n4.pdf rtv8n11_web.pdf

growler prod/ # md5sum * > /root/orig.md5sum
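That md5sum file is what the whole rollback guarantee hangs on, so it's worth seeing the round trip in isolation. A sketch using throwaway files in a scratch directory rather than the real /mnt/prod:

```shell
#!/bin/sh
# Record checksums before a migration, verify them afterwards.
# Runs against a scratch directory, not a real production mount.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo alpha > a.pdf
echo beta  > b.pdf

md5sum -- *.pdf > orig.md5sum      # before: record a baseline
md5sum --quiet -c orig.md5sum      # after: no output means every file matched
echo "verify status: $?"
```

With --quiet, md5sum only speaks up on a mismatch, which is exactly the behavior you want in a migration log.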

growler / # lvm
lvm> lvdisplay -m
--- Logical volume ---
LV Name /dev/prodVG/prodLV
VG Name prodVG
LV UUID K1ni9A-Q1qU-8xD1-Po4g-i0Y9-tYQN-C34riv
LV Write Access read/write
LV Status available
# open 1
LV Size 1.95 GiB
Current LE 499
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:1
--- Segments ---
Logical extent 0 to 498:
Type linear
Physical volume /dev/loop1
Physical extents 0 to 498

lvm> vgdisplay
--- Volume group ---
VG Name prodVG
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size 1.95 GiB
PE Size 4.00 MiB
Total PE 499
Alloc PE / Size 499 / 1.95 GiB
Free PE / Size 0 / 0
VG UUID toXyOv-6YX2-te1E-28fS-1kFh-KtGo-XEGI5S

lvm> pvdisplay
--- Physical volume ---
PV Name /dev/loop1
VG Name prodVG
PV Size 1.95 GiB / not usable 4.00 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 499
Free PE 0
Allocated PE 499
PV UUID 3aOE5g-FyCf-z38s-NV7W-XY29-6E5X-8v7N0g
"/dev/loop2" is a new physical volume of "1.95 GiB"
--- NEW Physical volume ---
PV Name /dev/loop2
VG Name
PV Size 1.95 GiB
Allocatable NO
PE Size 0
Total PE 0
Free PE 0
Allocated PE 0
PV UUID GqZgUS-D0wj-6DHj-2lFs-YTqV-MJLy-JmlP4x

Demonstration of Solution

/dev/loop1 is playing the part of the old LUN that is being decommissioned. /dev/loop2 is playing the role of the LUN provisioned from the new array. I'm using loopback devices for two reasons: first, growler doesn't have any extra spindles and second, using loopback devices will hopefully protect anyone who is using this blog as a recipe from inadvertently destroying their live volumes.
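If you want to rehearse this yourself, the backing files for the loopback devices are cheap to make. A sketch - the image paths are my invention, and the loop-attach and pvcreate steps need root, so those lines are shown commented out:

```shell
#!/bin/sh
# Create two sparse 2 GB backing files to stand in for the old and new LUNs.
# The paths are arbitrary; attaching them and initializing the PV need root,
# so those steps are comments only.
workdir=$(mktemp -d)
truncate -s 2G "$workdir/oldlun.img"
truncate -s 2G "$workdir/newlun.img"

# losetup /dev/loop1 "$workdir/oldlun.img"   # old LUN (needs root)
# losetup /dev/loop2 "$workdir/newlun.img"   # new LUN (needs root)
# pvcreate /dev/loop2

ls -l "$workdir"
```

Because the files are sparse, they cost almost no actual disk space until written to.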
The next step is to extend the existing volume group onto the new loop2 device:
lvm> vgextend prodVG /dev/loop2
Volume group "prodVG" successfully extended

With that done, now we can convert the existing logical volume prodLV into a mirror with two legs, one of which is on the new disk. LVM gives you a helpful progress report. Note that I'm using a memory-based sync log (corelog option) instead of a bitmap logging volume. Since this isn't intended to be a long-term mirror, it's safe and faster to use a memory log instead of the more traditional logging volume.
lvm> lvconvert -m1 --corelog prodVG/prodLV /dev/loop2
prodVG/prodLV: Converted: 31.3%
prodVG/prodLV: Converted: 35.1%
prodVG/prodLV: Converted: 38.3%
prodVG/prodLV: Converted: 42.3%
prodVG/prodLV: Converted: 45.3%
prodVG/prodLV: Converted: 47.5%
prodVG/prodLV: Converted: 49.3%
prodVG/prodLV: Converted: 51.5%
prodVG/prodLV: Converted: 54.5%
prodVG/prodLV: Converted: 58.5%
prodVG/prodLV: Converted: 62.3%
prodVG/prodLV: Converted: 65.7%
prodVG/prodLV: Converted: 70.1%
prodVG/prodLV: Converted: 73.3%
prodVG/prodLV: Converted: 77.2%
prodVG/prodLV: Converted: 80.8%
prodVG/prodLV: Converted: 84.4%
prodVG/prodLV: Converted: 88.0%
prodVG/prodLV: Converted: 92.6%
prodVG/prodLV: Converted: 97.4%
prodVG/prodLV: Converted: 100.0%
Logical volume prodLV converted.

And now we can see prodLV is mirrored across the two disks:
lvm> lvdisplay -m
--- Logical volume ---
LV Name /dev/prodVG/prodLV
VG Name prodVG
LV UUID K1ni9A-Q1qU-8xD1-Po4g-i0Y9-tYQN-C34riv
LV Write Access read/write
LV Status available
# open 1
LV Size 1.95 GiB
Current LE 499
Mirrored volumes 2
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:1
--- Segments ---
Logical extent 0 to 498:
Type mirror
Mirrors 2
Mirror size 499
Mirror region size 512.00 KiB
Mirror original:
Logical volume prodLV_mimage_0
Logical extents 0 to 498
Mirror destinations:
Logical volume prodLV_mimage_1
Logical extents 0 to 498

This next step turns off the mirror leg that is writing to the original LUN. I've chosen to do this 'hot' with an active filesystem. Whether this is wise or not depends on the type of filesystem and type of workload. Even here, it creates a couple of extra steps. In a perfect world, I'd take a 10-second outage to dismount the volume, split the mirror, and remount the volume.
lvm> lvconvert -m0 prodVG/prodLV /dev/loop1
LV prodVG/prodLV_mimage_0 in use: not deactivating

Now a cleanup step. LVM won't automatically remove the mirror image devices because the volume was active when I split it, so I get to do it manually. It'd be a good idea to run lvdisplay -am before the lvremove, to ensure all the extents are where they're supposed to be and that prodVG/prodLV is mapped exclusively to loop2.
lvm> lvremove -f prodVG/prodLV_mimage_0
Do you really want to remove active logical volume prodLV_mimage_0? [y/n]: y
Logical volume "prodLV_mimage_0" successfully removed
lvm> lvremove -f prodVG/prodLV_mimage_1
Do you really want to remove active logical volume prodLV_mimage_1? [y/n]: y
Logical volume "prodLV_mimage_1" successfully removed

OK, now we can split the volume group apart. Afterwards, we recreate the logical volume in its new volume group, on exactly the same extent boundaries. Notice the -l499 parameter: 499 was the "Current LE" count in the very first lvdisplay.
lvm> vgsplit prodVG backoutVG /dev/loop1
New volume group "backoutVG" successfully split from "prodVG"
lvm> lvcreate -l499 -n backoutLV backoutVG
Logical volume "backoutLV" created

Now the last step is to fsck the backout device. This step is only necessary because I split the mirror while it was mounted.
growler linux # fsck /dev/backoutVG/backoutLV
fsck from util-linux-ng 2.17.2
fsck.jfs version 1.1.14, 06-Apr-2009
processing started: 7/23/2010 15.27.53
Using default parameter: -p
The current device is: /dev/mapper/backoutVG-backoutLV
Block size in bytes: 4096
Filesystem size in blocks: 510976
**Phase 0 - Replay Journal Log
Filesystem is clean.

And just to prove everything is copacetic, I will mount the backout volume and compare the md5sums of its contents, and of the current production volume, against the checksum file I took at the beginning of this exercise:
growler linux # mount -t jfs /dev/backoutVG/backoutLV /mnt/oldLun
growler prod # cd /mnt/oldLun/
growler oldLun # md5sum --quiet -c /root/orig.md5sum
growler oldLun # cd ../prod
growler prod # md5sum --quiet -c /root/orig.md5sum

md5sum would have complained about any bit errors, but there were none, and LVM is happy. Eventually, I'll tear down backoutLV and backoutVG and remove the loop1 device from LVM. Then at long last, I can disconnect and decommission the old array.

Monday, July 19, 2010

Teaching VMs to Share and Share Alike

I've always marveled at how a child with no interest in a toy must have it the moment another child picks it up.

My vSphere guest VMs refused to share a raw device mapped LUN that we use for in-band management of HDS arrays. Prior versions let me present the same raw LUN to multiple VMs so long as only one of them used it at a time; my current version (4.0.0 build 261574) refuses to even allow that. Presenting the LUN to a single VM works, but it's a pain to manage in a dynamic lab environment where anyone might need to use it.

Enter SCSI Bus Sharing, allowing multiple VMs to um, share a SCSI bus. This feature is more traditionally used for clustering VMs, but it turns out to suit my needs quite well.

There are two modes for SBS: virtual sharing and physical sharing. Virtual sharing means VMs on the same host can share the SCSI bus; physical sharing means VMs across servers can share the bus.

Unfortunately, neither mode will allow VMotion of a running VM (or suspended; see below), hence I probably wouldn't recommend this for general production use.

But for lab use it makes a lot of sense. Here's how I recommend doing it.
  1. Remove the command device RDM from any guest OS configuration(s).
  2. In the VI Client Datastores view, browse to a shared datastore and create a folder (e.g., "cci" or "cmd-devs")
  3. In the VI Client Hosts and Clusters view, browse to a host's Storage Adapter configuration and find the raw LUN to which you want to map. Right-click the LUN and select "Copy identifier to clipboard"
  4. Open an SSH session to an ESX host and cd into /vmfs/volumes/'datastore'/'folder'
  5. Create a raw device map: vmkfstools --createrdmpassthru /vmfs/devices/disks/<identifier> <name>.vmdk. For example:
    vmkfstools --createrdmpassthru /vmfs/devices/disks/naa.60060e800564a400000064a4000000ff uspvm_cmddev_00ff.vmdk
  6. Back in the VI Client, go to VMs and Templates and select your first VM and "Edit Settings..."
  7. "Add..." a new hard disk. Select "Use existing virtual disk" and when prompted browse to the datastore and folder where the command device vmdk is located.
  8. Map the RDM to an unused SCSI node (X:Y, where X is the SCSI controller number and Y is the SCSI ID of the new hard disk). Most commonly 1:0 will be available; you need something other than SCSI controller 0.
  9. Mark the new disk as Independent and select Persistent mode.
  10. Click OK to close the Virtual Machines Properties dialog.
  11. Re-open the VM Properties dialog box and select the new SCSI controller. On the right-hand side select the Physical bus sharing mode.
  12. Close the dialog box and restart your VM.
Repeat as necessary. Remember you will be unable to VMotion these guests while they are running. VI Client will report the guest has a SCSI controller engaged in bus sharing. (Interestingly, I was able to VMotion a suspended guest, but the guest automagically migrated back to the original host when I powered it back on.)

Sunday, July 11, 2010

What on earth is ssd with .t in sar -d output?

We've been working with a customer on performance tuning some Solaris 10 systems using MPXIO, and I got an email on Friday asking about the device names.

If you don't know, sar -d uses the driver name and instance number for its output. So instead of the /dev/dsk/cXtYdZ syntax we all know and love, you get something like ssd123. There are plenty of scripts out there that will translate between the two (see the bottom of this post for yet another), but in this case the customer sent me some output and asked what the ",{letter}" and ".t" at the end of the ssd name meant.

The ",{letter}" syntax is familiar - basically the letter correspond to slices on the disks, with a=0, b=1, and so on. I wasn't familiar with the ".t" syntax though and spent an embarrassing amount of time trying to figure out what was going on.

Turns out that sar limits device names to 8 characters (see here, line 662, for a reference). If you dig a little deeper with "kstat -p", though, you'll see that the full device name is actually ssdX.tY.fpZ. The "t" stands for target - which seems obvious in retrospect.
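So untangling one of these names is just string surgery on the dots. A sketch against a made-up kstat-style name (a literal string, not live kstat output):

```shell
#!/bin/sh
# Pull the pieces out of a full device name like "ssd123.t4.fp0":
# driver+instance before the first dot, target after ".t", port after ".fp".
name="ssd123.t4.fp0"                        # sample value, not live output
instance=${name%%.*}                        # -> ssd123
target=${name#*.t}; target=${target%%.*}    # -> 4
port=${name##*.fp}                          # -> 0
echo "$instance target=$target fp=$port"
```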

Oh and before there's a dogpile: yes, this seems custom-built for DTrace - but the sar data is handy.

Here's a script to translate ssd #'s to cXtYdZ #'s if you've got path_to_inst and ls-l_dev_rdsk.out (which are conveniently available in an explorer):

# - takes the path_to_inst and a long directory
# listing on /dev/rdsk and maps sd (and ssd)
# #'s to their corresponding cXtYdZ numbers
# Expects both the path_to_inst (named path_to_inst)
# and the directory listing (named ls-l_dev_rdsk.out)
# to be in the current directory
my (%inst_by_path, %ctd_by_path);

# path_to_inst lines look like:
#   "/pci@1f,0/fibre-channel@4/ssd@w2100...,0" 123 "ssd"
open(INPUT, '<', './path_to_inst')
    or die qq{Can't open path_to_inst!\n};
while (<INPUT>) {
    next if /^#/;
    my ($devicepath, $instance, $driver) = split;
    next unless defined $driver;
    $devicepath =~ s/"//g;
    $driver     =~ s/"//g;
    $inst_by_path{$devicepath} = "$driver$instance";
}
close INPUT;

# ls -l /dev/rdsk lines look like:
#   lrwxrwxrwx 1 root root 70 ... c1t0d0s0 -> ../../devices/...@w2100...,0:a,raw
open(INPUT, '<', './ls-l_dev_rdsk.out')
    or die qq{Can't open ls-l_dev_rdsk.out!\n};
while (<INPUT>) {
    next unless /^lrwxrwxrwx/ && /:a,raw/;
    my @fields = split;
    my ($ctd, $target) = @fields[-3, -1];
    $ctd    =~ s/s0$//;                   # drop the slice from c1t0d0s0
    $target =~ s{^\.\./\.\./devices}{};   # strip the symlink prefix
    $target =~ s/:a,raw$//;               # strip the minor-node suffix
    $ctd_by_path{$target} = $ctd;
}
close INPUT;

for my $devicepath (sort keys %ctd_by_path) {
    next unless exists $inst_by_path{$devicepath};
    print qq{$inst_by_path{$devicepath}, $ctd_by_path{$devicepath}, $devicepath\n};
}