Lumenate is a technical consulting firm focused on enabling the virtualized enterprise. With disciplines in Storage, Security and Networking, Lumenate designs and implements customized, integrated solutions which enable our customers to transition to a converged infrastructure and enjoy the benefits of virtualization with increased security and scalability. Not just storage. Storage meat. Our goal – design and deliver the optimal customer experience.
Friday, October 28, 2011
Storage MBps Performance Concepts
Monday, October 24, 2011
Cisco MDS switch interoperability with Brocade Access Gateway and Qlogic Adapters
Monday, September 19, 2011
Recovering an NTFS Boot Sector in Symantec Storage Foundation for Windows
Monday, September 12, 2011
Integrating DB2 Advanced Copy Services with NetApp Snapshots
Setup of ACS is fairly easy in either a Linux or AIX environment. ACS uses RSH to communicate with the NetApp filer, so the RSH service must be enabled on the filer (options rsh.enable on). The database server can be included in the /etc/hosts.equiv file on the filer so that a password is not needed. ACS is intelligent enough to correlate the database and log file systems/volume groups on the server to the corresponding volumes/LUNs on the filer.
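As a rough sketch of that filer-side preparation on the Data ONTAP console (the filer name matches the profile created below; the database server hostname is a placeholder, and /etc/hosts.equiv can also be edited from an NFS or CIFS mount of the filer's root volume):
netappcntl1> options rsh.enable on
netappcntl1> wrfile -a /etc/hosts.equiv dbserver1 db2int2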
During testing under AIX, I found an issue with the setup.sh script where the acscim program caused the script to fail. The acscim module is used to communicate to IBM storage such as the DS series and requires supporting libraries that were not available on my system. I commented out the acscim check and the setup script completed normally. The setup script needs to be executed as root, but from the database instance user's home directory. The database instance user in the example below is db2int2.
ACS is installed under the database instance user's home directory. Go into the ACS directory and edit the setup.sh script to comment out the acscim binary check.
bash-3.00# cd /home/db2int2/sqllib/acs
bash-3.00# grep cim setup.sh
checkbin ${INST_DIR}/acs/acscim
enableSUID ${INST_DIR}/acs/acscim
bash-3.00# vi setup.sh
bash-3.00# grep cim setup.sh
# checkbin ${INST_DIR}/acs/acscim
enableSUID ${INST_DIR}/acs/acscim
Execute the setup.sh script to provide the necessary parameters to configure ACS. I chose the defaults for most of the questions. The ACS_REPOSITORY needs to be set to the desired directory path, which will be created by the script. The COPYSERVICES_HARDWARE_TYPE is either NAS_NSERIES or SAN_NSERIES under AIX, or NAS_NSERIES for Linux.
bash-3.00# pwd
/home/db2int2/sqllib/acs
bash-3.00# ./setup.sh
checking /home/db2int2/sqllib/acs/acsnnas ...
OK
checking /home/db2int2/sqllib/acs/acsnsan ...
OK
Do you have a full TSM license to enable all features of TSM for ACS ?[y/n]
n
****** Profile parameters for section GLOBAL: ******
ACS_DIR [/home/db2int2/sqllib/acs ]
ACSD [57329 ] 57328
TRACE [NO ]
****** Profile parameters for section ACSD: ******
ACS_REPOSITORY [/home/db2int2/sqllib/acs/acsrepository ]
****** Profile parameters for section CLIENT: ******
TSM_BACKUP [NO ]
MAX_VERSIONS [2 ]
LVM_FREEZE_THAW [YES ]
DEVICE_CLASS [STANDARD ]
****** Profile parameters for section STANDARD: ******
COPYSERVICES_HARDWARE_TYPE [SAN_NSERIES]
COPYSERVICES_PRIMARY_SERVERNAME [netappcntl1 ]
COPYSERVICES_USERNAME [root ]
======================================================================
The profile has been successfully created.
Do you want to continue by specifying passwords for the defined devices? [y/n]
y
Please specify the passwords for the following profile sections:
STANDARD
master
Creating password file at /home/db2int2/sqllib/acs/shared/pwd.acsd.
A copy of this file needs to be available to all components that connect to acsd.
BKI1555I: Profile successfully created. Performing additional checks. Make sure to restart all ACS components to reload the profile.
After setup is complete, check to see if the daemons are configured to start in /etc/inittab. Note: acsnnas is for NetApp NAS volumes and acsnsan is for NetApp SAN volumes.
bash-3.00# grep acs /etc/inittab
ac00:2345:respawn:/home/db2int2/sqllib/acs/acsd
ac00:2345:respawn:/home/db2int2/sqllib/acs/acsnnas -D
OR
ac00:2345:respawn:/home/db2int2/sqllib/acs/acsnsan -D
Check to see if the daemons are running:
bash-3.00# ps -ef | grep acs
root 12255442 6225980 0 16:25:07 pts/2 0:00 grep acs
db2int2 12451872 1 0 16:24:50 - 0:00 /home/db2int2/sqllib/acs/acsd
db2int2 12451873 1 0 16:26:35 - 0:00 /home/db2int2/sqllib/acs/acsnsan -D
bash-3.00#
Now that ACS is configured, we can perform snapshot backups and restores. As the database instance user, execute the following commands to take backups, list backups, or restore the database.
Execute the following to take an offline backup:
bash-3.00$ db2 backup db mydb use snapshot
You can specify the "online" parameter to take an online backup of the database:
bash-3.00$ db2 backup db mydb online use snapshot
To list the backups of the database:
bash-3.00$ db2acsutil query
To restore the latest backup:
bash-3.00$ db2 restore db mydb use snapshot
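If multiple snapshot backups exist, a specific one can be restored by the timestamp reported by db2acsutil query; the timestamp below is illustrative:
bash-3.00$ db2 restore db mydb use snapshot taken at 20110912153000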
Monday, September 5, 2011
"Unfortunately this has not been documented very well." Fun with VERITAS Cluster Server
The application team went through their testing over the last month or so, and we completed our VCS test matrix in preparation for cutover. During the cutover, though, we noticed the following message in the Oracle alert log:
WARNING:Oracle instance running on a system with low open file descriptor
limit. Tune your system to increase this limit to avoid
severe performance degradation.
Thinking that we'd missed a resource control setting somewhere, we went through the process of validating those settings. Then, seeing that they looked correct, we asked the DBA to stop and restart the database manually, only to find that the error message above didn't appear. Using VCS to stop and start the database would generate this error every time, though.
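One way to see the limit a running instance actually inherited is to check an Oracle background process directly. For example, on a Linux kernel that exposes /proc/<pid>/limits (the instance name here is illustrative, and <pid> is whatever pgrep returns; on Solaris, plimit <pid> shows the same information):
bash-3.00$ pgrep -f ora_pmon_MYDB
bash-3.00$ grep 'open files' /proc/<pid>/limits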
We opened a case with Symantec and started to troubleshoot. Thankfully, before we ran out of window for the cutover, we found that in VCS 5.1 SP1 Symantec added a file called vcsenv that hardcodes limits for CPU time, core file size, data segment size, file size, and the number of open file descriptors.
The location and contents of the file are shown below, including where we set the number of file descriptors to 8192.
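The file lives under the standard VCS installation path at /opt/VRTSvcs/bin/vcsenv. A rough sketch of the relevant lines (the exact shipped values vary by platform and release; 8192 is the open file descriptor value we settled on):
ulimit -t unlimited     # CPU time
ulimit -c unlimited     # core file size
ulimit -d unlimited     # data segment size
ulimit -f unlimited     # file size
ulimit -n 8192          # open file descriptors, raised to clear the Oracle warning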
"Unfortunately this has not been documented very well." Fun with VERITAS Cluster Server
Monday, August 29, 2011
Working with NetBackup CTIME Timestamps in Excel
Some of my favorite NetBackup reports output job start/stop times and other data using CTIME timestamps:
- bpmedialist shows the expiration date for tapes
- bpimagelist shows the start and stop times for backup images in the catalog, as well as image expiration dates
To convert between the timestamp and “human readable” time, Symantec provides the bpdbm command with the -ctime flag to convert from NetBackup’s timestamp to the current time zone. For example:
bash-3.00# bpdbm -ctime 1303936891
1303936891 = Wed Apr 27 15:41:31 2011
To convert from CTIME to “Excel” time you need the following information:
- The NetBackup timestamp (the number of seconds since 1/1/1970, in GMT)
- The number of seconds per day (86,400)
- The Excel serial number for 1/1/1970 (25,569)
- Your current timezone offset in hours (for US/Central this is currently -5)
timestamp/86400+25569+(-5/24)
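As a worked example using the timestamp from the bpdbm output above (and assuming the timestamp is in cell A1):
=A1/86400+25569+(-5/24)
1303936891/86400 + 25569 - 5/24 = 40660.6538, which displays as 4/27/2011 15:41:31 when the cell is formatted as a date/time, matching the bpdbm result.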
Monday, August 22, 2011
How to Mount Cloned Volume Groups in AIX
Typically, creating a clone of a LUN and mounting the file system on the original server is a trivial process. The process becomes more complex if volume management is involved. Server-based volume management software provides many benefits, but it complicates matters where LUN clones are used. In the case of IBM's Logical Volume Manager (LVM), mounting clones on the same server results in duplicate volume group information. Luckily, AIX allows LVM to have duplicate physical volume IDs (PVIDs) for a "short period" of time without crashing the system. I'm not sure exactly what a "short period" of time equates to, but in my testing I didn't experience a crash.
The process to "import" a cloned volume group for the first time is disruptive in that the original volume group must be exported. It is necessary to have the original volume group exported so that the physical volume IDs (PVIDs) on the cloned LUNs can be regenerated. The recreatevg command is used to generate new PVIDs and to rename the logical volumes in the cloned volume group. Note that the /etc/filesystems entries need to be manually updated because the recreatevg command prepends /fs to the original mount point names for the clones. Once /etc/filesystems is updated, the original volume group can be re-imported with importvg.
Subsequent refreshes of previously imported clones can be accomplished without exporting the original because ODM remembers the previous PVID to hdisk# association. It does not reread the actual PVID from the disk until an operation is performed against the volume group. The recreatevg command will change the PVIDs and volume names on the cloned volume group without affecting the source volume group.
Process for initial import of a cloned volume group:
- Clone the LUNs comprising the volume group
  - Make sure to clone in a consistent state
- Unmount and export the original volume groups
  - Use df to associate file systems to volumes
  - Unmount the file systems
  - Use lsvg to list the volume groups
  - Use lspv to view the PVIDs for each disk associated with the volume groups
  - Remember the volume group names and which disks belong to each VG that will be exported
  - Use varyoffvg to offline each affected VG
  - Use exportvg to export the VGs
- Bring in the new VGs
  - Execute cfgmgr to discover the new disks
  - Use lspv to identify the duplicate PVIDs
  - Execute recreatevg on each new VG, listing all disks associated with the volume group and using the -y option to name the VG
  - Use lspv to verify there are no duplicate PVIDs
- Import the original volume groups
  - Execute importvg with the name of one member hdisk and the -y option with the original name
  - Mount the original file systems
- Mount the cloned file systems
  - Make mount point directories for the cloned file systems
  - Edit /etc/filesystems to update the mount points for the cloned VG file systems
  - Use the mount command to mount the cloned file systems
Process to refresh a cloned volume group (a condensed command example appears at the end of this post):
- Unmount and vary off the cloned volume groups to be refreshed
  - Execute umount on the associated file systems
  - Use varyoffvg to offline each target VG
- Refresh the clones on the storage system
- Bring in the refreshed clone VGs
  - Execute cfgmgr
  - Use lspv and notice that ODM remembers the hdisk/PVID and volume group associations
  - Use exportvg to export the VGs, noting the hdisk numbers for each VG
  - Execute recreatevg on each refreshed VG, naming all disks associated with the volume group and using the -y option to set the VG back to its original name
  - lspv now displays new, unique PVIDs for each hdisk
- Mount the refreshed clone file systems
  - Edit /etc/filesystems to correct the mount points for each volume
  - Use the mount command to mount the refreshed clones
bash-3.00# df
Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 1048576 594456 44% 13034 17% /
/dev/hd2 20971520 5376744 75% 49070 8% /usr
/dev/hd9var 2097152 689152 68% 11373 13% /var
/dev/hd3 2097152 1919664 9% 455 1% /tmp
/dev/hd1 1048576 42032 96% 631 12% /home
/dev/hd11admin 524288 523488 1% 5 1% /admin
/proc - - - - - /proc
/dev/hd10opt 4194304 3453936 18% 9152 3% /opt
/dev/livedump 524288 523552 1% 4 1% /var/adm/ras/livedump
/dev/pocdbbacklv 626524160 578596720 8% 8 1% /proddbback
/dev/fspoclv 1254359040 1033501496 18% 2064 1% /cl3data
/dev/fspocdbloglv 206438400 193491536 7% 110 1% /cl3logs
/dev/poclv 1254359040 1033501480 18% 2064 1% /proddb
/dev/pocdbloglv 206438400 193158824 7% 115 1% /proddblog
/dev/datalv2 836239360 615477152 27% 2064 1% /datatest2
/dev/loglv2 208404480 195088848 7% 118 1% /logtest2
bash-3.00#
bash-3.00$ umount /datatest2/
bash-3.00# umount /logtest2/
bash-3.00# lsvg
rootvg
pocdbbackvg
dataclvg
logsclvg
pocvg
pocdblogvg
datavg2
logvg2
bash-3.00# varyoffvg datavg2
NOTE: remember the hdisk and VG names for the exported VGs.
bash-3.00# lspv
hdisk0 00f62aa942cec382 rootvg active
hdisk1 none None
hdisk2 00f62aa997091888 pocvg active
hdisk3 00f62aa9a608de30 dataclvg active
hdisk4 00f62aa9a60970fc logsclvg active
hdisk10 00f62aa9972063c0 pocdblogvg active
hdisk11 00f62aa997435bfa pocdbbackvg active
hdisk5 00f62aa9a6798a0c datavg2
hdisk6 00f62aa9a6798acf datavg2
hdisk7 00f62aa9a6798b86 datavg2
hdisk8 00f62aa9a6798c36 datavg2
hdisk9 00f62aa9a67d6c9c logvg2 active
hdisk12 00f62aa9a67d6d51 logvg2 active
bash-3.00# varyoffvg logvg2
bash-3.00# lsvg
rootvg
pocdbbackvg
dataclvg
logsclvg
pocvg
pocdblogvg
datavg2
logvg2
bash-3.00# exportvg datavg2
bash-3.00# exportvg logvg2
bash-3.00#
bash-3.00# exportvg datavg2
bash-3.00# exportvg logvg2
bash-3.00# cfgmgr
bash-3.00# lspv
hdisk0 00f62aa942cec382 rootvg active
hdisk1 none None
hdisk2 00f62aa997091888 pocvg active
hdisk3 00f62aa9a608de30 dataclvg active
hdisk4 00f62aa9a60970fc logsclvg active
hdisk10 00f62aa9972063c0 pocdblogvg active
hdisk11 00f62aa997435bfa pocdbbackvg active
hdisk5 00f62aa9a6798a0c None
hdisk6 00f62aa9a6798acf None
hdisk7 00f62aa9a6798b86 None
hdisk8 00f62aa9a6798c36 None
hdisk13 00f62aa9a6798a0c None
hdisk14 00f62aa9a6798acf None
hdisk15 00f62aa9a6798b86 None
hdisk9 00f62aa9a67d6c9c None
hdisk12 00f62aa9a67d6d51 None
hdisk16 00f62aa9a6798c36 None
hdisk17 00f62aa9a67d6c9c None
hdisk18 00f62aa9a67d6d51 None
bash-3.00#
Notice the duplicate PVIDs. Use the recreatevg command naming all of the new disks in each volume group of the newly mapped clones.
bash-3.00# recreatevg -y dataclvg2 hdisk13 hdisk14 hdisk15 hdisk16
dataclvg2
bash-3.00# recreatevg -y logclvg2 hdisk17 hdisk18
logclvg2
bash-3.00# importvg -y datavg2 hdisk5
datavg2
bash-3.00# importvg -y logvg2 hdisk9
logvg2
bash-3.00# lspv
hdisk0 00f62aa942cec382 rootvg active
hdisk1 none None
hdisk2 00f62aa997091888 pocvg active
hdisk3 00f62aa9a608de30 dataclvg active
hdisk4 00f62aa9a60970fc logsclvg active
hdisk10 00f62aa9972063c0 pocdblogvg active
hdisk11 00f62aa997435bfa pocdbbackvg active
hdisk5 00f62aa9a6798a0c datavg2 active
hdisk6 00f62aa9a6798acf datavg2 active
hdisk7 00f62aa9a6798b86 datavg2 active
hdisk8 00f62aa9a6798c36 datavg2 active
hdisk13 00f62aa9c63a5ec2 dataclvg2 active
hdisk14 00f62aa9c63a5f9b dataclvg2 active
hdisk15 00f62aa9c63a6070 dataclvg2 active
hdisk9 00f62aa9a67d6c9c logvg2 active
hdisk12 00f62aa9a67d6d51 logvg2 active
hdisk16 00f62aa9c63a6150 dataclvg2 active
hdisk17 00f62aa9c63bf6b2 logclvg2 active
hdisk18 00f62aa9c63bf784 logclvg2 active
bash-3.00#
Notice the PVID numbers are all unique now.
Remount the original file systems:
bash-3.00# mount /datatest2
bash-3.00# mount /logtest2
bash-3.00#
Create new mount points and edit /etc/filesystems:
bash-3.00# mkdir /dataclone1test2
bash-3.00# mkdir /logclone1test2
bash-3.00# cat /etc/filesystems
…
/fs/datatest2:
        dev             = /dev/fsdatalv2
        vfs             = jfs2
        log             = /dev/fsloglv03
        mount           = true
        check           = false
        options         = rw
        account         = false

/fs/logtest2:
        dev             = /dev/fsloglv2
        vfs             = jfs2
        log             = /dev/fsloglv04
        mount           = true
        check           = false
        options         = rw
        account         = false

/datatest2:
        dev             = /dev/datalv2
        vfs             = jfs2
        log             = /dev/loglv03
        mount           = true
        check           = false
        options         = rw
        account         = false

/logtest2:
        dev             = /dev/loglv2
        vfs             = jfs2
        log             = /dev/loglv04
        mount           = true
        check           = false
        options         = rw
        account         = false
bash-3.00#
Notice that the cloned file systems' mount points are prefixed with /fs by the recreatevg command, and the logical volume names were changed to prevent duplicate entries in /dev. Update /etc/filesystems with the mount points created previously.
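After the edit, the clone stanzas should look something like this (only the stanza names change; the device and log volume names stay as recreatevg set them):

/dataclone1test2:
        dev             = /dev/fsdatalv2
        vfs             = jfs2
        log             = /dev/fsloglv03
        mount           = true
        check           = false
        options         = rw
        account         = false

/logclone1test2:
        dev             = /dev/fsloglv2
        vfs             = jfs2
        log             = /dev/fsloglv04
        mount           = true
        check           = false
        options         = rw
        account         = false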
bash-3.00# mount /dataclone1test2
Replaying log for /dev/fsdatalv2.
bash-3.00# mount /logclone1test2
Replaying log for /dev/fsloglv2.
bash-3.00# df
Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 1048576 594248 44% 13064 17% /
/dev/hd2 20971520 5376744 75% 49070 8% /usr
/dev/hd9var 2097152 688232 68% 11373 13% /var
/dev/hd3 2097152 1919664 9% 455 1% /tmp
/dev/hd1 1048576 42032 96% 631 12% /home
/dev/hd11admin 524288 523488 1% 5 1% /admin
/proc - - - - - /proc
/dev/hd10opt 4194304 3453936 18% 9152 3% /opt
/dev/livedump 524288 523552 1% 4 1% /var/adm/ras/livedump
/dev/pocdbbacklv 626524160 578596720 8% 8 1% /proddbback
/dev/fspoclv 1254359040 1033501496 18% 2064 1% /cl3data
/dev/fspocdbloglv 206438400 193491536 7% 110 1% /cl3logs
/dev/poclv 1254359040 1033501480 18% 2064 1% /proddb
/dev/pocdbloglv 206438400 193158824 7% 115 1% /proddblog
/dev/datalv2 836239360 615477152 27% 2064 1% /datatest2
/dev/loglv2 208404480 195088848 7% 118 1% /logtest2
/dev/fsdatalv2 836239360 615477160 27% 2064 1% /dataclone1test2
/dev/fsloglv2 208404480 195744288 7% 114 1% /logclone1test2
bash-3.00#
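For subsequent refreshes, the steps outlined earlier condense to something like the following sketch, reusing the VG, hdisk, and mount point names from the example above. It assumes the hdisk numbers come back unchanged after cfgmgr, which the remembered ODM associations make likely.
bash-3.00# umount /dataclone1test2
bash-3.00# umount /logclone1test2
bash-3.00# varyoffvg dataclvg2
bash-3.00# varyoffvg logclvg2
(refresh the clones on the storage system)
bash-3.00# cfgmgr
bash-3.00# exportvg dataclvg2
bash-3.00# exportvg logclvg2
bash-3.00# recreatevg -y dataclvg2 hdisk13 hdisk14 hdisk15 hdisk16
bash-3.00# recreatevg -y logclvg2 hdisk17 hdisk18
(edit /etc/filesystems again to point the /fs-prefixed mount points back at /dataclone1test2 and /logclone1test2)
bash-3.00# mount /dataclone1test2
bash-3.00# mount /logclone1test2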
Monday, August 15, 2011
Virtual Machine Migration Fails
Through an SSH session I validated the snapshot VMDK and VMDK flat file existed in the proper directory. The VMDK flat file had a non-zero size, suggesting to me this might be a configuration problem.
Out of curiosity I examined the snapshot’s VMDK configuration file and noticed the parentFileNameHint entry contained the full pathname, using the VMFS GUID value. Hmm. Is that the proper GUID? No, it wasn’t.
Since the VM had other snapshots I reviewed those configuration files as well and noticed they used relative path names for the parentFileNameHint. Could it be that simple?
I edited the snapshot VMDK configuration file and removed the full path qualification.
Problem solved.
In my example above (which was reproduced in my lab), I changed:
parentFileNameHint="/vmfs/volumes/4ab3ebbc-46f3c941-7c14-00144fe69d58/retro-000001.vmdk"
To:
parentFileNameHint="retro-000001.vmdk"
Monday, August 8, 2011
FCIP considerations for 10 GigE on Brocade FX8-24
Monday, August 1, 2011
Storage Performance Concepts Entry 6
Monday, July 25, 2011
Demonstration of Hitachi Dynamic Tiering
Monday, July 18, 2011
Hitachi Dynamic Provisioning (HDP) in practice
In brief, HDP brings three different things to the table:
- Wide striping - data from each LUN in an HDP pool is evenly distributed across the drives in the pool.
- Thin provisioning - space is only consumed from an HDP pool when data is written from the host. In addition, through Zero Page Reclaim (ZPR), you can recover unused capacity.
- Faster allocation - In a non-HDP environment there were two options: you could either have predetermined LUN sizes and format the array ahead of time, or create custom LUNs on-demand and wait for the format. With HDP you can create custom-sized LUNs and begin using them immediately.
Monday, July 11, 2011
ESX Site Recovery Manager with NetApp Storage
Other things you will need:
- A NetApp head at each site, running at least Data ONTAP 7.2.4
- A SnapMirror license installed at each site
- A SnapMirror relationship defined and established for your primary datastore (a rough command sketch follows this list)
- A FlexClone license (required only to enable the test failover function, as demonstrated in the video)
- The datastore FlexVols can only have a single SnapMirror relationship, which is to the secondary location. No daisy-chains. This also limits the ability to have multiple recovery sites for a single primary site.
- Replication should be done with plain old Volume SnapMirror. (Qtree SnapMirror might work and isn't explicitly unsupported, but it would be an unwise plan.)
- SyncMirror, however, is explicitly unsupported in conjunction with SRM. That should be less of an issue: if you're lucky enough to have SyncMirror, your single ESX cluster should probably span the two sites, so SRM isn't required. You can still run regular SnapMirror alongside SyncMirror to get the VMs off to a third, more distant location.
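Establishing that datastore SnapMirror relationship (the requirement noted above) looks roughly like this on a 7-mode pair. The filer and volume names are placeholders, and the commands run on the destination filer against a restricted destination volume:
dstfiler> vol restrict vm_datastore_mirror
dstfiler> snapmirror initialize -S srcfiler:vm_datastore dstfiler:vm_datastore_mirror
dstfiler> snapmirror status
An update schedule then goes in /etc/snapmirror.conf on the destination, for example every 15 minutes:
srcfiler:vm_datastore dstfiler:vm_datastore_mirror - 0,15,30,45 * * *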
Monday, July 4, 2011
Storage Performance Concepts Entry 5
Monday, June 27, 2011
Performing a CommVault CommServe Recovery
Disregard the following error message. You will be overwriting the existing default CommServe database.