Sunday, March 14, 2010

Building Large File Systems on Hitachi High Performance NAS and AMS

High Performance NAS, or HNAS, is Hitachi’s OEM version of the BlueArc Titan system. At a high level, HNAS is a NAS gateway built on Field Programmable Gate Arrays (FPGAs), with bandwidth dedicated to each function. HDS offers HNAS attached to either its enterprise or midrange storage arrays.

One of the major benefits of HNAS is its ability to scale. An HNAS cluster can scale to 8 nodes, 4PB of capacity and file system sizes of up to 256TB.
We recently deployed a 4 node HNAS 3200 cluster with around 600TB of capacity. The intention is to grow the environment to at least 2PB and possibly the 4PB maximum. As with any technology there are configuration guidelines and these often become more important as you approach the maximum capabilities of the technology.

In our configuration each HNAS node has eight 4Gb Fibre Channel ports and two 10Gb Ethernet ports. We are using three AMS 2500 storage arrays with a mix of SAS and SATA drives. The SATA drives are used for large sequential IO and the SAS drives for random or mixed IO.

When architecting a solution of this type there are a number of key parameters that must be considered. Best practices are certainly available, but often they don’t directly apply when a configuration will be pushed to the outer edge of its capabilities. This is true for more than just HNAS. For example, the HDS USP V can scale to 247PB of internal and external storage, but if I were asked to architect one this size I’d need to pack a lunch!

HNAS Storage Management Concepts

The first thing you need to understand is how storage is allocated and organized within HNAS. The following diagram is from the HNAS Administration guide and illustrates, at a high level, how storage is organized.

Although the diagram is helpful, it does not fully explain all the pieces involved or their relationships. For clarity we need to expand on the diagram.

  • Physical Disk Drives are grouped into array groups within the disk subsystem, and LDEVs or LUNs are carved out of these array groups and presented to HNAS. When using standard array groups rather than Hitachi Dynamic Provisioning (HDP), it is recommended that each array group be presented as a single LUN. It is also recommended that RAID 5 groups be created as 7+1 and RAID 6 groups as 8+2. From the HNAS console you must allow access to the assigned LUNs, at which point they become System Drives (SDs). A cluster can be assigned a maximum of 256 LUNs; however, 128 is the recommended maximum because of the amount of time it would take to migrate the LUNs from one node to another in the event of a failure. This limitation will be corrected in future code releases.

What may not be clear is that if you follow the RAID Group size recommendations and each array group is presented as a single LUN, you may limit the scalability of the cluster.

For example, using 450GB SAS drives in a 7+1 configuration results in each LUN being ~3.15TB (7 data drives x 450GB). Multiplied by the maximum of 256 LUNs, that equals 806.4TB, well short of the 4PB cluster maximum. For this reason it is necessary to use Dynamic Provisioning to meet the maximum possible capacity. HDP allows you to create significantly larger LUNs and reach the 4PB maximum cluster configuration. In our environment we were given permission to use HDP although this feature was not yet GA. We were asked not to release specific information about the HDP configuration until it is officially announced. However, there is no magic: all of the same parameters must be configured, they are just set a certain way for HDP.
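The arithmetic above can be reproduced with a short Python sketch. The drive size, RAID layout and LUN limits are the figures from the text; treating 4PB as 4000TB (decimal units, to match the 3.15TB figure) is my assumption, and the function names are purely illustrative.

```python
# Figures from the text: 450GB drives, 7+1 RAID 5 (7 data drives),
# 256 LUNs maximum per cluster, 128 recommended.
DRIVE_GB = 450
DATA_DRIVES = 7
MAX_LUNS = 256
RECOMMENDED_LUNS = 128
CLUSTER_MAX_TB = 4000  # assumption: 4PB expressed in decimal TB


def lun_size_tb(drive_gb, data_drives):
    """Usable size of one array group presented as a single LUN, in TB."""
    return drive_gb * data_drives / 1000


def cluster_capacity_tb(lun_tb, lun_count):
    """Total capacity when every LUN has the same size."""
    return lun_tb * lun_count


def lun_size_needed_tb(target_tb, lun_count):
    """LUN size required to reach a target capacity with a given LUN count."""
    return target_tb / lun_count


lun_tb = lun_size_tb(DRIVE_GB, DATA_DRIVES)
print(f"LUN size: {lun_tb:.2f} TB")
print(f"Capacity at {MAX_LUNS} LUNs: {cluster_capacity_tb(lun_tb, MAX_LUNS):.1f} TB")
print(f"Capacity at {RECOMMENDED_LUNS} LUNs: {cluster_capacity_tb(lun_tb, RECOMMENDED_LUNS):.1f} TB")
print(f"LUN size needed for 4PB: {lun_size_needed_tb(CLUSTER_MAX_TB, MAX_LUNS):.3f} TB")
```

Under these assumptions each LUN would need to be roughly five times larger than a single 7+1 group of 450GB drives to reach the cluster maximum, which is why larger LUNs are required.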

  • System Drives. As system drives are added they must be placed into System Drive Groups, even if there will only be a single drive in each group. The purpose of SD Groups is to make HNAS aware of the underlying array groups that each LUN is created from. LUNs in the same SD group are assumed to be from the same array group and therefore HNAS will not access SDs from the same group at the same time. Let’s consider what this means.
If you follow best practices and each array group is a single LUN, each SD group will contain only a single LUN. Adding 16 SDs should result in 16 SD groups.

If you add 16 SDs and put them in a single group HNAS will only write to 1 LUN at a time. This would have a major negative impact on performance.

  • Stripe Sets. Once access has been allowed to the LUNs and the appropriate SD groups have been created, the SD groups are placed into a stripe set. Stripe sets are exactly what they sound like: if you create a 4 way stripe the data is distributed across all 4 SDs, and with an 8 way stripe it is distributed across all 8 SDs. The interesting thing is that the interface and associated documentation don’t make this clear. In fact, the GUI gives no indication that stripe sets exist; you can’t see them and you can’t manage them. In addition, there really isn’t a step where you define stripe sets. They are simply created based on the LUNs that you add to a storage pool at a given time. For example, initially I may add 10 SDs to a pool (remember that this would also be 10 SD groups). I now have a 10 way stripe. Later I add another 3 SD groups to the pool, which creates a second stripe set, this one a 3 way stripe, within that pool. Later still I add another 5 SD groups, and I now have 3 stripe sets.

o Storage Pool 1
  • 10 Way stripe
  • 3 Way Stripe
  • 5 Way stripe

It should be obvious that this is less than ideal, since each stripe set would have different performance characteristics. For this reason it is recommended that SD groups always be added in the same increments, for example always in groups of 8. It should also be clear that this impacts granularity. If I always add 8 SDs at a time and each SD is 5TB in size, then the smallest amount I will want to allocate is 40TB. Understand that this is not a hard limitation: you can add stripe sets of different sizes and even SDs of different sizes, but you shouldn’t.
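To make the stripe set behavior concrete, here is a small Python model of it. It is only a sketch of the behavior described above, not anything HNAS exposes; the function names are hypothetical, and the pool is represented simply as a list of stripe widths.

```python
def add_to_pool(pool, sd_count):
    """Model one pool expansion: the SDs added together form one stripe set."""
    pool.append(sd_count)
    return pool


def is_uniform(pool):
    """True when every stripe set in the pool has the same width."""
    return len(set(pool)) <= 1


def expansion_granularity_tb(stripe_width, sd_size_tb):
    """Smallest capacity you would add if you always keep the same stripe width."""
    return stripe_width * sd_size_tb


# The example from the text: 10 SDs, then 3, then 5.
pool = []
for added in (10, 3, 5):
    add_to_pool(pool, added)

print("Stripe widths:", pool)
print("Uniform:", is_uniform(pool))
print("Granularity with 8 x 5TB SDs:", expansion_granularity_tb(8, 5), "TB")
```

Keeping the list of stripe widths uniform is the modeling equivalent of the recommendation to always expand the pool in the same increment.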

  • Storage Pools. Storage pools, known as spans on the command line, are the logical container for one or more SD groups. As mentioned previously, SD groups are technically organized into stripe sets that are then organized into pools. Storage is allocated from the storage pool to file systems in small allocations called Chunks. When you define a storage pool you also define a chunk size, which controls how much capacity is given to a file system at a time. By default, Chunks are only allocated as needed, allowing HNAS to provide thin provisioning. You also have the option of allocating all of the Chunks upon file system creation. We found preallocating Chunks to be necessary in certain instances.
From a client perspective the file system size is based on the Chunks that have been allocated, not the size that it can grow to. This can lead to a client believing there is insufficient space available.

The metric used to determine when to add Chunks to a file system appears to be based on the percentage of free space available from the Chunks that have already been allocated.

In our case the client application would check the free space prior to writing, determine that there was not enough space available, and error out. By the time we looked at the file system it would have allocated more Chunks and appear to be fine. Preallocating Chunks for the entire file system solved this issue. There does not appear to be a method to modify how Chunks are allocated; it is either auto or full.

Chunk size is important because a Storage Pool can have a maximum of 16,384 Chunks and a single file system can be allocated a maximum of 1023 Chunks. Choosing a Chunk size that is too small will limit how large you can scale a file system.
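The effect of chunk size on scalability is easy to work out from the two limits above. The following Python sketch does the arithmetic; the limits come from the text, while the decimal unit conversion (1TB = 1000GB), the function names, and the assumption that any whole-GB chunk size can be chosen are mine.

```python
import math

MAX_CHUNKS_PER_POOL = 16384  # from the text
MAX_CHUNKS_PER_FS = 1023     # from the text


def max_fs_tb(chunk_gb):
    """Largest file system reachable with a given chunk size."""
    return chunk_gb * MAX_CHUNKS_PER_FS / 1000


def max_pool_tb(chunk_gb):
    """Largest storage pool reachable with a given chunk size."""
    return chunk_gb * MAX_CHUNKS_PER_POOL / 1000


def min_chunk_gb_for_fs(fs_tb):
    """Smallest whole-GB chunk size that lets a file system grow to fs_tb."""
    return math.ceil(fs_tb * 1000 / MAX_CHUNKS_PER_FS)


# A hypothetical 10GB chunk size caps a file system at about 10TB.
print(f"10GB chunks: file system up to {max_fs_tb(10):.2f} TB, "
      f"pool up to {max_pool_tb(10):.2f} TB")

# Chunk size needed if a file system must be able to reach 256TB.
print(f"Chunk size for a 256TB file system: {min_chunk_gb_for_fs(256)} GB or larger")
```

Running the numbers before creating the pool matters because the chunk size is defined when the storage pool is defined.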

  • File Systems are the main storage component of the HNAS Platform. A File System is created from a single storage pool and allocated capacity in Chunks based on the Chunk Size defined for the storage pool. File Systems consume capacity based on their block size, which is either 4K or 32K.

HNAS works as advertised, and very large file systems and clusters can be created. Based on our experience, HNAS provides excellent performance even with varied IO profiles. The key is to ensure that when you design the architecture you understand all of the components involved and the impact they have on one another.

HNAS Parameters and Thresholds


Wednesday, March 10, 2010

Warning when using HDS Storage Navigator Modular 2 (SNM2) to edit Host Groups on AMS2000 storage arrays

The following is a warning for customers using HDS Storage Navigator Modular 2 (SNM2) to edit Host Groups on AMS2000 storage arrays. We wanted to pass this information along in case you run into this situation.

When using Storage Navigator Modular 2 (SNM2) to edit a host group on an AMS2000 array you have the option to select multiple ports (see screenshot #1 below). You should not select multiple ports when editing host groups unless you have configured all the ports so that the host group numbers match exactly. If you do select multiple ports and the host group numbers do not match, then you risk losing access to your storage for the given host group number on the other ports.

When you select multiple ports SNM2 modifies the host groups based on group number, not name. As an example, screenshot #1 below will modify Group 002 on each port that you have selected regardless of what you have named the host groups. Host Group 002 may or may not have the same configuration (Name, associated LUN mappings, etc) across the ports based on the order in which you have created host groups across the different ports.

If you do select multiple ports you will get the warning that is shown in screenshot #2 that you are editing host groups across multiple ports. The key here is that SNM2 is going to edit the Host Group on the other ports based on Host Group number and not the Host Group Name.

Refer to Screenshot #3 and look at Host Group 003 across all the ports. Host Group 003 is different for ports 0A and 0B. If you were modifying Host Group 003 on port 0A and selected all ports you would overwrite the Host Group 003 on ports 0B and 1B.

As another example, look at the host group named “Test Group 3” across all ports and notice that it has a different Host Group number on port 0A than on port 0B. For this reason we recommend that you do not select multiple ports when editing Host Groups on the AMS2000 arrays using SNM2.
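If you keep your own record of host group numbers and names per port, the risk described above can be checked before making a multi-port edit. The Python sketch below is only an illustration of that check: it does not call SNM2 or any HDS interface, the port layout is made-up sample data, and it compares names only, whereas a real comparison would also need to cover LUN mappings and other settings.

```python
def mismatched_group_numbers(ports):
    """Return {group_number: {port: name_or_None}} for every host group
    number that does not carry the same name on every port."""
    numbers = set()
    for groups in ports.values():
        numbers.update(groups)

    mismatches = {}
    for number in sorted(numbers):
        names = {port: groups.get(number) for port, groups in ports.items()}
        if len(set(names.values())) > 1:
            mismatches[number] = names
    return mismatches


# Hypothetical inventory: port -> {host group number: host group name}
ports = {
    "0A": {1: "Prod DB", 2: "Test Group 2", 3: "Test Group 3"},
    "0B": {1: "Prod DB", 2: "Test Group 2", 3: "Backup", 4: "Test Group 3"},
}

for number, names in mismatched_group_numbers(ports).items():
    print(f"Host Group {number:03d} differs across ports: {names}")
```

Any group number reported here is one where selecting multiple ports would change a different host group than the one you intended.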

If you have any questions you can contact Lumenate at (866) 358-8999.

Screenshot #1

Screenshot #2

Screenshot #3


Sunday, March 7, 2010

Storage Performance Concepts - Entry # 1

Two key metrics are used to define the performance capabilities of storage arrays and for that matter the underlying physical disk drives themselves. The physical disk drives are what we will be discussing here, since this seems to be the area most people are interested in when evaluating a new storage solution. Notice that I said disk drives, and by this I mean those things with spinning disk in them. Solid State Drives are another matter and we will cover them another time.

IOPS – Refers to the number of read or write IOs that a device can perform per second. In general the IOPS rating provided by the manufacturer (if it is provided at all) is based on random IO and given for both read and write operations. The difference between read and write capabilities is based on the fact that the access time is higher for write operations than it is for read operations. So what if the manufacturer does not provide IOPS information? How can you determine the drive’s IOPS capabilities?

Here is the generally accepted formula along with an example:

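IOPS = 1 / (Average Latency + Average Seek Time), with both times expressed in seconds

With both times in ms, this is the same as 1000 / (Average Latency in ms + Average Seek Time in ms).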

Notice that the Fibre Channel and SAS drives provide the same IOPS. This is because, with the exception of the interface, these drives are the same, and the interface has no impact on IOPS. You may also notice that I have not indicated the size of the drives. This is intentional. While there are performance differences between drive generations, the drive size has no impact on IOPS. IOPS capabilities are based on the speed of the drive: the rotational speed (RPM) and the seek time, which when combined are referred to as access time.
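The formula is simple enough to put into a few lines of Python. In this sketch the average rotational latency is derived from the RPM as the time for half a revolution; the seek times are illustrative values I picked for the example, not figures from any manufacturer's data sheet, so check the specifications of the actual drives you are evaluating.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: the time for half a revolution, in ms."""
    return 0.5 * 60_000 / rpm


def estimated_iops(avg_latency_ms, avg_seek_ms):
    """IOPS estimate: 1000 / (average latency + average seek time), times in ms."""
    return 1000 / (avg_latency_ms + avg_seek_ms)


# Assumed average seek times in ms, for illustration only.
example_drives = {
    "15K RPM": (15000, 3.5),
    "10K RPM": (10000, 4.5),
    "7.2K RPM": (7200, 8.5),
}

for name, (rpm, seek_ms) in example_drives.items():
    latency_ms = rotational_latency_ms(rpm)
    iops = estimated_iops(latency_ms, seek_ms)
    print(f"{name}: latency {latency_ms:.2f} ms, seek {seek_ms} ms, ~{iops:.0f} IOPS")
```

Note that neither the interface type nor the capacity appears anywhere in the calculation, which is the point made above.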

MBps – Refers to the megabytes that can be transferred from a device per second. The MBps rating most often provided by the manufacturer is the Interface Transfer Rate. For example, a 4Gb FC drive has an Interface Transfer Rate of 400MBps, and a 3Gb SAS device has, you guessed it, 300MBps. This can be somewhat confusing, since the Drive Transfer Rate or Sustained Transfer Rate (the MBps that the drive itself can actually handle) is always less than the Interface Transfer Rate.

In general when evaluating disk drive options from a MBps standpoint you should be concerned with the Sustained Transfer Rate not the Interface Transfer Rate. Interface Transfer Rate is only relevant when you are accessing more than one disk device.

The Sustained Transfer Rate is normally provided as a range such as 198 to 119MBps. This is because the amount of data that can be transferred per second is higher on the outer tracks of the drive surface than it is as you move towards the center. So when it comes to MBps use the Sustained Transfer Rate as a guideline.

Now that we have covered the metrics used to describe the performance capabilities of a drive, how should you use this information?

Whether you should be concerned with IOPS or MBps depends on how you intend to use the drives – the application requirements and IO profile. Here are some general rules of thumb.

• Applications with a random IO profile such as databases and email servers typically need IOPS more so than MBps.

• Applications with a sequential IO profile such as video or audio streaming, File servers and disk backup targets usually need MBps more so than IOPS.

Again, these are rules of thumb, not hard and fast rules. IOPS and MBps are related; it is just that an application will typically be limited by one or the other depending on its IO profile.
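The relationship between the two metrics is that throughput equals IOPS multiplied by the IO size. The Python sketch below uses that to show which limit a drive reaches first at a given IO size. It is a deliberate simplification: the 180 IOPS figure is a hypothetical random-IO rating, the 119MBps figure is the low end of the sustained range quoted earlier, and decimal units (1MB = 1000KB) are assumed.

```python
def throughput_mbps(iops, io_size_kb):
    """Throughput implied by an IO rate at a given IO size."""
    return iops * io_size_kb / 1000


def limiting_factor(iops_limit, mbps_limit, io_size_kb):
    """Which drive limit is reached first at this IO size."""
    if throughput_mbps(iops_limit, io_size_kb) < mbps_limit:
        return "IOPS"
    return "MBps"


DRIVE_IOPS = 180   # assumption: hypothetical random-IO rating
DRIVE_MBPS = 119   # low end of the sustained range quoted in the text

for io_size_kb in (8, 64, 1024):
    implied = throughput_mbps(DRIVE_IOPS, io_size_kb)
    bound = limiting_factor(DRIVE_IOPS, DRIVE_MBPS, io_size_kb)
    print(f"{io_size_kb}KB IOs: {implied:.2f} MBps at {DRIVE_IOPS} IOPS, limited by {bound}")
```

With small IOs the drive runs out of IOPS long before it approaches its sustained transfer rate, while with very large IOs the transfer rate becomes the ceiling, which matches the rules of thumb above.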

Simple, right? Well, not so fast. While the capabilities of an individual drive are important, most of the time you will be grouping these drives into some type of RAID configuration, and each RAID configuration uses the available IOPS in a different way. So there’s a little more to consider. Look for information on RAID configurations in entry # 2.

Thanks to Tom Granberry for helping with this post.