Sunday, March 14, 2010

Building Large File Systems on Hitachi High Performance NAS and AMS

High Performance NAS, or HNAS, is Hitachi’s OEM version of the BlueArc Titan system. At a high level, HNAS is a NAS gateway built using Field Programmable Gate Arrays (FPGAs), with bandwidth dedicated to each function. HDS offers HNAS attached to either its enterprise or midrange storage arrays.

One of the major benefits of HNAS is its ability to scale. An HNAS cluster can scale to 8 nodes, 4PB of capacity and file system sizes of up to 256TB.
We recently deployed a 4-node HNAS 3200 cluster with around 600TB of capacity. The intention is to grow the environment to at least 2PB and possibly to the 4PB maximum. As with any technology, there are configuration guidelines, and these often become more important as you approach the maximum capabilities of the technology.

In our configuration each HNAS node has eight 4Gb Fibre Channel ports and two 10Gb Ethernet ports. We are using three AMS 2500 storage arrays with a mix of SAS and SATA drives. SATA drives are used for large sequential IO and SAS drives for random or mixed IO.

When architecting a solution of this type there are a number of key parameters that must be considered. Best practices are certainly available, but often they don’t directly apply when a configuration will be pushed to the outer edge of its capabilities. This is true for more than just HNAS. For example, the HDS USP V can scale to 247PB of internal and external storage, but if I were asked to architect one this size I’d need to pack a lunch!

HNAS Storage Management Concepts

The first thing you need to understand is how storage is allocated and organized within HNAS. The following diagram, from the HNAS Administration guide, illustrates at a high level how storage is organized.

Although the diagram is helpful, it does not fully explain all the pieces involved or their relationships. For clarity we need to expand on the diagram.

  • Physical Disk Drives are grouped into array groups within the disk subsystem, and LDEVs or LUNs are carved out of these array groups and presented to HNAS. When using standard array groups rather than Hitachi Dynamic Provisioning (HDP), it is recommended that each array group be presented as a single LUN. It is also recommended that RAID 5 groups be created as 7+1 and RAID 6 groups as 8+2. From the HNAS console you must allow access to the assigned LUNs, at which point they become System Drives (SDs). A cluster can be assigned a maximum of 256 LUNs; however, 128 is the recommended maximum, based on the time it would take to migrate the LUNs from one node to another in the event of a failure. This limitation will be corrected in future code releases.

What may not be clear is that if you follow the RAID Group size recommendations and each array group is presented as a single LUN, you may limit the scalability of the cluster.

For example, 450GB SAS drives in a 7+1 configuration yield LUNs of about 3.15TB each (7 data drives x 450GB). Multiplied by the maximum of 256 LUNs, that comes to 806.4TB, well short of the 4PB cluster maximum. For this reason it is necessary to use Dynamic Provisioning to reach the maximum possible capacity. HDP allows you to create significantly larger LUNs and reach the 4PB maximum cluster configuration. In our environment we were given permission to use HDP although this feature was not yet GA. We were asked not to release specific information about the HDP configuration until it is officially announced. However, there is no magic: all of the same parameters must be configured, they are just set a certain way for HDP.
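The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function and constant names are my own, decimal units (1TB = 1000GB, 4PB = 4000TB) are an assumption, and like the figures in the text it counts data drives only and ignores any formatting overhead.

```python
# Capacity ceiling when each array group is presented as a single LUN.
# Drive size, RAID layout and LUN limits are the figures quoted in the text;
# decimal units (1TB = 1000GB, 4PB = 4000TB) are an assumption.

MAX_LUNS = 256          # maximum LUNs per cluster
RECOMMENDED_LUNS = 128  # recommended maximum


def lun_size_tb(drive_gb: float, data_drives: int) -> float:
    """Usable size in TB of one LUN that spans a whole array group."""
    return drive_gb * data_drives / 1000


def cluster_capacity_tb(lun_tb: float, lun_count: int) -> float:
    """Total capacity in TB if every LUN is the same size."""
    return lun_tb * lun_count


lun_tb = lun_size_tb(450, 7)  # RAID 5 7+1: seven data drives of 450GB
capacity_at_max = cluster_capacity_tb(lun_tb, MAX_LUNS)
capacity_at_recommended = cluster_capacity_tb(lun_tb, RECOMMENDED_LUNS)

# LUN size needed to reach 4PB with the maximum number of LUNs.
lun_tb_needed_for_4pb = 4000 / MAX_LUNS

print(f"LUN size:               {lun_tb:.2f}TB")
print(f"Capacity at 256 LUNs:   {capacity_at_max:.1f}TB")
print(f"Capacity at 128 LUNs:   {capacity_at_recommended:.1f}TB")
print(f"LUN size for 4PB:       {lun_tb_needed_for_4pb:.3f}TB")
```

Under these assumptions each LUN would need to be roughly five times larger than a 7+1 group of 450GB drives to reach 4PB, and larger still if you stay at the recommended 128 LUNs.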

  • System Drives. As system drives are added they must be placed into System Drive Groups, even if there will only be a single drive in each group. The purpose of SD Groups is to make HNAS aware of the underlying array groups that each LUN is created from. LUNs in the same SD group are assumed to be from the same array group and therefore HNAS will not access SDs from the same group at the same time. Let’s consider what this means.
If you follow best practices and each array group is a single LUN, each SD group will contain only a single LUN. Adding 16 SDs should result in 16 SD groups.

If you add 16 SDs and put them all in a single group, HNAS will only write to 1 LUN at a time. This would have a major negative impact on performance.

  • Stripe Sets. Once access has been allowed to the LUNs and the appropriate SD groups have been created, the SD groups are placed into a stripe set. Stripe sets are exactly what they sound like. If you create a 4 way stripe the data is distributed across all 4 SDs; with an 8 way stripe it is distributed across all 8 SDs. The interesting thing is that the interface and associated documentation don’t make this clear. In fact the GUI gives no indication that stripe sets exist: you can’t see them and you can’t manage them. In addition, there really isn’t a step where you define Stripe Sets. They are simply created based on the LUNs that you add to a Storage Pool at a given time. For example, initially I may add 10 SDs to a pool (remember that this would also be 10 SD Groups). I now have a 10 way stripe. Later I add another 3 SD Groups to the pool, which creates a second stripe set, a 3 way stripe, within that pool. Later still I add another 5 SD Groups, and I now have 3 Stripe Sets.

o Storage Pool 1
  • 10 Way stripe
  • 3 Way Stripe
  • 5 Way stripe

It should be obvious that this is less than ideal, since each stripe set would have different performance characteristics. For this reason it is recommended that SD groups always be added in the same increments, for example always in groups of 8. It should also be clear that this impacts granularity. If I always add 8 SDs at a time and each SD is 5TB in size, then the smallest amount I will want to allocate is 40TB. Understand that this is not a hard limitation: you can add Stripe Sets of different sizes and even SDs of different sizes, but you shouldn’t.
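A toy model helps show how a pool ends up with mismatched stripe sets. The Python below is purely illustrative: `StoragePool`, `add_sd_groups` and the other names are my own and do not correspond to any HNAS command or API. The only behavior it models is the one described above, where each expansion of a pool becomes its own stripe set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StoragePool:
    """Toy model of a pool: each expansion becomes its own stripe set."""
    # Width (number of SD groups) of each stripe set, in the order added.
    stripe_sets: List[int] = field(default_factory=list)

    def add_sd_groups(self, count: int) -> None:
        """Adding SD groups in one operation creates one new stripe set."""
        self.stripe_sets.append(count)

    def is_uniform(self) -> bool:
        """True when every stripe set in the pool has the same width."""
        return len(set(self.stripe_sets)) <= 1


def expansion_increment_tb(sds_per_add: int, sd_size_tb: float) -> float:
    """Smallest capacity step if SDs are always added in fixed increments."""
    return sds_per_add * sd_size_tb


# The example from the text: 10 SD groups, then 3, then 5.
pool = StoragePool()
for added in (10, 3, 5):
    pool.add_sd_groups(added)

print("Stripe set widths:", pool.stripe_sets)
print("Uniform widths:", pool.is_uniform())
print("Increment with 8 x 5TB SDs:", expansion_increment_tb(8, 5), "TB")
```

A pool that was always grown 8 SD groups at a time would report uniform widths, at the cost of a 40TB minimum step when each SD is 5TB.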

  • Storage Pools. Storage pools, known as spans on the command line, are the logical container for one or more SD Groups. As mentioned previously, SD groups are technically organized into Stripe Sets, which are then organized into pools. Storage is allocated from the Storage Pool to file systems in small allocations called Chunks. When you define a storage pool you also define a chunk size, which controls how much capacity is given to a file system at a time. By default Chunks are only allocated as needed, allowing HNAS to provide thin provisioning. You also have the option of allocating all of the chunks upon file system creation. We found preallocating Chunks to be necessary in certain instances.
From a client perspective, the file system size is based on the Chunks that have been allocated, not the size that it can grow to. This can lead to a client believing there is insufficient space available.

The metric used to determine when to add Chunks to a file system appears to be based on the percentage of free space available from the Chunks that have already been allocated.

In our case the client application would check the free space prior to writing, determine that there was not enough space available, and error out. By the time we looked at the file system it would have allocated more Chunks and appear to be fine. Preallocating Chunks for the entire file system solved this issue. There does not appear to be a method to modify how Chunks are allocated; it is either auto or full.

Chunk size is important because a Storage Pool can have a maximum of 16,384 Chunks and a single file system can be allocated a maximum of 1023 Chunks. Choosing a Chunk size that is too small will limit how large you can scale a file system.
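To see how chunk size caps growth, the limits above can be turned into a quick calculation. This is a rough Python sketch with names of my own choosing. It assumes decimal units (1TB = 1000GB) and uses a hypothetical 10GB chunk size for illustration. It does not check which chunk sizes the product will actually accept.

```python
import math

MAX_CHUNKS_PER_POOL = 16_384  # limit quoted in the text
MAX_CHUNKS_PER_FS = 1_023     # limit quoted in the text


def max_fs_size_gb(chunk_gb: int) -> int:
    """Largest file system reachable with a given chunk size."""
    return chunk_gb * MAX_CHUNKS_PER_FS


def max_pool_size_gb(chunk_gb: int) -> int:
    """Largest storage pool reachable with a given chunk size."""
    return chunk_gb * MAX_CHUNKS_PER_POOL


def min_chunk_gb_for_fs(target_fs_gb: int) -> int:
    """Smallest whole-GB chunk size that lets a file system reach the target."""
    return math.ceil(target_fs_gb / MAX_CHUNKS_PER_FS)


# Hypothetical 10GB chunk size, chosen only for illustration.
small_chunk_fs_gb = max_fs_size_gb(10)
small_chunk_pool_gb = max_pool_size_gb(10)

# Chunk size needed for the 256TB file system maximum (256TB = 256000GB assumed).
chunk_gb_for_256tb = min_chunk_gb_for_fs(256_000)

print(f"10GB chunks: file system tops out at {small_chunk_fs_gb}GB")
print(f"10GB chunks: pool tops out at {small_chunk_pool_gb}GB")
print(f"Chunk size needed for a 256TB file system: {chunk_gb_for_256tb}GB")
```

In this model a 10GB chunk would stop a file system at a little over 10TB, while reaching 256TB would take chunks of roughly 251GB, which is why the chunk size needs to be decided before the pool is created rather than after.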

  • File Systems are the main storage component of the HNAS Platform. A File System is created from a single storage pool and allocated capacity in Chunks based on the Chunk Size defined for the storage pool. File Systems consume capacity based on their block size, which is either 4K or 32K.

HNAS works as advertised, and very large file systems and clusters can be created. Based on our experience, HNAS provides excellent performance even with varied IO profiles. The key is to ensure that when you design the architecture you understand all of the components involved and the impact they have on one another.

HNAS Parameters and Thresholds