The concept of storage tiering is relatively straightforward: rather than deploying one class of storage, you implement multiple classes so that the cost of the storage can be aligned with the performance, availability, and functional requirements of the applications and services it will support.
An example might be to use high-capacity SATA for backups, high-speed SAS in a RAID 10 configuration for databases, and RAID 5 for application servers. This would be considered basic storage tiering, and most organizations have already embraced the strategy. Although this alone certainly offered some cost savings, it had its limitations.
When you architect a solution like this, you are often making a bet on what the requirements for a given application will be. Availability and functional requirements usually remain constant, but performance requirements often change. You may initially deploy an application to RAID 5 and later determine that it does not provide the necessary performance and that RAID 10 would have been more appropriate. The reverse is also true: the initial requirements may indicate that RAID 10 is necessary, but upon implementation you discover that the performance demands were not nearly as high as anticipated and that RAID 5 would have worked fine. Essentially you have a tiered storage environment, but you don't necessarily have a good method for aligning the tiers with the applications.
The earliest method for solving this was to incorporate some type of data mobility solution that allowed LUNs or volumes to be migrated between tiers without an outage. If an application's storage performance requirements change, or the initial design turns out to be wrong, you can make adjustments without much difficulty or disruption to the application environment.
When these types of solutions initially hit the market they were positioned pretty aggressively. A common pitch was that you could move data between tiers as part of standard operating procedures. For example, the accounting databases could reside on RAID 5 most of the time but be migrated to RAID 10 for end-of-month processing.
The main challenge with this approach is a lack of granularity. LUNs and volumes have continued to grow, and the larger they get, the longer they take to migrate between tiers. These larger capacities also impact the performance of the array as a whole, since migration operations consume IOPS just like any other I/O. This led us to the latest evolution: sub-LUN-level tiering.
Sub-LUN-level tiering expands the tiering concept in two key ways.
- Each vendor's implementation differs, but in each case some unit of capacity smaller than a LUN is defined and allowed to migrate between tiers. The unit of capacity may be fixed or variable in size. Where new data is written also varies: some arrays always write new data to the highest-performing tier and then migrate it down based on access patterns, while others work in the reverse direction. In either case the result is that smaller amounts of data can be migrated, which takes less time than migrating complete LUNs and consumes fewer IOPS in the process. This approach should also require less high-performance storage and cost less.
- The second change is that migrations are now most frequently automated. Rather than the administrator selecting LUNs or volumes to move, the array migrates data automatically based on its algorithms and defined policies: the most frequently accessed data is promoted to the highest-performing tier, while infrequently accessed data trickles down to the lower-performing tiers (a rough sketch of this idea follows the list).
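As an illustration, here is a minimal Python sketch of what an access-count-based placement policy might look like. The fast-tier size, the counting scheme, and the simple "promote the hottest extents" rule are all assumptions made for the sketch; real arrays use their own proprietary algorithms and policies.

```python
from collections import Counter

SSD_EXTENTS = 4            # assumed fast-tier capacity, in extents

access_counts = Counter()  # extent id -> recent access count

def record_io(extent_id):
    """Count an IO against the extent it touched."""
    access_counts[extent_id] += 1

def rebalance():
    """Place the most-accessed extents on SSD; the rest stay on HDD."""
    ranked = [extent for extent, _ in access_counts.most_common()]
    return set(ranked[:SSD_EXTENTS]), set(ranked[SSD_EXTENTS:])

# Simulate a skewed workload: extent 7 is hot, the rest see one IO each.
for _ in range(1_000):
    record_io(7)
for extent in range(20):
    record_io(extent)

ssd_tier, hdd_tier = rebalance()
print(7 in ssd_tier)   # True: the hot extent lands on the fast tier
```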
This is a significant improvement and makes real-time, or at least near-real-time, migrations possible. One of the most common use cases for this approach is adding solid state disk (SSD) to an existing storage architecture. SSDs are incredibly fast, but they are also expensive when viewed from a capacity standpoint. For example, consider a 4TB database that needs 10,000 IOPS. The IOPS requirement can be met with only a few SSDs, but in order to meet the capacity requirement you would have to purchase multiple SSD array groups, driving up the cost and leaving tens of thousands of IOPS unused. Remember that just because the storage may be able to handle 100K IOPS doesn't mean you are going to use them. Sub-LUN tiering addresses this problem by letting you purchase a smaller amount of SSD, combine it with traditional spinning disk, and migrate data between the tiers.
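To make the capacity-versus-performance tension concrete, here is the sizing arithmetic as a short Python sketch. The per-SSD usable capacity and IOPS figures are assumptions chosen for illustration, not vendor specifications.

```python
import math

# Rough sizing arithmetic for the 4TB / 10,000 IOPS example above.
required_gb, required_iops = 4_000, 10_000
ssd_gb, ssd_iops = 400, 5_000   # assumed ~400GB usable, 5K IOPS per SSD

drives_for_iops = math.ceil(required_iops / ssd_iops)   # 2 drives are enough
drives_for_capacity = math.ceil(required_gb / ssd_gb)   # but we need 10
drives = max(drives_for_iops, drives_for_capacity)      # capacity wins

stranded = drives * ssd_iops - required_iops
print(f"buy {drives} SSDs; {stranded:,} IOPS sit unused")  # 40,000 unused
```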
This is a great tool and has the potential to lower costs significantly while maintaining or even improving performance. That said, since this is the shiny new feature in many vendor offerings, it is frequently positioned too aggressively and without enough consideration of how it will actually work in a given environment. Here are some things to consider.
- If the performance and capacity requirements can be met with traditional allocation methods, sub-LUN tiering may not offer much advantage. Taking the same 4TB database example from above, but with a 5K IOPS performance requirement rather than 10K, illustrates the point:
  - Using 450GB drives in a RAID 10 4+4 configuration, we get 1,250 IOPS per array group and 1.6TB of usable capacity. With 4 array groups that is 6.4TB usable and 5K IOPS. Assuming we pay $1,000 per drive, the solution costs $32,000 for 32 drives.
  - Solving the same problem with sub-LUN tiering might look as follows. Using 450GB drives in an 8+1 RAID 5 configuration, we need 2 array groups to meet the capacity requirement. This provides 6.5TB of capacity but only 2,130 IOPS. We add a tier of RAID 5 2+1 400GB SSD, which contributes another 768GB of capacity and another 10K IOPS. The combined solution offers 7.2TB of capacity and 12,130 IOPS. At $1K per spinning drive times 18 drives, plus $10K per SSD times 3 SSDs, the total is $48,000. In this scenario the sub-LUN tiering approach costs 50% more to meet the same requirements (the sketch after this list reproduces the math).
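For anyone who wants to check the math, this short Python sketch reproduces both configurations using the drive counts, capacities, IOPS, and prices assumed above.

```python
def solution(groups):
    """Sum capacity (TB), IOPS, and cost over a list of array groups.

    Each group is (count, usable_tb, iops, cost_per_group).
    """
    tb = sum(n * cap for n, cap, _, _ in groups)
    iops = sum(n * io for n, _, io, _ in groups)
    cost = sum(n * c for n, _, _, c in groups)
    return tb, iops, cost

# Traditional: 4 x RAID 10 (4+4) groups of 450GB drives.
# 8 drives per group at $1,000 each = $8,000 per group.
traditional = solution([(4, 1.6, 1250, 8 * 1000)])

# Tiered: 2 x RAID 5 (8+1) 450GB groups plus 1 x RAID 5 (2+1) 400GB SSD
# group. HDD group: 9 drives at $1,000; SSD group: 3 drives at $10,000.
tiered = solution([(2, 3.25, 1065, 9 * 1000),
                   (1, 0.768, 10000, 3 * 10000)])

for name, (tb, iops, cost) in [("traditional", traditional),
                               ("tiered", tiered)]:
    print(f"{name}: {tb:.2f}TB usable, {iops:,.0f} IOPS, ${cost:,.0f}")
# traditional: 6.40TB usable, 5,000 IOPS, $32,000
# tiered: 7.27TB usable, 12,130 IOPS, $48,000
```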
Some may look at my comparison here and say, "Well, that's not a very good example; you created a scenario to intentionally make the standard solution look better." That's true, and it is exactly the point I was trying to illustrate. In some cases you may not want to use sub-LUN tiering because you don't gain anything. The fact that a storage array can do sub-LUN tiering doesn't mean you should ignore the application requirements or stop evaluating traditional allocation methods.
In addition to the cost differences shown in the example above, you need to understand that some portion of read I/Os will be served by the lower-performing tiers. If that portion is small the solution will probably work fine, but if it is too large, overall performance will suffer. This is similar to what we see with file archiving solutions: archiving is a great way to lower costs by migrating infrequently accessed files to a lower-cost storage tier, but if the archiving policies are too aggressive the archive gets overworked and the user experience suffers. The same is true of sub-LUN tiering: the ratio of high-performance to high-capacity storage must be right for your workload.
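As a back-of-the-envelope check on that point, the following sketch computes average read latency as a function of how many reads are served from the fast tier. The per-tier service times are assumptions for illustration only.

```python
# Assumed per-tier read service times, in milliseconds.
ssd_ms, hdd_ms = 0.2, 8.0

for ssd_hit in (0.95, 0.80, 0.50):
    avg = ssd_hit * ssd_ms + (1 - ssd_hit) * hdd_ms
    print(f"{ssd_hit:.0%} of reads from SSD -> {avg:.2f} ms average")
# 95% -> 0.59 ms, 80% -> 1.76 ms, 50% -> 4.10 ms
```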
Unfortunately, many organizations do not have a good understanding of their performance requirements, much less how much of their data is hot (accessed frequently) versus cold (accessed infrequently). With unstructured data, such as traditional file servers, this is fairly easy to determine simply by looking at last access times. Within databases it becomes a bit more complicated. Vendors that offer sub-LUN tiering are actively working on tools to help assess the requirements, but they are application specific and not available for every scenario.
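For file data, one rough way to get a hot/cold split is to walk the file system and bucket files by last access time, as in this sketch. The path and the 30-day threshold are just examples, and keep in mind that access times are unreliable on volumes mounted with options like noatime.

```python
import os
import time

def hot_cold_split(root, days=30):
    """Count files accessed within the last `days` days versus older."""
    cutoff = time.time() - days * 86_400
    hot = cold = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            try:
                atime = os.stat(os.path.join(dirpath, name)).st_atime
            except OSError:
                continue  # skip files that vanish or can't be read
            if atime >= cutoff:
                hot += 1
            else:
                cold += 1
    return hot, cold

hot, cold = hot_cold_split("/srv/fileshare")   # example path
print(f"{hot} hot files, {cold} cold files")
```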
You also need to be cognizant of activities that may interfere with the algorithms that determine where data should reside. For example, backup operations may look to the array like standard access and be factored into placement decisions. To avoid this, some array manufacturers suggest suspending monitoring during backup windows. Depending on your backup environment, this can be challenging.
None of this is to suggest that sub-LUN tiering is bad, but rather that it is a tool like any other: you need to put some thought into how and where you will use it. We are currently working on a solution that will put roughly 400TB of database capacity into a sub-LUN tiering configuration. We have collected performance data on all of the databases involved over a reasonable sample period, and in architecting the solution we added a significant high-performance buffer to ensure we have the IOPS we need. When it comes time to implement, we will move the databases over one at a time and monitor performance closely.
This 400TB represents about one quarter of the total capacity in the environment. We are not using sub-LUN tiering for the rest of the environment because the I/O profile wouldn't benefit from it and performance would likely decrease.
As we work through the actual implementation, we will post what we find.
Great post!
What about the 5th entry regarding cache? Any progress there?
regards
Andreas
Andreas,
Thanks, I'm sure Terry will appreciate the compliment. He's on vacation this week, but look for the next installment the week of July 4th.