Monday, December 13, 2010

Data replication options - Part Two (Application-based replication)

At the top of the IT stack lies the application.  This is what the end-user interfaces with and what you're most likely to hear "is down" in the event of an outage.  Since the application is what you're actually trying to protect against a disaster it makes a great deal of sense to leverage any built-in options for data replication.

Luckily, applications have added built-in replication as the importance of disaster preparedness has increased.  A few brief examples:

Oracle offers Data Guard for its RDBMS.  While in the past this solution could have been dismissed as simple log shipping, today Oracle positions it as a complete disaster recovery solution.

For the popular Exchange Server, Microsoft offers Cluster Continuous Replication (CCR), while for SQL Server there are options both for log shipping and transactional replication.  A more detailed discussion of the SQL Server options is available here.

If you define "application" broadly enough to include infrastructure services, then there are typically options there as well.  DNS, LDAP, and Active Directory are all architecturally designed so that they may be deployed in a fashion that is redundant across multiple sites - the key is to recognize the need for this redundancy, deploy appropriately, and test.

Given that the application is what we're trying to protect, then why doesn't everyone just rely on application-based replication?  Well, there are a couple of considerations:

First of all, most environments have multiple applications that they're trying to protect against a disaster.  In the same way that the real test of a backup is whether or not you can restore, the real test of a disaster recovery solution is whether or not you can recover.  With an application-based replication solution, when a disaster happens you need a person available at the remote site who knows enough about each application to perform the steps required to bring it online.  If you're only running one application (if you're a Software as a Service provider, for example) then that's great - you have the necessary resources for that one application.  As you put more and more applications into the mix, though, the probability that you won't have the right resource available in an emergency increases.

A second reason is that leveraging application-based replication couples your disaster recovery solution to the support matrix of the application.  This means that as time progresses and the application moves through its life cycle, you have to include replication in your maintenance and upgrade considerations.

To summarize:

Pros:

  • Protects the environment at an easily understandable level.
  • Typically cost-effective.
  • No concerns around application support (as it is part of the application).
  • Often includes testing for logical corruption (which other approaches cannot).

Cons:

  • Increased complexity in environments with multiple applications, decreasing the probability of successful recovery of an entire environment in a disaster.
  • Couples replication to the application, meaning that application maintenance and upgrades must include testing and validation of replication.


Monday, December 6, 2010

Storage Performance Concepts Entry 4 - The Real World

In the previous three entries on this topic we discussed several key storage performance concepts.

The physical disk capabilities. Fibre Channel and SAS drives can handle more IOPS than SATA drives, making them a good choice for applications that generate a lot of random IO. From an MBPS standpoint SATA isn't quite as fast as Fibre Channel and SAS, but the delta is much smaller, making SATA acceptable for workloads that mainly generate sequential IO.

The common RAID implementations and their impact on storage performance. RAID 10 provides the best performance for random workloads. RAID 5 and 6 provide good performance for sequential workloads; in some cases RAID 5 may actually be faster than RAID 10, although this isn't the norm.

The workload. The mix of random and sequential reads and writes has a major impact on performance, with writes putting the biggest load on the disk drives.

We also showed how the formula below can be used to determine the number of array groups needed to meet an application's IOPS requirement, based on the disk drives and RAID level you choose.

(TOTAL IOPS × % READ) + ((TOTAL IOPS × % WRITE) × RAID Penalty)
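
As a quick illustration, here is a minimal Python sketch of that calculation (the function and argument names are my own, not from the earlier entries):

    def backend_iops(total_iops, read_pct, write_pct, raid_penalty):
        # Back-end (disk-facing) IOPS implied by the write-penalty formula above
        return (total_iops * read_pct) + (total_iops * write_pct * raid_penalty)

    # The workload described later in this post: 2,500 IOPS at 75% read / 25% write
    print(backend_iops(2500, 0.75, 0.25, 4))   # RAID 5 (penalty 4)  -> 4375.0
    print(backend_iops(2500, 0.75, 0.25, 2))   # RAID 10 (penalty 2) -> 3125.0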

We left off pointing out that while this information is valuable, it leaves out some of the challenges we face when architecting solutions in the real world. Two factors we have not yet considered are capacity and cost. The majority of the time we start building our solution based on the capacity requirements.

For example, an organization might need 10TB of capacity to support a new application with a random workload consisting of 75% reads and 25% writes with a peak IOPS load of 2,500. The capacity will be added to an existing array that supports both SAS and SATA drives.

Since this is a random workload we will be recommending SAS drives, but we aren't yet sure whether this needs to be a RAID 10 or a RAID 5 configuration. We could use RAID 6, but since we will be using 450GB drives and our array has multiple hot spares, we think that RAID 5 will provide suitable protection for the data.

First we will look at the capacity requirements for each RAID level.

Capacity                  RAID 5      RAID 10
---------------------------------------------
RAID Group Size           8+1         4+4
Usable Capacity (GB)      3,600       1,800
Required RAID Groups      3           6
Total Capacity (GB)       10,800      10,800

Using 450GB drives, we need twice as many RAID 10 groups as RAID 5 groups to meet the capacity requirement.
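
If it helps to see the arithmetic, here is a rough sketch of the capacity math (treating the 10TB requirement as 10,000GB; the variable names are mine):

    import math

    drive_size_gb = 450
    required_gb = 10_000   # the 10TB requirement, assumed here to mean 10,000GB

    # Data drives per RAID group: 8 of 9 for RAID 5 (8+1), 4 of 8 for RAID 10 (4+4)
    layouts = {"RAID 5 (8+1)": 8, "RAID 10 (4+4)": 4}

    for name, data_drives in layouts.items():
        usable_per_group = data_drives * drive_size_gb       # GB of usable space per group
        groups = math.ceil(required_gb / usable_per_group)   # whole groups needed to reach 10TB
        print(name, usable_per_group, groups, groups * usable_per_group)
    # RAID 5 (8+1):  3,600GB per group, 3 groups, 10,800GB total
    # RAID 10 (4+4): 1,800GB per group, 6 groups, 10,800GB total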

Now we will take a look at the cost. We are using the same size drives for each configuration, so the cost per drive is constant, but we need almost twice as many drives for the RAID 10 configuration. In addition, the number of drives in the RAID 10 configuration will require additional drive trays to be added to the array. In our case we are assuming that the drives are $1,500 apiece and that each tray holds 15 drives at a cost of $10,000 per tray.

Cost                      RAID 5      RAID 10
---------------------------------------------
RAID Group Size           8+1         4+4
Required RAID Groups      3           6
Total Disks Required      27          48
Trays Required            2           4
Cost of Disks             $40,500     $72,000
Cost of Trays             $20,000     $40,000
Total Cost                $60,500     $112,000
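
For reference, the cost figures above can be reproduced with a short sketch like this (the drive and tray prices are the assumptions stated above; the names are mine):

    import math

    drive_cost = 1_500        # per 450GB drive, as assumed above
    tray_cost = 10_000        # per 15-slot tray, as assumed above
    drives_per_tray = 15

    configs = {
        "RAID 5":  {"groups": 3, "drives_per_group": 9},   # 8+1
        "RAID 10": {"groups": 6, "drives_per_group": 8},   # 4+4
    }

    for name, cfg in configs.items():
        disks = cfg["groups"] * cfg["drives_per_group"]
        trays = math.ceil(disks / drives_per_tray)          # a partially filled tray still counts
        total = disks * drive_cost + trays * tray_cost
        print(name, disks, trays, total)
    # RAID 5:  27 disks, 2 trays, $60,500
    # RAID 10: 48 disks, 4 trays, $112,000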

As you would expect, the cost of the RAID 10 configuration is almost twice as high as RAID 5. What may not be as obvious are the performance differences between the two configurations. In the past we focused on comparing a single RAID group of each type, keeping the number of drives constant. In this case two things have changed.

1. I’m using an 8+1 array group rather than a 7+1. 8+1 is the RAID 5 configuration recommended by the manufacturer because of the way it aligns with the caching mechanisms of the array. In addition an 8+1 provides sufficient availability and rebuild times while making better use of the raw space.

2. In this real-world configuration I have twice as many RAID 10 groups and therefore a lot more disks, and hence more raw IOPS.

Using 185 IOPS per drive we find that the two configurations have the following characteristics.

Performance               RAID 5      RAID 10
---------------------------------------------
IOPS Per Drive            185         185
# of Drives               27          48
Raw IOPS                  4,995       8,880

We can now use our formula to determine if either solution will meet our requirement.

Performance               RAID 5      RAID 10
---------------------------------------------
Required IOPS             2,500       2,500
Percent Read              75%         75%
Percent Write             25%         25%
RAID Penalty              4           2
Adjusted IOPS             4,375       3,125

In our example both RAID 5 and RAID 10 meet the performance requirement. Although the RAID 5 configuration is tighter, it has a reasonable amount of headroom. Given the major difference in cost it is probably reasonable to proceed with the RAID 5 configuration.
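
To tie the raw and adjusted numbers together, the headroom check might look roughly like this (a sketch only; the function and variable names are mine):

    def backend_iops(total_iops, read_pct, write_pct, raid_penalty):
        # Back-end IOPS the disks must service, per the write-penalty formula
        return total_iops * read_pct + total_iops * write_pct * raid_penalty

    iops_per_drive = 185
    required_iops = 2_500

    configs = {
        "RAID 5":  {"drives": 27, "penalty": 4},
        "RAID 10": {"drives": 48, "penalty": 2},
    }

    for name, cfg in configs.items():
        raw = cfg["drives"] * iops_per_drive
        needed = backend_iops(required_iops, 0.75, 0.25, cfg["penalty"])
        print(name, raw, needed, "meets requirement" if raw >= needed else "falls short")
    # RAID 5:  4,995 raw vs 4,375 needed - meets the requirement, with modest headroom
    # RAID 10: 8,880 raw vs 3,125 needed - meets the requirement, with lots of headroom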

Looking at our results you may say, "Well, the RAID 5 configuration may work, but wouldn't the RAID 10 design be a lot faster?" Not necessarily. If the speed limit is 55 and you must drive the speed limit, a Ford F-150 and a Ferrari will both get you there in the same amount of time. It is the same with storage: just because one configuration could run faster doesn't mean it will - you have to be able to drive the higher IOPS from the host.

The area that we will explore in our next entry is cache. While cache improves performance in general, it is particularly beneficial for parity-based RAID configurations.



Wednesday, December 1, 2010

Data replication options - Part One (Overview)

As mentioned previously, I'll be going over different approaches to data replication between now and the end of the year.  In this post I'll outline the different points in the data path where replication can occur and then follow up with a post per point (say that fast three times!) outlining pros and cons.


Before touching on the different points it is important to define some terminology.


Using a broad brush, all replication solutions may be divided into either synchronous or asynchronous.  Synchronous replication means that the write must be received at the remote location before the local host receives acknowledgement.  Asynchronous replication means that the local host receives acknowledgement once the write is complete at the local site, and replication of the write is handled, well, asynchronously.
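
As a deliberately simplified illustration (not modeled on any particular product), the difference comes down to when the acknowledgement is returned to the host:

    from queue import Queue

    local_disk, remote_disk = {}, {}       # stand-ins for the storage at the two sites
    replication_queue = Queue()            # writes waiting to be shipped to the remote site

    def synchronous_write(block, data):
        local_disk[block] = data
        remote_disk[block] = data          # the host waits for the remote copy to complete...
        return "ack"                       # ...before it ever sees the acknowledgement

    def asynchronous_write(block, data):
        local_disk[block] = data
        replication_queue.put((block, data))  # shipped to the remote site in the background
        return "ack"                          # acknowledged as soon as the local write lands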


Two quick comments:

  1. Using asynchronous replication means that there will be some data loss in the event of a disaster.  How much depends on several factors including the replication technology in use, the bandwidth between the sites, the change rate of the application(s) in question, and so on and so forth.
  2. In practice, most people use some form of asynchronous replication.  The biggest reason?  Synchronous replication introduces latency into every write - it has to traverse the link between locations and then an acknowledgement has to be sent back.  If the sites are close enough and the budget will support it, then synchronous is possible - but it's expensive.

Replication solutions can be further divided into continuous or discontinuous.  Continuous means that for every write generated at the primary site the same write is performed at the remote site.  Synchronous replication is, by definition, continuous.  Asynchronous replication may be either continuous or discontinuous; snapshot-based replication is an example of a discontinuous approach.  Discontinuous replication can lead to significant savings in the bandwidth required between sites if the application in question writes to the same location repeatedly, since only the data as it exists at the scheduled time is replicated and the intermediate writes can be disregarded for replication purposes.
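
A highly simplified sketch of why this saves bandwidth - if only the state at the scheduled point in time is shipped, repeated writes to the same block collapse into a single transfer (everything here is illustrative, not any vendor's implementation):

    # Track only the latest contents of each block written since the last replication cycle.
    pending = {}   # block address -> most recent data for that block

    def record_write(block, data):
        pending[block] = data        # a later write to the same block simply overwrites the entry

    def replicate_cycle(send_block):
        # At the scheduled time, ship only the final contents of each changed block.
        for block, data in pending.items():
            send_block(block, data)
        pending.clear()

    # Three writes to block 7 between cycles result in a single replicated write.
    record_write(7, b"v1"); record_write(7, b"v2"); record_write(7, b"v3")
    replicate_cycle(lambda block, data: print(block, data))   # prints: 7 b'v3'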

One last bit of exposition - I'll refer to replication as being write-order consistent (or "crash" consistent) throughout these posts.  This basically means that writes are applied at the remote site in the same order in which they occurred at the source, which is a requirement for the successful replication of databases and other applications.

With that out of the way - here are the five places where replication can occur, going from the highest level to the lowest level.
  1. The Application - most enterprise applications in use today have some form of replication baked in (possibly at the cost of an additional licensing fee).  Oracle Data Guard, Exchange CCR, Microsoft SQL Server log shipping - all have some method of getting the data from point A to point B in a usable form.
  2. The Host - from something as simple as a write-splitter to something as involved as replacing the native volume manager, there are more host-based options for data replication than you can shake a stick at.
  3. The Switch - Several years ago there was a movement to build all storage intelligence (including replication and virtualization) into the switches.  That vision didn't really pan out, but it is still possible to do replication from the switch.
  4. An Appliance - begrudgingly added against my initial biases, there are a number of appliances on the market that can perform replication.  Some involve virtualization of the storage in question and could arguably be classified as array-based, while others involve the installation of write-splitters on the hosts and could be classified as host-based, but I digress.
  5. The Array - Once the domain of only monolithic storage arrays, today most midrange arrays offer data replication solutions that are sufficient for typical needs.
I believe that all replication solutions can be grouped into one of the five categories above.  Coming up next - Application-based replication.
