Most organizations have in place what they consider to be a data center disaster recovery strategy and plan. The stark reality, however, is that few are actually ready for the worst, and many are ill-prepared. From the failure of a single application to the loss of the organization’s entire IT functionality, a plan should be in place to handle any eventuality and restore operations as soon as possible.
You can listen to Jeff’s insight in the player above, and the full transcript of our discussion is below.
Kevin O’Neill, Data Center Spotlight: This is Kevin O’Neill with Data Center Spotlight, and I’m here with Jeff Gilmer of Excipio Consulting. Jeff has been joining us for a series of data center, IT infrastructure, and cloud topics. They’ve all been pretty interesting, and I know you folks have been enjoying them. We’re on to another important topic today, and that is, how do you go about right-sizing your data center disaster recovery strategy? What are the steps that you need to take to get your DR right? And Jeff, as always, good to have you with us today.
Jeff Gilmer, Excipio Consulting: Yeah, thank you, Kevin, nice to be here, and I hear you were at a conference last week, how did that go for you?
Data Center Spotlight: It went pretty well. I know you’re heavy into the data center conference season, the desert of Nevada transitioning to the fall pollen season in Atlanta hasn’t been kind, but we’ll muddle through this and I’ll try not to offend the listeners with my scratchy voice today, Jeff.
Jeff Gilmer: [LAUGH] Okay, that sounds great.
Data Center Spotlight: Jeff, we’re talking about data center disaster recovery plans, and data center disaster recovery strategies today, so I guess the first question would be, could you define for us, what is a data center disaster recovery plan?
Jeff Gilmer: Yeah, so the way most people look at it, there are typically three types of plans that you have within a business organization to recover the services or functions that you’re providing out to your customers, to clients, to whoever it may be. Emergency management plans focus on recovery in just the facility aspects of it: a building has a fire, where do we go, what do we do? Business continuity talks about the things around recovering the business from a continuity perspective in the case of a disaster, things such as calling trees, and how do you interface with the public, and where do our employees go, and those functions.
What we’re going to talk about today is disaster recovery, where the key word is recovery. So after the disaster, how do we recover, and do we have a plan in place that we can test and validate that gives us the ability to recover once a disaster occurs? And that disaster could be anything from a single application going down, where we lose our primary functions or services or even the website that our clients use to interface with us, to a complete disaster where we lose all of our IT functions. And again, here we’re talking about data center disaster recovery, so we’ll focus on the functions within the data center itself, and the ability to recover.
Data Center Spotlight: Okay, well, good. How do you determine what services are needed to cover all the eventualities for a disaster recovery plan?
Jeff Gilmer: Well, that can be quite a daunting task when you start to look at all the different services that an organization may be providing, and where they act, but what you really need to do is stop and try to reconnect and identify all the services and functions that that business provides. You really want to evaluate and make sure you have a complete listing of both services and functions. Once you have that, then the second phase is, you really need to identify which of those services are the most critical. And you may give them a ranking of 1 to 100 for your 100 services, but you may say, these 10 or these 12 are the most critical, then these next 7 are kind of secondary, and then our third level might be these next 25, but you end up with some sort of categorization or buckets of services that you can define some sort of criticality with. So once you get through that, you also need to understand what applications support those services, or the functions that your business is providing. So you do need to have an application inventory as well, to really have the information to go through the process.
Data Center Spotlight: Well, I’m wondering, Jeff, it seems like organizations do a good job defining the requirements they need for their production data center, and their production IT infrastructure in general. What about defining the requirements they need for disaster recovery?
Jeff Gilmer: Yeah, that’s always an interesting discussion when we get into it, but there are key factors that are out there in every business and every organization, whether they realize it or not. You don’t want to get into the very subjective items here. You want to be very objective. So, let’s say you’re a public sector organization, the state, county, federal, municipality, any of those organizations. You’re probably going to have state statutes that apply to you, or you look at regulatory issues that you may have to comply with. There are many different types of regulatory issues related to utility companies, financial companies, insurance companies, anybody that’s dealing with the public, or some sort of customer service area. You get into compliance issues, and compliance has become almost the predominant area within every business, whether they realize it or not.
If a business has a website, if that business performs any sort of transactions over that website, if they’re accepting any sort of monetary value, you get into areas such as compliance around PCI, or you might have HIPAA, you might have CJIS, you might have FISMA. There are multiple types of compliance issues that might be driving the criticality of a particular service or function that you’re providing. You’re going to have general business requirements: what are those services or businesses that are driving the highest amount of revenue, or the greatest amount of risk, or the least amount of risk, what really limits your risk? So you can start to look at your different business requirements, and probably one of the bigger ones that we find a lot of organizations miss is their insurance obligations. If you really read through your insurance policies, which I know most people don’t like to do, they’ll put you to sleep fairly quickly, but inside your insurance policies there are likely disclaimers saying that if you’re not retaining certain amounts of data, or if you can’t recover within certain amounts of time, or you don’t have a particular sort of plan in place to handle the disaster under the disaster recovery plan, it may void any payment by your insurance company. So you really want to understand the legalities of what’s covered and what’s not covered in relationship to insurance. But again, any of those areas can really help you define quantitative requirements. Legal issues, regulatory issues, compliance issues, insurance issues, all very common today.
Data Center Spotlight: All right, so you make some good points about the insurance and some of the other requirements, and some of the compliance requirements, but setting those aside, and those are certainly important, and they should be part of the mix, but just in general, as far as best practices go, what sort of goal should be placed on getting back to normal from a DR perspective?
Jeff Gilmer: Well, within the industry, probably the most common are what are called the recovery time objective, and recovery point objective. So, you really sit down and define how soon do you want to be able to recover, and more than want to be able to recover, what is your ability to recover a particular service, or a particular function within the business? That’s your recovery time objective. It might be that we’re going to recover this in four hours, we’re going to recover our secondary ones in 24 hours, our third ones in 48 hours, anything beyond that, we’re going to recover in five days. So you have a recovery time objective, and then you have a recovery point objective, so how far back in time are you going to go, or what is going to be the functionality level that you’re going to recover for that particular service? When you have an issue, you’re not likely going to go back seven years and recover seven years of that issue. What you’re really going to do is you’re going to say, hey, we’ve got enough data to get us back functioning within the last 90 days, so we need to recover X service within 90 days’ worth of data. So that’s the data that you’re going to retain in the ability to recover. So again, two key things. Recovery time objective, which is also known as your RTO. Recovery point objective, which is your RPO. And you’re going to want to map each one of those services to the RTO and the RPO, and then we’ll talk about this in a little bit here. What other dependencies need to be required to actually have the ability for your business to meet those RTOs and RPOs?
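The mapping of services to RTOs and RPOs described above can be sketched in a few lines of Python. This is only an illustration, not a method Jeff prescribes: the tier-to-RTO hours (4, 24, 48, then five days) and the 90 days of data come from his examples, while the service names, tier assignments, and the 30-day figure are hypothetical.

```python
from dataclasses import dataclass

# RTO per criticality tier, in hours, using the example figures from the
# discussion: 4 h, 24 h, 48 h, and five days for everything else.
RTO_HOURS_BY_TIER = {1: 4, 2: 24, 3: 48}
DEFAULT_RTO_HOURS = 5 * 24


@dataclass
class Service:
    name: str      # hypothetical service name
    tier: int      # criticality tier, 1 = most critical
    rpo_days: int  # days of data that must be recoverable


def rto_hours(service: Service) -> int:
    """Look up the recovery time objective for a service's tier."""
    return RTO_HOURS_BY_TIER.get(service.tier, DEFAULT_RTO_HOURS)


# Hypothetical services and tier assignments, for illustration only.
services = [
    Service("customer-website", tier=1, rpo_days=90),
    Service("payroll", tier=2, rpo_days=90),
    Service("internal-wiki", tier=4, rpo_days=30),
]

# Map every service to its (RTO in hours, RPO in days of data).
recovery_targets = {s.name: (rto_hours(s), s.rpo_days) for s in services}

for name, (rto, rpo) in recovery_targets.items():
    print(f"{name}: RTO {rto} h, RPO {rpo} days of data")
```

Keeping the tier table separate from the service list means a change in policy, say tightening tier 2 from 24 to 12 hours, is a one-line edit.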
Data Center Spotlight: Okay, well, Jeff, when you look at the data center disaster recovery plans that most organizations have, how complete do you find them to be? Are companies and organizations, both corporate and enterprise, and certainly government agencies, doing a good job with their DR plans?
Jeff Gilmer: Well, that’s a really interesting question, and it’s one that when we start to engage with a client, and we start to look through their plan, we find a wide variety of answers to that. Some common questions that we like to ask, and you probably want to ask yourself are, is your disaster recovery plan clearly defined? Has it been tested? Has there been a real occurrence where you had to recover, and how did that go? Do you have the capacity to recover, and when we mean capacity, we’re not just talking from an IT side, server storage and networking from a data center or facility, we’re talking, do you have the people and the resources to recover, above and beyond the people and resources that are running the day to day operations?
I’ll give you an interesting story. We sit down with clients, and we talk about their testing, and their ability to recover, and going through their plan, and the conversation seems to always start out like this. Well, we plan to do our test in August, so in April, we start preparing for it. We start putting everything in place, and we start preparing for the test. And we went through all the documentation, and we got all our resources together, and made sure we had all the equipment. And by the time June rolled around, we were ready to really do a second look at it, in July we did another validation of the plan, and then in August we tested it. So in reality, they started in April, and they tested it in August, so over the course of five months, they were able to recover. They said the recovery went great, we tested it and we were able to recover the whole thing, all the parts of the puzzle came into place just perfectly for us, right?
The biggest issue is, when you have a disaster, you don’t have five months to plan for the ability to recover. You have five hours, or you have minutes, and that’s where I think people make a significant mistake when they really talk about their testing and their plan. So we start to look at that plan, and you really want to do kind of a gap analysis. You want to compare the current criticality of the applications and services that you just identified to your current plan. A lot of people have disaster recovery for 100% of everything they have. Well, does that really make sense? You need to identify the differences between the criticality and your current plan. You need to define your current risks, and risks change all the time. A company that has one data center, and now has only a second data center, faces different risks than a company that has 27 data centers out there. You also need to determine what’s called a mitigation plan. What if that disaster recovery plan doesn’t work? How are you going to mitigate the risks? What are the other things you’re going to look at? And then the last thing is, in today’s world, look at your potential options. There can be potential facility options, from basic floor space, to colocation space, to active-active type environments, to managed, to outsourced, and even places where you’re not even considering the facility, you’re considering the ability to recover, such as a cloud solution. So, all of those really need to be looked at, as you start to look at the risks and the gaps between your current DR plan and what you really need.
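One way to picture the gap analysis is as a comparison between the recovery time each service needs and the recovery time the current plan has actually demonstrated. The Python sketch below assumes both are expressed in hours; the service names and every number in it are hypothetical, and a real analysis would also cover risks and mitigation, which this leaves out.

```python
# Hypothetical required recovery times (hours), derived from criticality.
required_rto_hours = {
    "customer-website": 4,
    "payroll": 24,
    "internal-wiki": 120,
    "reporting": 48,
}

# Hypothetical recovery times (hours) the current plan achieved in testing.
current_plan_rto_hours = {
    "customer-website": 24,
    "payroll": 24,
    "internal-wiki": 4,
}


def gap_analysis(required: dict, current: dict) -> dict:
    """Return a finding for each service whose plan does not match its need."""
    findings = {}
    for service, need in required.items():
        have = current.get(service)
        if have is None:
            findings[service] = "not covered by the current plan"
        elif have > need:
            findings[service] = f"under-protected: recovers in {have} h, needs {need} h"
        elif have < need:
            findings[service] = f"over-protected: recovers in {have} h, needs only {need} h"
    return findings


gaps = gap_analysis(required_rto_hours, current_plan_rto_hours)
for service, finding in gaps.items():
    print(f"{service}: {finding}")
```

The over-protected findings matter as much as the under-protected ones, since they point at spend that could move to services that are falling short.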
Data Center Spotlight: All right. Well, Jeff, when we were talking about people’s primary IT infrastructure and hybrid services, and what they put in the cloud versus what they put on dedicated infrastructure. There’s the whole element of how the infrastructure and the applications interact. I would imagine that would be a factor in your disaster recovery plan as well.
Jeff Gilmer: Yeah, obviously it’s very difficult to have a recovery plan that’s actually going to allow you to recover if you don’t understand all the dependencies, and as most of us know, a business service is not typically run by one application. You might have a primary application that is functioning for that particular business service, but that application probably ties back to the facility server, it probably ties back to the database server, typically to a web server or some sort of front end source, so people can access the application. So you’ve got multiple dependencies within that particular application.
A lot of times, when we go in and talk to our clients, our first question is, do you have an application inventory? And what you find is that the group that operates a particular application understands that application, but they don’t understand any of the other applications functioning within that IT structure. So you need to really understand all of the applications that are functioning, and then you need to understand which applications are dependent upon other applications, and map those applications. Next, once you have all those applications mapped, you have to identify what infrastructure is running those applications. So, you might have one server instance running application B, server instance D running application B as well, or applications E, F, and G, and you really need to gather that data so that you map, for that application, what dependent applications there are. For that primary application, what server instance is supporting that application? How much data is supporting that application? How much storage do I require? How much network bandwidth do I require? And then you need to understand that not only for the primary application, but for all of the secondary application dependencies that are supporting it, and all the infrastructure that it requires, and it can get pretty complex. There are ways to compile it and put it together. If you have a proper database, there are different solutions out there where literally you can select application A and you can see that its dependent applications are D, E, F, and G, and they’re on server instances 1, 3, 7, and 9, and they require X, X, X and Y of storage, and you literally bundle that together, and know that that bundle is what you need in your disaster recovery site.
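The "select application A and see its bundle" idea can be sketched as a walk over a dependency map. The Python below is a minimal illustration, not any particular product: the application letters and server instance numbers echo Jeff’s example, while the storage figures, the dependency edges, and the `recovery_bundle` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class App:
    name: str
    server_instance: int  # server instance hosting the application
    storage_tb: float     # hypothetical storage footprint, in TB
    depends_on: List[str] = field(default_factory=list)


# Hypothetical application inventory with dependency edges.
inventory: Dict[str, App] = {
    "A": App("A", 1, 2.0, ["D", "E"]),
    "D": App("D", 3, 0.5, ["F"]),
    "E": App("E", 7, 1.0, ["F", "G"]),
    "F": App("F", 9, 4.0),
    "G": App("G", 9, 0.5),
    "Z": App("Z", 12, 8.0),  # unrelated application, must not appear in A's bundle
}


def recovery_bundle(root: str, inventory: Dict[str, App]) -> dict:
    """Collect an application plus all direct and indirect dependencies."""
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited; also guards against circular dependencies
        seen.add(name)
        stack.extend(inventory[name].depends_on)
    return {
        "apps": sorted(seen),
        "server_instances": sorted({inventory[n].server_instance for n in seen}),
        "storage_tb": sum(inventory[n].storage_tb for n in seen),
    }


bundle = recovery_bundle("A", inventory)
print(bundle)
```

The same walk could total network bandwidth or any other per-application figure; storage is used here only because it is the simplest to add up.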
Data Center Spotlight: It doesn’t sound like it’s any less complex than your primary IT infrastructure.
Jeff Gilmer: It’s not, but Kevin, I will tell you, and don’t take this wrong, but there are many organizations that don’t even have their primary production mapped out to even begin to identify what’s required for disaster recovery.
Data Center Spotlight: All right, well, I’ll try not to be offended by that, Jeff. [LAUGH] Let’s talk about the DR data centers themselves, and really the infrastructure in those data centers. I’ve been in some data centers, some DR data centers, where everything just looks sort of past its prime, and a joke I’ve heard from multiple people, when in some of these DR data centers is, hey, the 80s called, and they want their data center back. On the other hand, I’ve been in a couple of DR data centers that seemed to mirror the company’s production facility, and neither one of those seems optimal to me, Jeff. How should companies approach the technology requirements of their disaster recovery data center?
Jeff Gilmer: Yes, that’s very true in the range that you’re discussing, and everything in between. I think from our experience, the real issue is, is they focus on that facility, and they focus the site of the facility, and how far away it should be, and what’s our network capacity there, and what’s the distance, and we’re in a hurricane area, so we need to get out of this, so is it in a tornado region, or is it in the country? Is it in the city? And you really need to start with the technology before you start with the facility.
The key is understanding what is required for that facility. Having the right facility, a facility that is right-sized and of the right quality, is many times more important than where it’s located, what type of network connectivity do I have, what’s the geography of it. And not that those aren’t important, they are very important, and we can talk about those a little bit more here in a minute, but what’s really required is, once you have those services that we’ve talked about mapped back to those applications, and you have those application dependencies, and now you have that mapped back to your infrastructure, you can clearly define how many physical servers and server instances you need in that secondary site.
How much storage do I need, and what type of storage do I need? How much network connectivity do I need, and what type of bandwidth do I need, and what type of redundancy do I need for that network? And probably most important is, what am I going to need for staffing? How much demand am I going to have on my resources to recover from that data center? And people miss the resources, even simple things like, do I need to provide a security guard at the shipping dock, if we have a disaster, for new equipment that’s going to arrive? Do I need to provide people to do racking and stacking and construction of physical hardware? Let alone the fact that those people are trying to run the day-to-day operations, and they’re trying to recover certain applications. So, resources are probably the biggest one that people miss, and one of the most significant, so you need to spend the time. Now, once you have a clear roadmap of what you need for servers, storage, network, and resource requirements, you can go look for a facility, and try to define that facility.
Data Center Spotlight: And Jeff, you’re bringing us back from the infrastructure, to the facility that houses that infrastructure, and again, in my experience, and you’ve got a lot more than I do, but my experience, the future capacity and facility requirements of a disaster recovery data center, those plans for the facility itself don’t always seem as well thought out and properly planned as the production data center. Your people have sort of a fine toothed comb for what they need from their production data center as far as the physical facility that houses the infrastructure, but not necessarily for DR data center. Do you find that to be the case?
Jeff Gilmer: Yeah, again, they’ve had a failure, they’ve had an instance, somebody at an executive level says, we need to have the ability to recover, so they may look at how cost-effective it is, and how convenient it is. They may find, oh, here’s something wholesale that’s under market at 25 cents on the dollar, rather than taking that infrastructure that you’ve now defined, which came from your mapping of your applications, which came from understanding your business services, a very logical format here, and going out with that and finding a data center that has that true capacity, that has the redundancy. And redundancy in a data center, well, a lot of people may think of, oh, like power from two different sites. It’s more a question of, do I have network connectivity? You realize, your production data center can fail, but you can still get email on your phone from a secondary data center today, right? With network bandwidth, there’s a lot of flexibility with disaster recovery sites that we didn’t have 10 years ago. They look at the geography, they look at the distance, they look at the location.
From that perspective, distance is one we get asked about a lot. For distance, some people go by mileage, and a lot of times we’ll say a minimum of 11 miles, but if you’re in a hurricane area, if you’re in the southeast or Gulf of Mexico corridors, 11 miles isn’t going to do much in the case of a hurricane. You’re going to need a greater geographical distance there. 75 miles is kind of a good secondary one. And the reason we look at 75 miles is, again, resources. Getting your people to that recovery data center is going to be your difficult issue, getting people there to get everything functioning and up and operating in that facility. Well, at 75 miles, in most cases, you can get to a secondary data center, if there’s a major regional outage, without having to use public transportation, without having to fly in through the airport. You realize, if there’s a major disaster, the airport is the first place that’s going to be clogged, and then your trains, and your buses, and other things, those are all going to be issues. So, that’s another thing to consider. And then look at the location. Make sure that that location is not a high-risk location. One thing that we are seeing quite a bit today out there in the data center world, Kevin, is what do I do with my aging data center? I bet you get asked that question frequently.
Data Center Spotlight: Yeah, yeah, the whole issue of refresh is certainly a big deal.
Jeff Gilmer: Yeah, and what we’re finding is, financially, they’ve got maybe some significant amount of money still on the books in depreciation or other factors that the CFO is not going to want to take a write-off on. What we’re also finding is that they’ve outgrown the capacity, not necessarily physical space, but power demand, very high-density power, or high-density cooling. And they want to get out of that current data center and move their production to another location. Great, but you know what, that facility is on your current campus today, so a great scenario that we’re seeing is, move that production off campus. Locate it 75 miles away. If everything on your campus goes down, your people can still go to their home PC, they can use their phone, and they can access that production data center that’s up and running 75 miles away, as long as you still have connectivity. Now, let’s say that your production data center goes down, and you need to recover from your disaster recovery site. Well, if you’re reusing that old data center that’s on your campus, again, your biggest issue is getting people there, and your people are on that campus. Your people are there to function and operate and bring up the equipment from a disaster recovery standpoint right there at that location. It’s a great way to reuse your old, aging data center, a great way to reuse your assets, and you don’t have to take a financial hit. Your resources and your people are there, and yet, by moving your production off site, it’s probably less risk than having it on site, because now there are two sites that your business would have to take a hit on before you’d be nonfunctional.
Data Center Spotlight: Yeah, I mean, that’s a great solution, Jeff, and also, a big issue with people having an onsite data center, an on campus data center, is frequently security, so you’re sort of solving the security issue by having maybe a new purpose-built site for your new data center, and you can take care of all those shortcomings while still solving the real estate issue of your old data center, or what to do with it.
Jeff Gilmer: Yeah, absolutely, and another piece if you want to take it a step further is, a lot of times, we’ll put the test dev environment in that disaster recovery data center, knowing that that test dev equipment can be migrated to DR equipment as a backup. Now, you’re not going to put your primary load on your test dev equipment, and what we mean by primary, I’ll give you a little bit of statistics here. Most companies are seeing that somewhere between 15% and 17% of their infrastructure and applications need to be in an active-active setting, to recover within 24 hours or less from some sort of active-active environment. We’re seeing another 15% to 20% that typically have to follow within 48 hours. Now, that 48-hour group could be test dev equipment, where virtual instances are converted over, allowing you to recover. And anything beyond that is really 5 days or longer, where you’re going to order new equipment if you truly have a failure. So when you’re constructing that recovery data center, we’re really talking about somewhere in the range of 30 to 35% of your infrastructure being in your DR site. You can have some of that be test dev to be brought up another time, and if test dev is on your campus, it allows you to have your resources there, with some of the flexibility to bring up and bring down sessions fairly quickly to allow your test dev to function, but also the ability to convert them over in the case of a disaster.
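To make those percentages concrete, here is a rough Python sketch that turns them into server instance counts. The percentage ranges are the ones Jeff quotes; the production footprint of 200 server instances and the `dr_sizing` helper are hypothetical. Note that adding the two upper bounds gives 37%, slightly above the 30 to 35% rule of thumb, so the totals should be read as an approximate range.

```python
import math

# Percentage ranges quoted in the discussion, as (low, high).
ACTIVE_ACTIVE_PCT = (15, 17)  # must recover within 24 hours or less
WITHIN_48H_PCT = (15, 20)     # must recover within 48 hours, e.g. on converted test/dev


def share(total: int, pct: int) -> int:
    """Number of instances in a percentage share, rounded up."""
    return math.ceil(total * pct / 100)


def dr_sizing(total_instances: int) -> dict:
    """Estimate low/high instance counts needed at the DR site."""
    active_active = tuple(share(total_instances, p) for p in ACTIVE_ACTIVE_PCT)
    within_48h = tuple(share(total_instances, p) for p in WITHIN_48H_PCT)
    total = (active_active[0] + within_48h[0], active_active[1] + within_48h[1])
    return {
        "active_active": active_active,
        "within_48h": within_48h,
        "dr_site_total": total,
    }


# Hypothetical production footprint of 200 server instances.
plan = dr_sizing(200)
print(plan)
```

Everything outside these two groups falls into the five-days-or-longer category, where new equipment is ordered after the failure rather than held at the DR site.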
Data Center Spotlight: Jeff, that’s some interesting stuff, and I know we can’t talk about some of the specific things you do for clients, but I think some of those creative solutions you just shared with us speaks to the experience that you have and some of the complexity of the situations that you get into with your clients. If someone wants to reach out to you to discuss these matters further, what’s the best way to get in touch with you, Jeff?
Jeff Gilmer: Well, the easiest way to access my contact information is our website, www.excipio.net. We also have links to these podcasts and others, links to different seminars and areas where we’re doing presentations all across North America, that people are invited to attend. And we also have several different sorts of case studies, client references, and solution suites out there that will give you some additional information.
Data Center Spotlight: It’s a pretty resource heavy website, and Jeff, that’s how people can reach out to you. Let’s just wrap this up, what’s a couple of things you’d like people to take out of our discussion today that are most important?
Jeff Gilmer: Well, the biggest thing that we get asked is, how do I know my data center recovery plan, my DR plan, is actually going to work, that I’m going to be able to recover? And it goes back to that word, plan. You really need to plan and use objective steps along the way. Objective steps of understanding, what are the critical things that we have within our business or organization? And you have actual concrete drivers out there that determine if something’s critical. Regulatory issues, compliance issues, whatever it may be. And from there, come to realistic recovery times. Realistic meaning not what whoever screams the loudest wants, but what does the business really need, and what are we capable of doing from both a financial and a resource standpoint? From there, you can identify the technology resources required, you can identify the facilities that are required, and you can put in place a true disaster recovery plan that’s going to function for your business.
Data Center Spotlight: Okay, well terrific, Jeff. This is a very informative session, as they all are, but just some of the complexities you’ve brought up, and some of the creative solutions you’ve brought up I find to be pretty interesting.
Jeff Gilmer: There’s always more to find, right?
Data Center Spotlight: That’s right. I look forward to our next visit Jeff, thank you so much, and again, folks can reach out to you at excipio.net if they’d like to engage with you.
Jeff Gilmer: Great, Kevin. Always a pleasure to speak with you, Kevin.
Data Center Spotlight: All right. Thanks, Jeff.