Key Concepts and Best Practices for OpenShift Virtualization

From Worst-Case to Best Practice: Disaster Recovery Planning with Rodolfo Casás

Transcript:

Pete Wright:
From bustling financial districts to quiet suburban offices, disaster can strike anywhere, anytime. But what happens when the unthinkable occurs? How are you supposed to bounce back from the brink of data collapse? Today we dive into the world of disaster recovery, exploring the critical differences between DR, backups, and high availability. We’ll guide you through the essential steps to build a robust disaster recovery plan with the help of our very own senior solutions architect, Rodolfo Casás. I’m Pete Wright, and this is Trilio Insights.
Rodolfo, I have to tell you that I come to this conversation with maybe a little bit of shame. Because years ago, when I was young, I was working in a small office that was entirely operated by terminal services on a single server, and we did not have a robust disaster recovery plan. We didn’t know what it was. We lost the server. And man, that was painful. I feel like I’m not alone in coming to this conversation where you are to teach me everything that I should have known 25 years ago about this concept. Hello, sir. It’s good to see you.

Rodolfo Casás:
Hello. Nice to see you again. That’s a great start for this conversation.

Pete Wright:
I think so. It makes all the little hairs stand up on the back of my neck. The pain is so real and so present. So let’s start. Set the table for us. What are we talking about when we talk about these concepts? Let’s not assume anything. We don’t want to assume we’re all coming to this from the same perspective. I’m certainly not. So set the table on some definitions we need to understand before we start the conversation.

Rodolfo Casás:
Yes. So we are going to explain here different concepts. Backup is one of them. Mostly everyone understands that you need a backup at home, at your company, and you need that. But there are certain environments where you have certain workloads or certain services that are critical and you cannot allow anything to happen to them. So you need something that can support the failure of any of the components of that architecture. That’s called high availability.
If anything happens to any component, it will be automatically fixed or the traffic will be rerouted. It was very interesting for me to learn the difference between the legacy approach, where we had multimillion-dollar servers that were really expensive and never supposed to fail, and the cloud-native world, where you can have pizza-box rack servers. The philosophy changed from “it will never fail” to “anything can fail at any time, anything.”

So you have to bear that in mind when you build your infrastructure: every component has to be redundant, or you have to plan for any failure, and then you get high availability.

Pete Wright:
Well, that’s what I was going to ask. Is redundancy synonymous with high availability now?

Rodolfo Casás:
Yes, for certain components it is. You need a monitoring solution, and then something needs to happen that solves the problem, so your application, your workload, can continue serving its purpose.

Pete Wright:
So high availability where we have appropriate redundancy in place. We know that our systems are going to be running reliably because we’ve thought about all the components that can fail and probably will fail so that we can catch them when they do. And backups, I always go back to two is one, one is none in terms of backups, so we know that our data is being backed up somewhere so that we can recover if our data center falls into a lake.

Rodolfo Casás:
Yes. For example, if you have backups but you don’t have high availability and you lose that server you had in your company, you will have to buy another server, reinstall the operating system, restore the applications, restore the backups, and then you can be up and running. I don’t know how long it took you, but maybe 12 hours, one day, one week. Now, if you have a high availability setup, on some setups that I’ve seen there can be no downtime at all.

On others, it’s 50 minutes, or five minutes. It depends. The cost of the solution you’re using, or of the infrastructure you’re building, is very important as well. It also depends on a topic we’re going to cover later: how much downtime can you suffer without losing money, customers, or reputation? And how much data can you lose? If you know your applications, you can determine those two things, and then you can decide how much money you’re going to put on the table to cover those failures.

Pete Wright:
So talking now about disaster recovery, it seems like you’re describing a couple of different budgets in my head. There’s the actual monetary budget that we’re going to put toward being able to restore, and there’s the budget for uncertainty and downtime that we need to consider. Somewhere in the middle, there’s going to be a sweet spot for our business, and we need to build a plan around that. How do you go about this process?

Rodolfo Casás:
Yes. So, the difference with high availability: high availability probably will not be able to cover everything that can happen. Of course, there’s the usual: there’s a fire, there’s a flood, there’s an earthquake. But sometimes it doesn’t need to be that. It can be a human error. If you read the statistics, most of the time it’s a human error, and it happens everywhere.

AWS, some years ago, ran something with the wrong scope in a playbook. I don’t remember the details, but they changed something that affected a lot of regions at the same time, and a lot of things went wrong. So, when high availability cannot cover your application: let’s say, for example, you run your application in a perfect high availability environment in Amazon.

Everything is replicated everywhere. You have load balancers all over the place. But it is AWS itself that fails. Then high availability, in that situation, does not help you. You need a disaster recovery solution to move to another cloud. And the same could happen if you’re running applications on premises.

You have everything redundant, everything well planned, everything well architected, but your on-prem data center loses power, so you have to move somewhere else. That is another disaster recovery situation. It can be a hardware failure, a network failure, a security breach, a human error, a ransomware attack, whatever. There are a lot of situations where high availability might not be enough, and that is the difference between HA and DR.

Pete Wright:
What’s the impact of disasters on, shall I say, unprepared administrators?

Rodolfo Casás:
Yes. I have some numbers here. For example, 28% of data breaches involve malware, and 82% of breaches involve human error.

Pete Wright:
That seems to be an important one, 82% involve human error.

Rodolfo Casás:
Ransomware attacks cost an average of 16 days of downtime. And 35% of companies that suffer a ransomware attack lose 35% of their data. The average cost of downtime is $1,400 per minute, and that is just the average. Then 96% of businesses experienced an outage in a three-year period. So that’s everyone. Everyone will suffer downtime or an outage.
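
To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python. The two inputs are the averages quoted in the conversation; everything else is plain arithmetic, and a real incident’s cost will of course vary widely.

```python
# Back-of-the-envelope math using the averages quoted above.
COST_PER_MINUTE = 1_400          # average downtime cost, USD per minute

ransomware_days = 16             # average ransomware downtime quoted above
minutes_down = ransomware_days * 24 * 60

print(f"{minutes_down:,} minutes of downtime")
print(f"~${minutes_down * COST_PER_MINUTE:,} at the quoted average rate")
# Output: 23,040 minutes of downtime; ~$32,256,000 at the quoted average rate
```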

Pete Wright:
Everything fails at some point.

Rodolfo Casás:
Exactly. Yes.

Pete Wright:
Shall we move into building a disaster recovery plan?

Rodolfo Casás:
This is not something where you say, “Okay, I have my backups, so I have disaster recovery.” It does not work like that. You have to perform a risk assessment and a business impact analysis. You have to understand your applications and your geographical architecture. You have to evaluate which of your applications are the most critical. You have to set disaster recovery plan objectives.

And usually a team inside the company, or even the whole company, needs to understand that there is a disaster recovery plan, so that the day a disaster happens, everyone has a document, or the knowledge, to execute that plan. It is a plan, so you have to plan for it; the word itself says it. And then you need budget. You need the budget to do this. And as you said, you need a sweet spot.

The balance is: okay, I can tolerate, for example, two hours of downtime and two hours of data loss with the money I can pay. If you want to lose no data and have no downtime, you will have to spend more money. The disaster recovery plan has to fit within your budget. And then, very importantly, you cannot just plan. You have to test your plan. I mention this on a lot of calls with customers: you can automate your disaster recovery tests.

That’s the best approach. Some enterprises have to run disaster recovery tests quarterly, or every six months. The faster they can do that, the faster they get a report of what went okay and what went wrong. For example, some banks have to be up and running in less than 24 hours. Were you able to restore everything in less than 24 hours or not? If they don’t pass, they don’t qualify. It’s simply the law.
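
As an illustration of the kind of automated pass/fail check behind those drill reports, here is a minimal sketch in Python. The 24-hour objective comes from the banking example above; the function name and timestamps are hypothetical, not any vendor’s API.

```python
from datetime import datetime, timedelta

# Hypothetical regulatory objective from the banking example above:
# everything must be restored in under 24 hours.
MAX_RTO = timedelta(hours=24)

def dr_test_passed(restore_started: datetime, restore_finished: datetime) -> bool:
    """True if the timed disaster recovery drill met the objective."""
    return (restore_finished - restore_started) <= MAX_RTO

# Illustrative quarterly drill result:
started = datetime(2024, 3, 1, 8, 0)
finished = datetime(2024, 3, 2, 2, 30)          # 18.5 hours later
print("PASS" if dr_test_passed(started, finished) else "FAIL")  # prints PASS
```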

Pete Wright:
And now you’re talking about regulatory requirements for disaster recovery. This is what was hitting me as I was reading up on the subject before we started our conversation today. How do regulatory requirements impact institutions that are obviously regulated, like financial services? And how useful are regulations as a guideline for unregulated industries, to help them build disaster recovery plans that hold up against more potential troubles?

Rodolfo Casás:
Yes. The regulatory requirements for different companies in different countries can be daunting. There’s a lot of regulation around. And usually customers come to you saying, “I need to keep the backups for 10 years,” or “I need to be up and running in under two hours,” or “the law is forcing me to use at least two public clouds, or to be able to move from one public cloud to another,” so they have to use a hybrid cloud. It’s a regulatory requirement. And I will share several links with you afterwards.

I find them very useful. One of them is from Microsoft: they have a very nice page that explains, country by country, what the regulations are for your country. I find that really useful. There are certain regulations coming now. Everyone is probably aware of GDPR, or the DORA regulation for the financial services industry. Banks, for example, or the healthcare industry, also have very strict regulatory requirements.

Pete Wright:
Do you have a sense of who’s best in class, who’s doing it well? Maybe it’s country by country, but in your work, do you see countries that are regulated most intelligently for disaster recovery? Who’s a role model?

Rodolfo Casás:
I think the European Union has put a lot of regulation in place with regard to privacy, and with regard to enforcing regulations in very critical industries like healthcare and banking. And I think other countries are following. I can’t say they’re the best, but I can say I think they’re leading on this. And then I hear from other countries, “Hey, it’s done in the European Union and it’s coming to our country as well.” That’s what I hear.

Pete Wright:
Yeah, that’s my sense too. Whether or not the consensus is that the EU is doing it best, they’re certainly the most thoughtful about it right now as an organization.

Rodolfo Casás:
Yes, I agree. Yes.

Pete Wright:
Where do we go from here? I mean, when you’re looking at… I feel like I interrupted you in the middle of our disaster recovery plan conversation.

Rodolfo Casás:
One of the things I wanted to discuss is this confusing terminology. And with regard to Trilio, I wanted to explain how one of our features helps with doing fast disaster recovery. That feature is Continuous Restore. But first, I need to explain two concepts we use when we talk about disaster recovery. One is RTO, recovery time objective, and the other is RPO. The difference between the two: RTO is how much time you can allow your applications to be down, and RPO is how much data you can afford to lose.

Pete Wright:
What does RPO stand for?

Rodolfo Casás:
Recovery point objective. For example, you can have high availability and replication that provide you with an almost non-existent RTO and RPO. Say you have two clusters and you enable some kind of storage replication. If cluster one fails for whatever reason and you have to do a disaster recovery to cluster two, you will get a very low, or even non-existent, RTO and RPO: you don’t lose any data and your applications have no downtime.
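
To make the two definitions concrete, here is a minimal sketch in Python of how RTO and RPO are measured against an actual incident. The timestamps are hypothetical; the two subtractions are just the definitions Rodolfo gives.

```python
from datetime import datetime

# RTO achieved = how long the application was down.
# RPO achieved = how much data was lost: the gap between the last usable
#                recovery point (backup or replica) and the failure.
last_recovery_point = datetime(2024, 6, 1, 11, 45)
failure_time        = datetime(2024, 6, 1, 12, 0)
service_restored    = datetime(2024, 6, 1, 13, 30)

print(f"RTO achieved: {service_restored - failure_time}")    # 1:30:00 down
print(f"RPO achieved: {failure_time - last_recovery_point}") # 0:15:00 lost
```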

Pete Wright:
Even in a hybrid cloud environment, where you’re switching from, say, AWS to Azure?

Rodolfo Casás:
I couldn’t say for every case. If we’re speaking about Kubernetes, it’s a bit more complex, but there are storage solutions out there that could do this. The thing about replication, though it’s a fantastic approach: for example, there’s OpenShift Data Foundation, the storage solution from Red Hat and IBM, which has Metro-DR and Regional-DR. They’re fantastic solutions. With Metro-DR, they’re doing synchronous replication between sites.

So you don’t lose any data, and your application keeps running. If cluster one fails, cluster two kicks in and you don’t notice anything; everything is orchestrated. But it has certain prerequisites about latency and geographical distance. If you don’t meet the latency prerequisites, you can do asynchronous replication, which is very good as well. But at the end of the day, it’s more like a high availability solution, in my opinion.

Because if something goes wrong in cluster one and your volumes are corrupted on one side, they will be corrupted on side two. If that happens, then you need to recover from backup. Trilio is a backup solution for Kubernetes, but we don’t just recover from backup. And this is the whole point: we can use backups to do a disaster recovery very fast, precisely because we don’t restore everything from the backup target at recovery time.
We have this Continuous Restore feature. This is how it works. We take a backup from cluster one to the backup target. Once that backup is written, we start pre-staging the data from that backup on the secondary cluster. We keep doing that with fulls and incrementals all the time, asynchronously: backup, then pre-stage from the backup on the other side.

Then if something bad happens: if we didn’t have that functionality, we would need to move all the data from the backup target to the cluster, and that takes a long time. But with our feature, the data is already on the target cluster, so we can recover really, really fast. It’s just the orchestration that remains. When a backup is written to the backup target, a polling mechanism checks every 60 seconds whether there’s a new backup, either full or incremental.

If there is, Trilio starts pre-staging the data from that backup on the destination cluster. When that finishes, and another backup happens, another consistent set is created on the destination. The thing is, when a disaster happens, any other tool, as far as I know, has to copy the data from the backups on the NFS or S3 target to the cluster. If we’re talking one gigabyte, two gigabytes, 10 gigabytes, that’s fine.

But if we start talking about big volumes, or VMs with big disks, that’s going to take a long time. Let’s say you have disks of 200 gigabytes and you have 100 VMs. Now you have 20 terabytes to recover. How long is that going to take? A long time. It will heavily impact your RTO, your recovery time objective. So yes, high availability solutions are fantastic. But if you need to do disaster recovery, you have to do it as fast as possible. That’s Trilio’s approach: Continuous Restore.
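
Two parts of that explanation lend themselves to a sketch: the 60-second polling loop that pre-stages new backups, and the transfer-time math behind the 20-terabyte example. The Python below is purely illustrative pseudocode of the pattern as described, not Trilio’s actual implementation or API; every method name in it is hypothetical, and the link speed in the calculation is an assumption.

```python
import time

POLL_INTERVAL_S = 60   # the 60-second polling cycle mentioned above

def poll_and_prestage(backup_target, destination_cluster):
    """Illustrative polling loop: watch the backup target for new full or
    incremental backups and pre-stage each one on the destination cluster,
    so the data is already there if a disaster recovery is triggered."""
    seen = set()
    while True:
        for backup in backup_target.list_backups():    # hypothetical call
            if backup.id not in seen:
                destination_cluster.prestage(backup)    # hypothetical call
                seen.add(backup.id)                     # consistent set ready
        time.sleep(POLL_INTERVAL_S)

# The 20 TB example: 100 VMs x 200 GB disks, copied over a 10 Gbps link
# (the link speed is an assumption, purely for illustration).
total_bytes = 100 * 200 * 10**9          # 20 TB
bytes_per_second = 10 * 10**9 / 8        # 10 Gbps is about 1.25 GB/s
print(f"~{total_bytes / bytes_per_second / 3600:.1f} hours just to move the data")
# ~4.4 hours of copying before the restore can even be orchestrated
```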

Pete Wright:
And we talked about that sweet spot. I mean, you’re saying Trilio can handle the big data sets.

Rodolfo Casás:
Yes, yes. At the moment it’s not an extra license or anything; it’s included in the product. You can set up backup only, or you can do Continuous Restore. And it is agnostic, multi-cloud, and hybrid. So, for your question before: from AWS to AKS, or AKS to GKE, from [inaudible 00:19:34], whatever. And it’s not point to point. You can back up and pre-stage from one cluster to three, if you want, or you can do a backup mesh where several clusters write into each other, if I can use that term.

Yeah, it’s agnostic. You can do it between public clouds. So we cover a lot of regulations, like the ones for the financial services industry: how do you get out of your cloud if something goes wrong? No problem. With Trilio’s Continuous Restore, you can restore in another cloud in no time.
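
As a rough picture of the two topologies he describes, one-to-many pre-staging versus a full backup mesh, here is an illustrative Python structure. The cluster names are made up, and this is only a way to visualize the shapes, not a configuration format.

```python
# Purely illustrative topologies; the cluster names are hypothetical.

# One-to-many: one source cluster pre-stages its backups onto three others.
one_to_many = {
    "cluster-aws": ["cluster-aks", "cluster-gke", "cluster-onprem"],
}

# Backup mesh: every cluster pre-stages into every other cluster.
clusters = ["cluster-aws", "cluster-aks", "cluster-gke"]
mesh = {src: [dst for dst in clusters if dst != src] for src in clusters}
print(mesh)
```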

Pete Wright:
Well, if you’re listening to this and you’re a system administrator or an architect, you’ve got to go look at this stuff. So we’re going to put a lot of links in the show notes: the links that Rodolfo pulled from for this conversation, and links to Trilio’s resources on this subject. We would love for you to learn a little bit more about it. Disaster recovery: important stuff. Take it from those who have lost so much data. Please learn about disaster recovery. Thank you so much, Rodolfo.

Rodolfo Casás:
Thank you very much, Pete.

Pete Wright:
And thank you everybody for downloading and listening to the show. We sure appreciate you taking the time to do this. Definitely go learn more about the show. But the best thing you can do to help us, if you like what you’re hearing, is to share the show with others. That’s the best way to help us spread the good word of disaster recovery, high availability, and backups. On behalf of Rodolfo Casás, I’m Pete Wright, and we’ll catch you next time right here on Trilio Insights.