
Downtime Disasters: The High Cost of Downtime and the Quest for Resilience

Transcript:

Pete Wright:
Businesses rely on the uninterrupted flow of data and the constant availability of their applications. So what happens when the unexpected strikes and the systems we depend on grind to a halt? From high-profile outages that capture national attention to countless unseen daily disruptions, IT systems and application downtime has become an unavoidable reality.
The consequences can be far-reaching, from financial losses and damaged reputations to the erosion of customer trust and employee morale. But what steps can companies take to minimize these risks and build resilience in the face of an ever-evolving landscape of threats?
Today we sit down with David Safaii, Executive Chairman here at Trilio, to explore the causes, consequences and solutions to one of the most pressing challenges facing businesses in the digital age. I’m Pete Wright, and this is Trilio Insights.
David, welcome to the show. It’s about time.

David Safaii:
Well, thank you for having me. I feel like I should say: long-time listener, first-time caller.

Pete Wright:
I’ll take that, I’ll take that.

David Safaii:
Looking forward to this.

Pete Wright:
We’re talking about downtime today, and I wonder if we could start with just an overview of the landscape of the downtime issues that are facing our systems. Before we get a sense of where Trilio fits into this space, what are the trends you’re seeing in terms of frequency and severity?

David Safaii:
You’re starting with a tricky one, because the public really sees and hears about catastrophes, right?

Pete Wright:
Right.

David Safaii:
But outages happen more than you think. Sometimes it may be a network hiccup or a specific app or an entire cloud or cluster. It may just be in relation to a poor upgrade cycle or a product rollout. So it’s sort of like how long is a piece of string? I’m kidding of course, but that’s really the case. And the common causes I would say, they vary.
So you could have hardware failure, network outage, storage failure, security breaches, DDoS attacks, power outages, natural disasters, right? We’re in global weirding. There are physical threats behind all of these. Candidly, it’s the human element that is more common than not: things like software bugs, ransomware attacks and insider threats, accidental resource deletions, or even playbooks that went wrong, misconfigurations and things of that nature.

Pete Wright:
As we record this, it was just I think yesterday that all of the big AI players went down for a while, forcing millions and millions of people to use their brains for a few hours.

David Safaii:
Oh God.

Pete Wright:
It’s shocking. Shocking, I tell you. Do we have statistics on these areas? Do we know what is the biggest purveyor of these downtime instances?

David Safaii:
Yeah, so let’s just look at cyber attacks as an example, right?

Pete Wright:
Okay.

David Safaii:
I’ve read a report that said in 2022, the average length of an interruption after a ransomware attack was about 24 days.

Pete Wright:
24 days?

David Safaii:
It’s a long time, right? And 93% of enterprises admitted to having suffered a breach. Because of that breach, you have unplanned downtime, you have data exposure, you have financial loss. The compounding element is a financial one that could result in fines. You get leaked records and data, and the FCC has handed out over $100 million in fines to US carriers alone.

Pete Wright:
Wow. So, we’re getting into some of the costs and consequences. I notice downtime when Netflix is out. That’s usually because some Amazon something has gone south and many, many sites are offline. How do you, from the perspective of Trilio, calculate and frame the discussion of outage for your customers?

David Safaii:
A few years ago, Facebook and its family of applications had a six-hour global outage due to a configuration issue.

Pete Wright:
A configuration issue, okay.

David Safaii:
Right. The estimated ad revenue lost during that downtime was $100 million.

Pete Wright:
In six hours.

David Safaii:
Right. So that’s about $17 million per hour, that’s like a quarter million per minute.
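
As a rough check on that arithmetic, here is a minimal sketch in Python. It assumes only the two numbers quoted in the conversation, an estimated $100 million lost over a six-hour outage; the function name is illustrative and does not come from any report.

```python
def loss_rates(total_loss_usd: float, outage_hours: float) -> dict[str, float]:
    """Break a total outage loss down into per-hour, per-minute and per-second rates."""
    per_hour = total_loss_usd / outage_hours
    return {
        "per_hour": per_hour,
        "per_minute": per_hour / 60,
        "per_second": per_hour / 3600,
    }


# Figures quoted in the conversation: $100 million over six hours.
rates = loss_rates(100_000_000, 6)
for unit, value in rates.items():
    print(f"{unit}: ${value:,.0f}")
```

Under those assumptions the loss works out to roughly $16.7 million per hour, about $278,000 per minute, and about $4,600 per second.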

Pete Wright:
Yeah.

David Safaii:
Poof, gone. And that’s extreme, but ultimately the costs vary depending upon the size of the organization, how extensive the outage is, how long it lasts, and there are a number of other factors.

Pete Wright:
It makes me think about the cost to me when my services that I rely on are out and I can’t work. Right? That is a force multiplier when you think about just the income that companies are bringing in that is lost when services are down, but I can’t bring in income either, right? That’s a direct impact and significant.

David Safaii:
Oh, Pete, I tell you, Siemens put out a report that said the Fortune 500 lost about 1.5 trillion a year in unplanned downtime. Mind-blowing figure, right?

Pete Wright:
Wow.

David Safaii:
These are dollars that we’re talking about. Beyond those staggering numbers, look at the effects of downtime: on one hand you have a company that is losing revenue, but more importantly, you’ve got organizations that have precious lives at stake. And I think that’s kind of lost in some of this conversation too, because it runs from defense, where protecting our people in the field requires that mission-critical applications are available and recoverable, to space and ensuring that astronauts can come home safely.

Pete Wright:
Yeah, right. All stories that have been in the news this year, all of them.

David Safaii:
To here on earth, where we’re protecting hospitals and non-recovery is not tolerated. We’ve heard of a hospital that had downtime of 15 to 20 hours. In addition to the people in that building, you have to turn people away from the outside and they can’t receive the treatment they require. That is the cost of downtime, and you can’t put a number on that.

Pete Wright:
We are in an era where we hear these stories all the time, and it feels like reputational damage is sort of on fast-forward. Are you seeing brand reputational damage done and sticking for companies that are struggling to keep their systems online? Is it making a difference or an impact?

David Safaii:
Absolutely. Everything, everything is built on trust. Relationships are built on trust. Brands have consistency they care about. They make good on promises. They provide and maintain transparency where they can, and they look to build credibility through positive experiences and reliability. Customer satisfaction and trust create loyalty and lifelong customers. IDC says that over 30% of outages result in data loss, and it’s 40% of disruptions that have led to some sort of brand reputation damage, and there’s a long tail to that.

Pete Wright:
Just this month, Google Cloud accidentally deleted a customer account and the client was UniSuper, and the number was 647,000 users who faced two weeks of downtime because of a cloud bug. I don’t even know, is there math that exists that’s big enough to calculate the overall impact and the ripples in the pond of the impact of that kind of a deletion?

David Safaii:
Well, that example is incredible, right? There are hard dollars and there are soft dollars. We talked about revenue, but also, to what you just mentioned, there’s a loss in employee productivity, right? You just lost two weeks of productivity beyond the fact that you need to refocus and rebuild your environments. Do you need to procure new hardware? Is this a case of ransomware where you’re held hostage and you’re dealing with that, right? So it’s pretty incredible.

Pete Wright:
Well, on that note, let’s transition to mitigating our downtime risks, right? First things first, how many organizations are you talking to that have and regularly test their disaster recovery plans?

David Safaii:
Oof. So that’s a really good question, because I can tell you, people do not test enough. There’s a good Forrester report out there that says about 31% of organizations test their DR once a year. And that’s a scary metric. I would be salivating if I was a cyber criminal. Testing can be easy and you can automate it if you’ve got the right tools.

Pete Wright:
A little bit off the map of our conversation: is there a difference? Do you notice a geographic difference globally? I’m thinking specifically about the EU right now, and typically a more regulated environment. Do you find or do you know of an increase in awareness and application of disaster recovery techniques compared to western systems operators?

David Safaii:
Yeah, that’s a good question, and I’d say a lot of it’s led by, well, one, regulation, and two, culture. You’ve got certain places, so take Europe as an example, where GDPR has teeth, and soon you’re going to see other compliance issues within the banking sector, such as DORA. And so people need to understand their liabilities and the holes that they have that they need to plug, or they could see some severe fines.
Now, when I talk about geography, you have some other people from a culture perspective who require backup and disaster recovery as a day zero or day one operation, because if you’re in South Korea, you’re always trying to prepare yourself, and that’s a scary thing too, and that’s just part of the culture. And so backup and disaster recovery is a very important thing to South Koreans.

Pete Wright:
My hunch is there are going to be people who listen to this who agree with you, who are inside system operators, and hear this and think, “Yeah, we probably don’t test our DR systems enough.” This is the look of shame. So what should they be looking for? When you think about Monday morning, 8 AM, day one, how do we get started?

David Safaii:
You got to make sure you invest in an option that’s built for the technology you’re using. Legacy solutions are not going to cut it for cloud native applications.

Pete Wright:
Well, and this is interesting though, right? Because to the Google example, there’s every bit of a reason for operators to think, “Okay, well, if everything’s on Google Cloud, surely they’ve got their systems covered. This will never happen,” until 647,000 people get their user accounts deleted.

David Safaii:
It’s a huge fallacy. It’s incredible. The number of times I’ve had conversations about this where they say, “Oh, we just trust our public cloud provider.” They’re there to provide you with the infrastructure to run your applications. They’re not there to provide you with the application or data protection. That’s a level of granularity that you’re not going to receive, and they don’t know your applications like you do. They don’t know the data that needs to be protected.
When we’re talking about cloud native applications, it requires this intelligence in recovery. It can’t be a hammer. You’re trying to reorchestrate a point in time as quickly as possible. While the cloud providers provide you with plumbing to the house, you have to be the one to understand what fixtures you want and how to guide the rest of the stuff inside the house. Applications are constantly changing. They’re constantly evolving, they’re constantly scaling. You understand your application. You need to have something that is application aware and solutions that are tailored to the needs of those environments.

Pete Wright:
The application awareness and protection, recovery awareness extends beyond just the technical sphere. We have these. We’ve already brought up reputational areas and we’re talking about the EU. We’ve talked a little bit about compliance areas and operational areas. So we have these arms, technological, operational, reputational and compliance for building out a disaster recovery plan. When you’re looking at organizations building out these systems, how important is it to equally address all of these, or are we just really focusing on the tech?

David Safaii:
We’re empowering the customer and enabling them to recover so they can achieve the level of resiliency that they require. So they may have service-level agreements that they need to adhere to, whether it’s internal or external. Expectation around these service levels really goes back to that trust comment that I made earlier in discussing brand. SLAs provide guidelines on response times, resolution times, and overall service quality. And really, in the end, all the parties are aware of the expectations and the outcomes, and things like SLAs provide a good guide for managing the cost associated with the downtime you’re allowed to accept and all the procedures that you want to wrap around it. It really comes down to the amount of risk that you’re willing to take on too.
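
One way to make “the downtime you’re allowed to accept” concrete is to convert an availability target into a time budget. The sketch below is a minimal Python illustration, assuming the SLA is expressed as an availability percentage measured over a 365-day year. The targets shown are illustrative and are not taken from Trilio or any customer agreement.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(availability_pct: float) -> float:
    """Return the hours of downtime per year that an availability target permits."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# Illustrative targets only.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability -> {allowed_downtime_hours(target):.2f} hours/year")
```

A 99.9% target, for example, leaves a budget of about 8.76 hours of downtime per year under these assumptions.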

Pete Wright:
In the hierarchy of the organization, who owns disaster recovery?

David Safaii:
Everyone. The conversation around disaster recovery and recovery in general, you’re starting to see this as a KPI in the boardroom, because God forbid something does go down and there is a ransomware attack, can I recover? So while a lot of our conversations have been at the managerial level or with the architects, the edict for this goes much higher up.

Pete Wright:
What’s it looking like for us moving forward? What is our future outlook? Is it just ransomware all the way down?

David Safaii:
As a company that’s witnessed recovery of business-critical applications firsthand during times of crisis, I’d say it’s about making sure the right tools are in place. You wouldn’t give a scalpel to a lumberjack, would you, right? If you’re planning properly and you’re prepared, the more you put in, the more you’ll get out. And the best thing that can happen is that you plan and you test. My advice is to get smart, do research, understand that things like high availability do not mean disaster recovery. Highly available systems are ones that aim to stay online as often as possible, but downtime can still occur in highly available systems. Disaster recovery is a proactive plan of action, and it details how you can recover after a disaster has happened. So be ready, be prepared, be native.

Pete Wright:
Do you have a measure or metric for how to think about budgeting for disaster recovery?

David Safaii:
In budgeting for disaster recovery, so that’s a good question, I think you have to look at the applications and the parts of your environment that must recover the quickest. So going back to those SLAs that I talked about before, what are the mission-critical applications that cannot tolerate downtime and have to be up as fast as possible? I’m going to invest more time, more processes, more technology around that versus some of the other applications that do have a tolerance for some lag time to recover. And I think based upon that analysis, you’ll start to put budgets in place that reflect what is required for those environments.
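
The prioritization described here can be sketched as a simple weighting exercise. The Python below is a minimal illustration, not a method described by Trilio: the application names, the cost estimates, the tolerated-downtime figures and the weighting rule (cost per hour divided by tolerated downtime) are all hypothetical assumptions chosen only to show that less tolerant, more costly applications draw more of the budget.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    max_tolerable_downtime_hours: float  # hypothetical recovery target
    downtime_cost_per_hour: float        # hypothetical estimate, in USD


def allocate_budget(apps: list[App], total_budget: float) -> dict[str, float]:
    """Split a DR budget in proportion to cost-per-hour divided by tolerated downtime."""
    weights = {
        a.name: a.downtime_cost_per_hour / a.max_tolerable_downtime_hours
        for a in apps
    }
    total_weight = sum(weights.values())
    return {name: total_budget * w / total_weight for name, w in weights.items()}


# Hypothetical portfolio: one mission-critical app and two with more tolerance.
apps = [
    App("payments", 1, 20_000),
    App("reporting", 8, 8_000),
    App("internal-wiki", 24, 2_400),
]
budget = allocate_budget(apps, 100_000)
for name, amount in sorted(budget.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ${amount:,.0f}")
```

With these made-up numbers nearly all of the budget goes to the application that can tolerate the least downtime; a real analysis would tune the weighting to the organization’s own SLAs and risk appetite.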

Pete Wright:
I appreciate your time today, David. Thank you for teaching us.

David Safaii:
No, thank you very much. This was a lot of fun and I hope to do it again.

Pete Wright:
Absolutely. Absolutely. Thank you everybody for downloading and listening to this show. Where should we send people to learn more about this subject? I’m sure Trilio’s got some resources we need to send them to.

David Safaii:
We do. We do. We’ve been taking quite a bit of time in developing some excellent content for people because it’s an important topic, and I highly recommend people come to the website, trilio.io. And in there you’ll find a bunch of resources: blogs and videos, some great podcasts and a number of other resources. And candidly, even if you’re just curious, ask us a question. We’re there to help. At the end of the day, our customers view us as partners in this journey, so we’re raising our hands to say: just ask.

Pete Wright:
I think that’s a great point. If you’ve never done this, if you’ve never invested in it, it’s okay not to know. It’s okay to ask.

David Safaii:
100%.

Pete Wright:
All right. Thank you all for downloading, listening to this show. We appreciate your time and your attention. Again, we encourage you to learn more. Just swipe up in the show notes for this episode and you will find all the links to the resources that we have mentioned that David has pointed out. On behalf of David Safaii, I’m Pete Wright and we’ll see you next time right here on Trilio Insights.