Key Concepts and Best Practices for OpenShift Virtualization

Transcript:

Pete Wright:
Hello everybody and welcome to Trilio Insights on TruStory FM, I’m Pete Wright. In episode two of this show, Rodolfo Casás introduced us to the OpenShift mindset. Back then, we promised he’d be back to dig into some more detail on just how Trilio for OpenShift and OpenShift Virtualization data protection get the job done. And today we make good on that promise with the features you can’t miss.
Rodolfo.

Rodolfo Casás:
Hello.

Pete Wright:
So good to talk to you again.

Rodolfo Casás:
Yes, it is so nice to talk to you again.

Pete Wright:
I’m excited to talk about this because I feel like it’s part two of a conversation you and I started now months ago. We’re going to be talking about the features you can’t miss for Trilio for OpenShift and OpenShift Virtualization data protection, and there are some pretty stunning things on this list.

Rodolfo Casás:
Yes, we thought it would be very interesting to make a list of topics that come up regularly in our conversations with customers and prospects. Over time, you start figuring out what customers react to, if you know what I mean, and what we think is very important, because that’s the philosophy we built our product with, and also what they think is most important for them. Because at the end of the day, maybe you don’t need all the features. You’re just looking for that one feature where you say, “Oh, I have to buy this feature. I need this feature in my cluster,” in my infrastructure. So that’s what we’re talking about today.

Pete Wright:
Okay. So as a point of introduction, why don’t you tell us a little bit about Trilio for OpenShift and the significance for those customers?

Rodolfo Casás:
Yeah. So Trilio and Red Hat have been partners for many years now, starting with OpenStack environments, and it’s been a great partnership. We also worked together on Red Hat Virtualization, and then when Kubernetes started up… well, even before Kubernetes they had started with OpenShift, and then they brought Kubernetes into OpenShift. So OpenShift is like an opinionated version of Kubernetes.
And lately, one of the big things that is coming is OpenShift Virtualization, which is running virtual machines inside OpenShift. Now, through this partnership, we’ve built extra integrations and extra features for Red Hat OpenShift. Many of the features we have work on every Kubernetes distribution, and for some of the topics we’re going to discuss today we are agnostic about the distribution, so they will work on any Kubernetes distribution. But some of them are focused on OpenShift Virtualization, or are integrations with, for example, Ansible Automation Platform or Advanced Cluster Management from Red Hat. So yes, some are agnostic, some are Red Hat-focused.

Pete Wright:
Let us begin with the OpenShift console plugin. Feature number one.

Rodolfo Casás:
Yeah, the same day we read that we could have a custom dashboard for Trilio in the OpenShift console, we thought, “We need to do this, because we’ve done it for Red Hat OpenStack and also for Red Hat Virtualization.” So now, when Trilio’s customers go to the OpenShift UI, they can see a tab that says Trilio Backups, right beside the Virtualization tab or the Operators tab.
So I think this is really good for any OpenShift customer, but it is especially nice for customers that are going to build standalone OpenShift virtualization platforms. If a customer is just going to run virtual machines on top of OpenShift, it is really easy to click a backup button inside the OpenShift UI without going to the Trilio UI or any other tool. If I have to do something very quickly because I’m going to upgrade an operating system or make some critical change to a VM or a set of VMs, I can say, “Backup, put it here, execute,” and that’s it. You don’t need to worry about anything else. It’s a really simple UI inside the OpenShift console.
So that’s the use case for this: start working with backups and data protection from day one, with no specific knowledge. You don’t need deep skills to do this.
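For readers who want the equivalent outside the console, here is a minimal sketch of submitting that same kind of ad-hoc, pre-change backup through the Kubernetes API. The Kubernetes Python client calls are standard; the Trilio group/version, kind, and spec fields are assumptions for illustration only, so check the CRDs actually installed in your cluster before relying on them.

```python
# A minimal sketch of submitting an ad-hoc, pre-change backup through the
# Kubernetes API instead of the console button. The Kubernetes client calls
# are standard; the Trilio group/version, kind, and spec fields are
# assumptions for illustration only. Check the CRDs installed in your own
# cluster (for example, `oc get crd | grep trilio`) before relying on them.
from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig / oc login
api = client.CustomObjectsApi()

backup = {
    "apiVersion": "triliovault.trilio.io/v1",   # assumed group/version
    "kind": "Backup",                            # assumed kind
    "metadata": {"name": "pre-upgrade-vm-backup", "namespace": "vms-prod"},
    "spec": {                                    # assumed fields
        "type": "Full",
        "backupPlan": {"name": "vms-prod-plan"}, # placeholder plan name
    },
}

api.create_namespaced_custom_object(
    group="triliovault.trilio.io", version="v1",
    namespace="vms-prod", plural="backups", body=backup,
)
print("Backup request submitted; watch its status in the Trilio Backups tab.")
```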

Pete Wright:
And it’s, I think, the first brick in the wall of speed. So many of the things we’re talking about today are about speed and efficiency, and just having a clean user interface that you can manipulate quickly and easily is step number one.

Rodolfo Casás:
We sometimes forget about making things easier. When I show our demo to customers, they say, “Oh, that’s nice. It was really simple to do that.” And yeah, we sometimes overlook simplicity as a feature, but it is one.

Pete Wright:
Absolutely. Absolutely. Feature number two, we want to talk about operator data protection. What are we talking about here?

Rodolfo Casás:
Yes. So there are many ways to deploy applications in OpenShift. Not long ago I created a reference architecture, and it ended up being a document describing different ways of consuming Trilio. I’ve just explained one of them, the OpenShift console plugin. Okay? Or you could integrate Trilio with your GitOps pipelines, or you could use the Trilio UI, or you could use automation.
Now, there’s a way of deploying applications in OpenShift which is operators, okay? Kubernetes operators, OLM operators. And what I see when I talk to customers is that they are the big forgotten piece. Customers ask me about persistent volumes. They all want to know if we protect metadata. Some of them want to know if we protect container images, but no one usually asks about operators. And operators have gained a lot of traction in the Kubernetes world and ecosystem.
So the problem is, if you are using operators, and most Red Hat customers are doing that, they use 3scale, they use Keycloak, they use Data Grid, whatever operator. The problem is if you are not following best practices, which actually means keeping all the operator configuration in a Git repository, so that if an operator upgrade fails, you can reinstall and get all your configuration back. But guess what? When I talk to people, customers, Red Hat support, they tell me, “Listen, 95 or even 99% of customers don’t follow best practices.”
So the only way to protect yourself against a failed upgrade of any operator is data protection. And as far as I know, we are the only ones doing that. We have a customer with 900 clusters using a very specific operator, and they had that gap. Using automation with Ansible and Trilio, they can now protect a lot of those clusters with Trilio. That’s the problem, and that’s the solution.
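As a side note, here is a small sketch of the Git best practice Rodolfo says most customers skip: exporting an operator’s OLM configuration so it can be committed to a repository. The operators.coreos.com API group is the real OLM group; the namespace and file names are placeholders for illustration.

```python
# A small sketch of the Git best practice most customers reportedly skip:
# exporting an operator's OLM configuration so it can be committed to a
# repository. The operators.coreos.com API group is the real OLM group;
# the namespace and file names are placeholders for illustration.
import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
namespace = "keycloak"  # placeholder: wherever the operator is installed

subs = api.list_namespaced_custom_object(
    group="operators.coreos.com", version="v1alpha1",
    namespace=namespace, plural="subscriptions",
)
for sub in subs.get("items", []):
    name = sub["metadata"]["name"]
    # Drop server-managed fields so the manifest is reusable on reinstall.
    sub.pop("status", None)
    for field in ("resourceVersion", "uid", "creationTimestamp", "managedFields"):
        sub["metadata"].pop(field, None)
    with open(f"{namespace}-subscription-{name}.yaml", "w") as f:
        yaml.safe_dump(sub, f)
    print(f"Exported Subscription {name}; commit this file to your Git repository.")
```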

Pete Wright:
Yeah. Yeah, that’s amazing. And also not surprising that 99% of customers are not using best practices in managing these things.
You put the thing I think I’m probably most enthusiastic about right in the middle. So this feature is stunning when you see the data. We’re talking about low recovery time objective, RTO.

Rodolfo Casás:
Yes.

Pete Wright:
Tell the people why I love this so much.

Rodolfo Casás:
Yes, we discussed this on the last episode of the podcast, but I think it’s such an important feature now that OpenShift Virtualization is coming. A lot of customers are asking for migration of VMs to OpenShift Virtualization and deploying clusters just for OpenShift Virtualization. And what that means is that we are going to have hundreds or thousands of VMs inside OpenShift Virtualization. That means we are going to have big persistent volumes with plenty of data inside those persistent volumes.
Now, there are replication strategies, like we discussed last episode, which are fantastic for very low or almost zero RTO and RPO, but there are situations when you have to recover from a clean backup. And then you have a problem, because if you have hundreds of VMs and hundreds of persistent volumes with hundreds of gigabytes or even terabytes, the bottleneck will be restoring those backups from your backup target, be it NFS or S3. It’s going to take hours or days to recover all those workloads, all those backups, to your cluster.
Now, what we do with our continuous restore feature is pre-stage your backups, okay? We will create volume snapshots in the destination cluster, so in case you need to restore from a backup, the data is already there. We’ve been replicating it asynchronously in the background. For example, yesterday I was writing a blog post and taking backups of a VM. It had three disks, almost 150 gigabytes. So I took a backup to Amazon S3 and I was doing continuous restore to a ROSA bare-metal cluster. The restore from the S3 bucket was taking a long time, which was expected, but the restore from my continuous restore backups took less than three minutes. Three minutes.
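To put rough numbers on that difference, here is a back-of-envelope sketch. The 150 GB VM and the roughly three-minute staged restore come from the example in the episode; the S3 throughput and the 200-VM fleet size are illustrative assumptions, and restores are assumed to run serially against a shared backup target.

```python
# Back-of-envelope arithmetic behind the RTO point. The 150 GB VM and the
# roughly three-minute staged restore come from the example in the episode;
# the S3 throughput and the 200-VM fleet size are illustrative assumptions,
# and restores are assumed to run serially against a shared backup target.
MB_PER_GB = 1024

vm_size_gb = 150
s3_throughput_mb_s = 100          # assumption: ~100 MB/s sustained from the backup target

one_vm_minutes = vm_size_gb * MB_PER_GB / s3_throughput_mb_s / 60
print(f"One 150 GB VM from S3:         ~{one_vm_minutes:.0f} minutes")

fleet_vms = 200                   # assumption: a modest OpenShift Virtualization fleet
fleet_hours = fleet_vms * one_vm_minutes / 60
print(f"{fleet_vms} such VMs from S3:          ~{fleet_hours:.0f} hours")

# With continuous restore the data is already staged as volume snapshots in
# the destination cluster, so each restore is mostly metadata work plus a
# snapshot clone: roughly the three minutes measured in the episode.
print("Same VM from staged snapshots:  ~3 minutes (measured in the episode)")
```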

Pete Wright:
It’s just… you need a rim shot for that one. That’s extraordinary.

Rodolfo Casás:
Yeah. Yeah. It’s because the data is already there, so it solves a lot of problems. It goes from, “Okay, I will take the weekend to recover my VMs,” to recovering your VMs during the lunch break. That’s the difference, and it is huge.

Pete Wright:
Crazy. That’s amazing. Let’s talk about application-level encryption.

Rodolfo Casás:
Yes. So normally, if customers want to encrypt their backups, they will have a storage solution, which will normally have one encryption key for all the backups, and that’s okay. But if you want to do it better… What we see in the industry is that certain customers have multiple software vendors or software companies deploying applications to their OpenShift or Kubernetes clusters. And the problem with that is that they will have shared backup storage and a shared private key, so any team could see the backups from the other teams. And if you don’t like that, you need to encrypt the backups with different encryption keys.
You can do this straight from Trilio. You can enable your self-service customers to use their own private keys, bring your own key, and only they can use those keys and open those backups with those encryption keys. That’s how it works. We are providing isolation between teams so they don’t necessarily see what each other is doing with regard to backups.
For example, let’s say I’m a health insurance company. I’ve seen that before. And I have three system integrators, and they’re deploying products and applications in different namespaces. So there are different teams connected to my cluster, and that’s okay. At the Kubernetes or OpenShift level, they’re isolated, they don’t see each other. But when it comes to backup time, they’re putting the backups in the same place and using the same private encryption key, and that is where the backups end up shared, and we don’t want that. This usually resonates a lot, especially with banking institutions; they like that isolation. That’s one example.
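Here is a minimal sketch of the bring-your-own-key idea under stated assumptions: each team keeps its encryption key in its own namespace as a Kubernetes Secret, and the backup configuration references that Secret. Creating the Secret uses the standard Kubernetes API; the encryption reference is a hypothetical, illustrative shape, not a documented Trilio schema.

```python
# A minimal sketch of bring-your-own-key: each team keeps its encryption key
# in its own namespace as a Secret, and the backup configuration points at
# that Secret. Creating the Secret uses the standard Kubernetes API; the
# "encryption" reference below is a hypothetical field shape, not a
# documented Trilio schema.
import base64
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

team_ns = "team-a"  # placeholder namespace for one system integrator
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="team-a-backup-key", namespace=team_ns),
    data={"passphrase": base64.b64encode(b"use-a-real-key-from-your-kms").decode()},
)
core.create_namespaced_secret(namespace=team_ns, body=secret)

# Hypothetical reference from a team's backup configuration to its own key:
encryption_ref = {"encryption": {"secretRef": {"name": "team-a-backup-key"}}}
print("Team A's backups would use a key only Team A can read:", encryption_ref)
```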

Pete Wright:
Regulated healthcare.

Rodolfo Casás:
Yes.

Pete Wright:
Banking. Yeah, financial. Okay. All right. Continuing our application conversation, protecting application container images. This is unique.

Rodolfo Casás:
Yes, yes. That’s a feature we also have. We thought it was interesting a long time ago, and I always discuss it with customers. When I start talking about this, some customers say, “Oh yes, this has happened to me.” And only very recently, as the industry is maturing, as recently as yesterday, I didn’t have to explain what the problem was. A customer, a prospect, asked me, “Listen, do you protect container images? Because what happened to us is we had our image registry in the same data center as our Kubernetes cluster.” I said, “Oh, shit.” So the problem is they couldn’t restore their applications to another data center. It took them a long time to push those images to the disaster recovery cluster, and only then could they start restoring.
Now, this is what we do. We don’t do a mirror of your image registry. First of all, because when I explain this, customers ask me if that’s what we do, and we don’t do that. Usually if you’re using an image registry like Quay.io or any other registry, it will have its own mirroring options or backup and restore options. That’s not what we do. Okay?
What we do is application-level backup. So if you have, for example, a WordPress application with a MySQL database, it’s using two container images. And alongside your metadata and your data, we will take those two container images and put them with your backup. So if you need to restore that application, you have the option to also restore the images. And when do you need to restore the images? That’s the question. I can tell you, for example, two situations.
One situation is, for example, what happened to this customer. In that situation, they could have easily created a temporary image registry, restored the images from the Trilio backup as part of the normal restore workflow, which is very easy to do in the UI, and then the restore workflow would start restoring the application, because the images are there and can be pulled. So that’s situation number one.
And situation number two is when you have to do some security forensic analysis, or for some auditing purposes you have to restore an application from four, five, six months ago. And the problem with that is that your image registry probably will not have those older images. Probably you’ve done some pruning, some cleaning up, and your old images will not be in your registry. So now you cannot restore older versions of your application. But with Trilio you can, because you can protect everything. So you can restore an application from four months ago. If you need to do that to meet regulatory compliance, or for whatever other reason, you can do it. It’s not impossible; we do it.
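To make “application-level” concrete, here is a small sketch that lists the container images an application’s Deployments reference, which is what a backup would need to capture alongside the metadata and data. It uses only the standard Kubernetes Python client; the namespace is a placeholder.

```python
# Listing the container images an application's Deployments reference: this
# is what a backup would need to capture alongside the metadata and data.
# Only the standard Kubernetes API is used; the namespace is a placeholder.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

namespace = "wordpress"  # placeholder: the application's namespace
images = set()
for deploy in apps.list_namespaced_deployment(namespace).items:
    for container in deploy.spec.template.spec.containers:
        images.add(container.image)

# If the registry holding these images is unreachable, or has pruned older
# tags, a restored application cannot pull them. That is the gap that
# protecting images alongside the backup is meant to close.
for image in sorted(images):
    print(image)
```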

Pete Wright:
Integration with Ansible AAP and ACM. Explain to me the importance of these integrations.

Rodolfo Casás:
Oh, yes. That’s, I think, one of the other big topics: integration with automation tools. We provide out-of-the-box integration with these tools because we are a cloud-native tool. We run inside Kubernetes and OpenShift. So whatever automation tool you are used to, you can use it to manage Trilio’s backup strategies. And you can use automation to do two things.
First of all, you can deploy your whole Trilio configuration to your thousand clusters easily with your automation tool. And after that, your configuration, for example if it says take a backup every hour or every day, will just happen. Okay? So that’s like enabling the backup strategy at scale. You don’t want to go cluster by cluster enabling backup configuration namespace by namespace. No. With automation tools, you can do that very simply.
We have a lot of demonstrations using Ansible, for example, or Ansible Automation Platform. We even have a certified Ansible collection, certified by Red Hat. So this is option one: enable your whole backup strategy in one go. And of course with that, you have the possibility of modifying your backup configuration at any time by just running the automation again. So you can adapt your backup configuration to your needs at any time, without having to do it one by one, which doesn’t make sense.
And the second thing you might be interested in is: okay, I want to take a backup of this number of clusters, this number of namespaces, right now, for whatever reason. Let’s say you have to do a very fast data center migration. Okay? So yeah, you can do that. You can have some playbooks or some automation to take backups of hundreds of namespaces, and you just run them. This includes migration. You can do migration end-to-end with your automation tools. You can do disaster recovery with automation tools. You can do disaster recovery testing, disaster recovery failback. It’s up to you. We enable this flexibility and this scalability.
And one of the best I’ve seen is the integration with Advanced Cluster Management from Red Hat. Just two basic policies: one will install Trilio’s operator in all your clusters, and the second one will take care of the backups. That’s all. And the goal is that it doesn’t matter how many clusters you add to ACM, we will take care of everything, and that’s it.
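For a feel of what “enable the backup strategy at scale” looks like in code, here is a rough sketch that loops over the clusters in a kubeconfig and applies the same backup policy to each. The Kubernetes client calls are standard; the policy’s group, kind, and fields are assumptions for illustration, and in practice you would hand this job to Ansible Automation Platform or an ACM policy rather than a local script.

```python
# A rough sketch of enabling a backup strategy at scale: loop over the
# clusters in your kubeconfig and apply the same backup policy to each.
# The policy's group/version, kind, and fields are assumptions for
# illustration; in practice this is the kind of task you would hand to
# Ansible Automation Platform or an ACM policy rather than a local script.
from kubernetes import client, config

contexts, _ = config.list_kube_config_contexts()

policy = {
    "apiVersion": "triliovault.trilio.io/v1",   # assumed group/version
    "kind": "Policy",                            # assumed kind
    "metadata": {"name": "daily-backups", "namespace": "trilio-system"},
    "spec": {                                    # assumed fields
        "scheduleCron": "0 2 * * *",             # take a backup every day at 02:00
        "retention": {"latest": 7},              # keep the last seven backups
    },
}

for ctx in contexts:
    api_client = config.new_client_from_config(context=ctx["name"])
    custom = client.CustomObjectsApi(api_client)
    custom.create_namespaced_custom_object(
        group="triliovault.trilio.io", version="v1",
        namespace="trilio-system", plural="policies", body=policy,
    )
    print(f"Applied daily-backups policy to cluster context {ctx['name']}")
```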

Pete Wright:
What’s your sense of stability when it comes to automation? I’ve run into executives who get frustrated with some of the integration tools, automation integration tools. They say it’s automation rot, it works for a month, it works for a year, and then suddenly something breaks and it becomes unstable. What’s your comment on stability as a feature of these integrations?

Rodolfo Casás:
Oh yeah, that’s a complex topic, because automation tools get upgrades and that might break things. But for example, Ansible has quite nice tools for proper QA and testing. So usually all big enterprises will have their own DevOps departments and automation departments, and they will have their dev clusters, clusters where they can test everything. So they have rock-solid automation. Yes. You cannot just write one automation playbook and forget about it. I don’t think that’s feasible. You have to be looking at it continuously. That’s what I think.

Pete Wright:
Yeah, no, I mean, it sounds like it. The whole idea of automation rot really is an indicator that you probably don’t have a DevOps QA team that can actually keep these things running. But it’s important to note that once you put these automations in place, you got to keep them going. You got to keep them clean.

Rodolfo Casás:
Yeah, of course there’s some maintenance. Yeah, for sure.

Pete Wright:
Yeah. Yeah, sure. This has been fantastic; look at what we’ve done here. What you have built out is not only some fantastic features of Trilio and these integrations, but a checklist. We’ve built a checklist of the things enterprise customers need to go ask about, and how to get things done, to deliver a backup solution that works.

Rodolfo Casás:
Yes, I agree with you. I have conversations of all types. I have conversations where a customer is still very immature and you have to guide them on what we do or what this whole thing is about. Other customers come to you having already been talking to another data protection solution, so they come to you with very precise questions: this is what I want. I want to turn it the other way around. So if anyone is going to talk to any backup vendor out there, just listen to this podcast, listen to all the topics I just put on the table, and then ask them about these. Let’s see what happens.

Pete Wright:
Oh, Rodolfo has thrown down the gauntlet. Obviously, we get very excited about this stuff, but we hope that you’re listening to this…

Rodolfo Casás:
Yeah, of course. Yeah. I apologize.

Pete Wright:
Yeah, no, it’s great. Rodolfo and I, we hope you are as excited about this stuff as we are, because clearly we’re nerds about it and we love talking about it, but truly, the numbers behind these things are astounding. So check it out, ask other vendors, see what you think. We hope you get something out of it. So thank you everybody for listening. Thank you, Rodolfo…

Rodolfo Casás:
Thank you very much.

Pete Wright:
… for bringing your wisdom once again. And we’re going to put some links in the show notes. We’ll put links to resources that will help you learn more about this stuff. As far as I’m concerned, this is an official start here episode, right? We’ll just call it start here. If you want to learn about these can’t-miss features, it’s really, really great. So we appreciate you downloading, listening to the show. We appreciate your time and your attention. Just swipe up in the show notes for all those links. And on behalf of Rodolfo Casás, I’m Pete Wright. We will see you next time right here on Trilio Insights.