Application and infrastructure monitoring is key to your organization’s reliability and uptime—you want to detect and resolve issues early to minimize downtime and performance issues. Over the years, several monitoring best practices have been developed as a basic blueprint to help inform this process.
When organizations fail to implement monitoring best practices, the result is often poor resource utilization, a diminished user experience, and longer resolution times for troubleshooting efforts. Because Kubernetes is distributed and often ephemeral, establishing good monitoring practices is even more critical to provide insight into resource utilization, performance issues, and potential outages.
This article focuses on the best practices for monitoring Kubernetes. We review five monitoring best practices and then look at examples of implementing them using Kubernetes-related tools. This is part of a multi-pronged approach to getting the best overall view of your environment.
Summary of essential Kubernetes monitoring best practices
Best practice | Why it’s important | How to put it into practice in Kubernetes |
Identify and define key metrics | Defining key metrics helps you separate the signal from the noise, so you can identify when processes stray from the norm. | Use Prometheus to collect key metrics, including those related to CPU, memory, network, and disk I/O. |
Plan how to handle alerts | Alerts can become noise when they are not planned for, so it’s important to know who should receive them and what should happen when they are received. | Implement Prometheus alerting rules to integrate with Alertmanager for proper alert delivery. |
Monitor trends with dashboards | Visual data is more straightforward to interpret and gives quick insights for diagnosing problems. | You can create dashboards in Grafana for real-time metric monitoring across clusters, nodes, and pods. |
Establish data retention policies | Balance storage costs with data accessibility to keep relevant data for troubleshooting and analysis. | Configure Prometheus or other monitoring solutions, such as Thanos, with appropriate data retention settings. Archive older data when necessary. |
Integrate logging alongside metrics | Metrics alone may not provide a complete picture when debugging complex issues. Logs and traces help identify and troubleshoot specific events and microservice interactions. | Implement centralized logging (e.g., Loki) to supplement metrics and give more detailed insights across services and applications. |
Environment setup
In this article, we use MiniKube, a local distribution of Kubernetes, for demonstrations and install our monitoring stack using Helm.
Prerequisites for the demos: Docker (used as the MiniKube driver), MiniKube, kubectl, and Helm.
Once we have everything installed, we use a two-node MiniKube setup. Assuming you have MiniKube configured, you can run the following to set up a two-node environment using Docker as the driver.
minikube start --nodes=2 --cpus=2 --memory=4g --driver=docker
Verify that MiniKube is running using the following commands.
$ minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
minikube-m02
type: Worker
host: Running
kubelet: Running
$ kubectl get nodes
NAME           STATUS   ROLES           AGE   VERSION
minikube       Ready    control-plane   28d   v1.31.0
minikube-m02   Ready    <none>          28d   v1.31.0
You can use the version sub-command to verify that Helm is installed.
$ helm version
version.BuildInfo{Version:"v3.16.4", GitCommit:"7877b45b63f95635153b29a42c0c2f4273ec45ca",
GitTreeState:"dirty", GoVersion:"go1.23.4"}
Now MiniKube and Helm are installed and verified.

Identify and define key metrics
Key metrics offer the most insight into the health and performance of your Kubernetes environment. They are the metrics that matter. They often include node metrics such as CPU, memory, and disk I/O usage, and Kubernetes-specific metrics such as pod restarts, availability, and etcd performance.
Defining key metrics can be very specific to an application or organization. If you are just starting, Google has outlined four golden signals that can guide defining key metrics:
- Latency measures the time it takes to process requests—the system’s responsiveness to incoming requests. In Kubernetes, this could be measured by how long it takes for the API to respond to requests or for pods to spin up.
- Traffic measures the volume of requests entering the system. In Kubernetes, this could mean measuring traffic entering the cluster or between pods.
- Errors in Kubernetes environments could include 5xx errors from the application, pods failing to start up, or the application continuously restarting, which would indicate instability.
- Saturation indicates resource usage near capacity, such as disk I/O or CPU/memory utilization being near limits. In Kubernetes, one example would be to see if the disk IOPS (the measurement of how many reads/writes per second a disk can do) is sufficient for the environment. This is especially critical for the disk where the etcd database runs.
These golden signals provide a comprehensive framework for monitoring your Kubernetes environment.
Install Prometheus
Let’s look at a practical example by installing the Prometheus stack and examining saturation metrics, particularly CPU usage. The Prometheus stack includes Prometheus for metrics collection, Alertmanager for alerting, and Grafana for visualization—all essential tools for implementing effective Kubernetes monitoring.
Start by adding the Prometheus Helm repository.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Next, install the Prometheus stack using Helm.
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
Now verify that the pods were created. You should see a display similar to the following. It may take a few minutes for all the pods to show their status as “Running.”
% kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 3m4s
prometheus-grafana-5cc9c65675-fmfpk 3/3 Running 0 3m33s
prometheus-kube-prometheus-operator-6654795697-47g9d 1/1 Running 0 3m33s
prometheus-kube-state-metrics-587d44d5fd-tck8t 1/1 Running 0 3m33s
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 3m3s
prometheus-prometheus-node-exporter-5426b 1/1 Running 0 3m33s
prometheus-prometheus-node-exporter-x7m2s 1/1 Running 0 3m33s
By default, the Prometheus server listens on port 9090. For our setup, we will port-forward port 9090 to access the Prometheus UI locally. This command will allow you to go to http://localhost:9090 and access the UI.
$ kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090
Keep this terminal window open as long as you want to access the UI. Port forwarding will end as soon as you close the terminal window.
Remember that this configuration is just for testing. You would expose this service via an ingress point in a production environment.
Example: CPU usage query
Our next step is to query data from the nodes. Navigate to http://localhost:9090 and copy/paste the following query into the text input box. This query measures the average CPU usage per instance over 5 minutes.
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Let’s break down this query piece by piece, so we know what we are looking at:
- node_cpu_seconds_total{mode="idle"} measures the total time in seconds the CPU spends idle.
- rate(...[5m]) calculates the per-second rate of change over a 5-minute window.
- avg by (instance) calculates the average per instance (node).
- ... * 100 converts the number to a percentage.
- 100 - ... subtracts the previous number (the percentage idle) from 100%, giving the percentage used.
This example shows how to track CPU data over time. To get more information, navigate to the Graph or Explain tabs from this window.
Prometheus metrics for the four golden signals
The next step is to build on this demo by creating Prometheus queries that align with the four golden signals. Here are some relevant Prometheus metrics for each signal (example queries built from them follow the list):
- Latency:
  - http_request_duration_seconds_bucket
  - apiserver_request_duration_seconds_bucket
  - istio_request_duration_milliseconds
- Traffic:
  - http_requests_total[5m]
  - container_network_receive_bytes_total
  - apiserver_request_total
- Errors:
  - http_requests_total{status=~"5.."}[5m]
  - http_requests_total[5m]
  - kube_pod_container_status_restarts_total
- Saturation:
  - node_memory_MemTotal_bytes
  - node_memory_MemAvailable_bytes
  - kubelet_volume_stats_used_bytes
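As a starting point, here are example queries built from these metrics. They are sketches to adapt: the error-rate query assumes your application exports http_requests_total with a status label, which not every workload does.
# Latency: 99th-percentile API server request duration over the last 5 minutes
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))
# Traffic: API server requests per second
sum(rate(apiserver_request_total[5m]))
# Errors: fraction of application requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation: how full each persistent volume is (0-1)
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes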
Prometheus collects an overwhelming number of metrics. Visiting http://localhost:9090/api/v1/targets/metadata returns JSON-formatted metadata about all of them, which illustrates just how many there are; you wouldn't browse this output directly to find the metrics you need.
To learn more about the metrics, you can begin typing in the expression box and have it autocomplete based on available metrics, or you can review the exporter’s documentation.
Each exporter (an application that exposes Prometheus metrics) has its own documentation, including Node Exporter, Kube-State-Metrics, cAdvisor (used by the kubelet), and etcd. Once you know the metrics you're interested in, you can use PromQL (Prometheus Query Language) to write queries, create dashboards, and set up alerts based on this data.
Plan how to handle alerts
Once you’ve collected your key metrics, the next step is to decide how to act on them, and this is where alerts come in. The key idea here is that clarity is everything. Each alert should identify four things:
- What is happening
- What action to take
- Who should get the alert
- How should they be notified
The first question to ask is: “Is this alert relevant?” Without answering this question, teams can be flooded with so much noise that important alerts are missed, a phenomenon known as alert fatigue that leads to slow response times and missed incidents.
Another way to look at an alert's relevance is to consider what would happen if the alert condition were missed or persisted for an extended period. Would it go unnoticed, lead to performance degradation, or cause an outright outage?
Alert categorization
To help prioritize alerts, they are often categorized into severity levels, usually SEV1 (highest) to SEV4 (lowest). While the criteria for each level may differ between teams and organizations, the main idea is that SEV1 alerts are critical issues needing immediate action, while SEV4 issues are the lowest priority and may not require immediate action.
The table below describes these categories and provides examples of each. Exact criteria vary between organizations, so treat the examples as illustrative.
Severity | Criteria | Examples | Kubernetes-Specific Examples |
SEV1 | Critical issue causing an outage or data loss; requires immediate response | Production service completely unavailable | API server unreachable; etcd cluster loses quorum |
SEV2 | Major degradation affecting many users; needs urgent attention | Significant slowdowns or partial loss of functionality | A node goes NotReady; pods crash-loop and reduce service capacity |
SEV3 | Minor issue with limited impact; can be handled during business hours | A non-critical feature fails intermittently | A single pod restarts occasionally; resource usage trends toward limits |
SEV4 | Lowest priority or informational; no immediate action required | Cosmetic issues or routine notices | A completed Job is not cleaned up; a certificate expires in 30 days |
You can start by defining alert thresholds, such as when memory exceeds 90% for more than five minutes or when a pod restarts more than three times in five minutes. This process will evolve over time as you learn what normal usage is for your environment.
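As a rough sketch, those two starting thresholds could be written as PromQL conditions like the following; the restart query assumes kube-state-metrics is running, which the kube-prometheus-stack installs by default.
# Node memory usage above 90%
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
# A container restarting more than three times in the last five minutes
increase(kube_pod_container_status_restarts_total[5m]) > 3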
Once an alert fires, the next step is to decide what to do. Every alert should have clear and actionable steps. If there is no action to take, then there is no reason for it to be an alert.
An example of clear steps would be logging on to the server and running a series of predetermined commands to identify which processes use the most memory. Based on what is discovered, processes could be killed, resources scaled up, etc. For Kubernetes specifically, a useful alert would be if a pod restarts multiple times over a short period. This means a pod is trying to restart but has a persistent issue preventing it from running. The response could be to review logs, increase resource limits, or even cordon off the node for deeper investigation.
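For the pod-restart scenario, the predetermined steps might look like this minimal sketch; the pod, namespace, and node names are placeholders.
# Check recent events and the last container state for the pod
kubectl describe pod <pod-name> -n <namespace>
# Review logs from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous
# If the node itself is suspect, stop new pods from scheduling there while investigating
kubectl cordon <node-name>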
The next thing to identify is who should get the alert and how they should be notified. Each alert should have an owner: a team or individual directly responsible for resolving it. For example, while a manager may be accountable for the system's overall health, sending alerts directly to the engineering team produces a quicker response than sending them to the manager, who would only forward them to the engineers anyway.
Finally, consider how the owner should be notified. There are many tools for informing a team or individuals; the key is to decide what works for your situation. The notification could be an email, Slack message, PagerDuty notification, etc.
Example: Pod restarts
Let’s use pod restarts as an example. We can create an alert for pod restarts with a threshold of three times in five minutes. This alert could be configured to send a Slack message to an engineering team and provide instructions for which cluster or pod to log into and what commands to run to troubleshoot.
These practices can be implemented in the Prometheus stack we deployed earlier for Kubernetes. Alerting, as we've described it here, is split across two components: a PrometheusRule defines the alert condition and when it should fire, and Prometheus forwards firing alerts to Alertmanager, which manages who receives them and how.
Let’s see this in action.
Verify that AlertManager is running
If you configured the Prometheus stack using the Helm charts, this should already be set up. The following command confirms that the AlertManager pods are running.
$ kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 36m
Port-forward the Alertmanager UI
Run the following command in another terminal window.
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager -n monitoring 9093
Visit http://localhost:9093 to open the Alertmanager UI and verify that it is accessible.
Create a PrometheusRule
Let's create a PrometheusRule from the command line. Save the following YAML to a file called high-cpu-alert.yaml. It defines a rule that fires when CPU usage exceeds 1%. (We use 1% in our demo to ensure that an alert fires.)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: high-cpu-alert
  namespace: monitoring # Replace if you installed Prometheus in a different namespace
  labels:
    release: prometheus # This must match your Helm release name
spec:
  groups:
    - name: node.rules
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 1
          for: 2m
          labels:
            severity: warning
            severity_level: sev2
          annotations:
            summary: "High CPU usage on instance {{ $labels.instance }}"
            description: "CPU usage is above 1% for more than 2 minutes on {{ $labels.instance }}. Run top to investigate."
One thing to note in this PrometheusRule is that we apply two labels: severity: warning and severity_level: sev2. These labels matter because alerts can be filtered and routed based on them. This is where you might tie in the SEV1-SEV4 categorizations, along with anything else that makes sense for your environment.
Apply and Verify the PrometheusRule
$ kubectl apply -f high-cpu-alert.yaml
$ kubectl get prometheusrules -n monitoring high-cpu-alert -o yaml
Visit http://localhost:9093 and type severity_level="sev2" into the filter field. The alerts may take a couple of minutes to appear, so refresh if nothing shows up at first. You should then see only the alerts matching that label.
Configure receivers
Now that alerts are firing, it’s time to configure how they’re delivered. Receivers in Alertmanager handle this. A receiver defines where alerts should go, whether that is an email address, a Slack channel, or a custom webhook. In this section, we focus on setting up an email receiver so critical alerts are delivered straight to your inbox, enabling you to take action quickly and effectively.
Since we installed the Prometheus stack using Helm, we will configure Alertmanager by updating the deployment through a custom-values.yaml file. This file defines the email receiver settings. Once configured, Helm will apply these custom values during a redeployment.
Start by saving the following file as custom-values.yaml. Replace the SMTP settings with those from your email provider; this could be Gmail or another SMTP provider. Finally, replace the sev2-team@example.com address with the email where the alerts should go.
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'example-password'
      smtp_require_tls: true
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default-email'
      routes:
        - matchers:
            - severity_level="sev2"
          receiver: 'sev2-email'
    receivers:
      - name: 'default-email' # catch-all receiver referenced by the top-level route
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true
      - name: 'sev2-email'
        email_configs:
          - to: 'sev2-team@example.com'
            send_resolved: true
            headers:
              Subject: '[ALERT] {{ .CommonLabels.alertname }} - {{ .Status }}'
            text: |
              Alert: {{ .CommonLabels.alertname }}
              Status: {{ .Status }}
              Severity: {{ .CommonLabels.severity_level }}
              Summary: {{ .CommonAnnotations.summary }}
              Description: {{ .CommonAnnotations.description }}
              Affected instance(s):
              {{ range .Alerts }}
              - {{ .Labels.instance }} ({{ .Labels.job }})
              {{ end }}
    inhibit_rules:
      - source_matchers:
          - 'severity_level="sev1"' # a firing SEV1 alert suppresses matching SEV2 alerts
        target_matchers:
          - 'severity_level="sev2"'
        equal:
          - 'alertname'
The table below breaks down the major sections of the configuration.
Section | Purpose |
global | Defines global settings like SMTP server details used across all alerts. |
route | Routing tree for alerts, including grouping, timing, and default receiver. |
routes | Nested under route, defines sub-routes to match and direct specific alerts to different receivers. |
receivers | Defines where alerts are sent: email, Slack, webhook, etc. |
inhibit_rules | Suppresses alerts based on the presence of higher-severity alerts with matching labels. |
Upgrade the Prometheus Helm install
Use Helm’s upgrade operation and the custom-values.yaml file to apply the new configuration.
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
-n monitoring -f custom-values.yaml
Restart the pods
Load the new configuration by restarting the AlertManager stateful set.
kubectl rollout restart statefulset \
  alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring
With all this in place, Alertmanager will now route alerts to your inbox.
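Email is just one option. If your team prefers Slack, you could add a receiver like the following sketch to the receivers list in custom-values.yaml and point the sev2 route at it instead; the webhook URL and channel name are placeholders for a Slack incoming webhook you would create yourself.
      - name: 'sev2-slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/T000/B000/XXXX' # placeholder incoming-webhook URL
            channel: '#k8s-alerts'
            send_resolved: true
            title: '[ALERT] {{ .CommonLabels.alertname }} - {{ .Status }}'
            text: '{{ .CommonAnnotations.summary }}'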
Monitor trends with dashboards
Alerts are firing and being delivered, so next, we’ll focus on visualizing trends by building a dashboard. Visualizing metrics over time is one of the best ways to learn and understand your systems’ health and behavior. While alerts identify specific issues, dashboards give you the bigger picture. They highlight gradual changes, larger patterns, or slow-burning problems that alerting alone might not make evident.
As mentioned before, setting the right alert threshold takes time and requires historical context to evolve. Dashboards make this process easier. When metrics trend over hours, days, or weeks, it’s much easier to recognize that something is starting to shift out of the norm. For example, if memory usage on a node increases steadily, that might not trigger an alert immediately, but it’s still a trend worth investigating before it becomes critical.
Dashboards are highly customizable. Different teams can create views tailored to the metrics they care about most, allowing everyone to have quick access to the most relevant and actionable information.
In our Kubernetes setup, we use Grafana to create these dashboards. Grafana should already be installed as part of the Prometheus stack if you’ve followed the steps above.
Example: Creating a custom usage monitoring dashboard
Now we’ll walk through creating a custom dashboard to graph CPU and memory usage for each node in the cluster.
Configure port forwarding to access the UI
As before, we need to configure port forwarding to access the Grafana UI.
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80 &
Visit http://localhost:3000 to view the Grafana UI.
For the kube-prometheus stack, use the default username and password of admin and prom-operator.
Create the first dashboard
From our configuration, Prometheus should be configured as a data source, so we can immediately create our first dashboard.
Add a visualization: as this is a new dashboard, click on Add visualization and select Prometheus as your data source.
In the left-hand panel, where it prompts for a PromQL query, enter the following query to graph CPU usage. This measures the amount of time the CPU is idle and subtracts it from 100, giving us CPU active time.
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Once you enter the query, click Run Queries.
In the right-hand panel, you can give it a title such as “CPU active by node.” Further down, you can adjust the visual thresholds and show them on the graph.
You can repeat these steps and graph memory usage for each node using the following query.
(1 - (avg by (instance) (node_memory_MemAvailable_bytes) /
avg by (instance) (node_memory_MemTotal_bytes))) * 100
With both visualizations in place, you have a simple dashboard showing CPU and memory usage for each node.
Grafana alone offers much to learn and explore. These two queries can serve as a starting point.
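Dashboards can also be managed as code. As a rough sketch, the kube-prometheus-stack's Grafana sidecar (enabled by default in recent chart versions) loads dashboards from ConfigMaps carrying the grafana_dashboard label; the minimal dashboard JSON below is a hypothetical example that reuses the CPU query.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-cpu-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # label the Grafana sidecar watches for
data:
  node-cpu.json: |
    {
      "title": "CPU active by node (provisioned)",
      "schemaVersion": 39,
      "panels": [
        {
          "type": "timeseries",
          "title": "CPU active by node",
          "gridPos": { "x": 0, "y": 0, "w": 24, "h": 8 },
          "targets": [
            { "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)" }
          ]
        }
      ]
    }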
Establish data retention policies
As you continue to collect data, the amount of space you need also increases. Every second, Prometheus scrapes metrics from nodes, pods, and applications, generating millions of time series data points over time. Without a plan or policy for data retention, this data can quickly outgrow your storage, degrade performance, and complicate troubleshooting efforts.
Establishing a clear data retention policy helps strike a balance between insight and efficiency. On the one hand, you need enough historical data to analyze trends, investigate incidents, and generate meaningful reports. On the other, retaining too much high-resolution data for too long drives up storage costs, especially in cloud environments. It can also negatively impact Prometheus’ startup time and query performance.
Retention policies also help enforce compliance with GDPR, HIPAA, and SOC 2 regulations, which may require strict data handling and deletion timelines. Beyond compliance, good policies help reduce the risk of long-term data corruption and allow teams to forecast and control infrastructure needs more accurately.
Another benefit of retention policies is the opportunity to optimize how data is stored. For example, Prometheus recording rules can be used to downsample metrics, transforming high-frequency data into lower-resolution summaries (e.g., 5-minute averages from 1-minute samples). This approach retains long-term insight while reducing the need for so much storage.
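As a sketch of that idea, a recording rule like the following precomputes the 5-minute CPU average used earlier, so dashboards and long-range queries can read a single, cheaper series instead of recomputing it from raw samples.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: downsample-rules
  namespace: monitoring
  labels:
    release: prometheus # must match your Helm release name, as before
spec:
  groups:
    - name: downsample.rules
      rules:
        - record: instance:node_cpu_used_percent:avg5m
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)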
In the next section, we’ll walk through how to define and apply these retention settings in your Kubernetes monitoring stack.
We can use the steps from earlier to update our custom values for Prometheus using Helm. Add the following to the custom-values.yaml file we have been working with, keeping the Alertmanager settings in place so a later upgrade doesn't drop them.
prometheus:
  prometheusSpec:
    retention: 3d
    retentionSize: 1GB
As before, go ahead and update the values, redeploy the Helm chart for kube-prometheus-stack, and restart the rollout for the pods.
$ helm upgrade prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring -f custom-values.yaml
$ kubectl rollout restart statefulset prometheus-prometheus-kube-prometheus-prometheus -n monitoring
In a larger production environment, or when you need to scale your long-term data retention policies, consider looking at tools like Thanos. Thanos extends Prometheus by enabling long-term storage and metric downsampling. It can also grow with your environment, with high-availability configurations for larger clusters.
Integrate logging alongside metrics
To round out our best practices, application logging is the last piece to put in place.
Metrics provide a high-level view of what is happening in your cluster, e.g., CPU usage, memory usage, or pod restarts. Logs help uncover the "why": they provide the context needed to understand what your applications were doing when an issue occurred.
For example, metrics might show a pod failing to start, but logs will show whether it is due to a misconfiguration or a runtime error. Together, metrics and logs offer a complete picture.
This is where Loki, a log aggregation system developed by Grafana Labs, fits in. Loki is designed to work seamlessly with Prometheus and Grafana. It uses the same label-based approach and indexes only metadata (labels), not the full log content, making it efficient and cost-effective for Kubernetes environments.
Setting up Loki and integrating with Grafana
In this section, we’ll walk through setting up Loki in your cluster and integrating it into Grafana, so you can search logs, correlate them with metrics, and gain even deeper insight.
Install the Loki Stack (Loki + Promtail)
We will add a second Helm repo, this time for Grafana.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
You should now have two Helm repos.
helm repo list
NAME URL
prometheus-community https://prometheus-community.github.io/helm-charts
grafana https://grafana.github.io/helm-charts
Install loki-stack. This will install Loki (the logging backend) and Promtail, the log collector.
helm upgrade --install loki grafana/loki-stack \
--namespace monitoring \
--create-namespace \
--set loki.enabled=true \
--set promtail.enabled=true
Verify that the Loki pods are running
Run the following command to confirm that the pods are running. They may take a few minutes to come up.
kubectl get pods -n monitoring | grep loki
loki-0 1/1 Running 0 53m
loki-promtail-kg7k2 1/1 Running 0 53m
loki-promtail-tmtg9 1/1 Running 0 53m
Configure Loki as a data source and redeploy Prometheus
Add the following content to the custom-values.yaml file we have been working on.
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      url: http://loki:3100
      access: proxy
      isDefault: false
      jsonData:
        maxLines: 1000
Run the following command to redeploy the Prometheus stack with the updated values.
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
-n monitoring \
-f custom-values.yaml
Example: Viewing logs collected by Promtail
Logging into our Grafana instance from before, we should now see Loki as a preconfigured data source. For our demo, we will look at logs collected from the kube-system namespace that contain the word "error".
From the left-hand panel, click on Explore:
Enter the following query into the query field and click on Run Query:
{namespace="kube-system"} |= "error"
Logs collected from the nodes will be populated under the logs section. From here, you can begin to explore with the query language and visualization options.
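As one example of where to go next, LogQL can also turn logs into metrics. The query below, a small sketch, charts the per-second rate of error lines in kube-system over five-minute windows, which you could graph or even alert on.
sum by (namespace) (rate({namespace="kube-system"} |= "error" [5m]))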
In our setup, Loki connects what happened with why it happened. Because the tools are integrated, you can explore logs right in Grafana without switching contexts, and having logs and metrics side by side enables faster, more informed troubleshooting.
Bringing it All Together
As you can see, monitoring a Kubernetes environment is not a small or quick task. The right strategy and tools make it more manageable and effective. By stacking open-source tools like Prometheus, Grafana, and Loki, you can build a powerful monitoring stack that meets and grows with your cluster’s needs.
In this article, we started by identifying key metrics using Google’s four golden signals as a guide. From there, we defined clear actions for when metrics exceed thresholds. Alerting via Slack, email, or triggering an automated response ensures that your monitoring is informative and actionable.
Dashboards offer visibility, and Loki offers context. Together, they enable faster debugging and clearer insights.
Kubernetes offers flexibility, and so does the monitoring stack. While Prometheus is the default for scraping metrics, other tools like Vector and Thanos can provide different approaches depending on your environment.
Following these best practices and adapting the examples here will give you a framework for building, extending, and evolving a monitoring setup that grows with your cluster and your team.
