Application and infrastructure monitoring is key to your organization’s reliability and uptime—you want to detect and resolve issues early to minimize downtime and performance issues. Over the years, several monitoring best practices have been developed as a basic blueprint to help inform this process.
When organizations fail to implement monitoring best practices, the result is often poor resource utilization, a diminished user experience, and longer resolution times for troubleshooting efforts. Because Kubernetes is distributed and often ephemeral, establishing good monitoring practices is even more critical to provide insight into resource utilization, performance issues, and potential outages.
This article focuses on the best practices for monitoring Kubernetes. We review five monitoring best practices and then look at examples of implementing them using Kubernetes-related tools. This is part of a multi-pronged approach to getting the best overall view of your environment.
Summary of essential Kubernetes monitoring best practices
Best practice | Why it’s important | How to put it into practice in Kubernetes |
Identify and define key metrics | Defining key metrics helps you separate the signal from the noise, so you can identify when processes stray from the norm. | Use Prometheus to collect key metrics, including those related to CPU, memory, network, and disk I/O. |
Plan how to handle alerts | Alerts can become noise when they are not planned for, so it’s important to know who should receive them and what should happen when they are received. | Implement Prometheus alerting rules to integrate with Alertmanager for proper alert delivery. |
Monitor trends with dashboards | Visual data is more straightforward to interpret and gives quick insights for diagnosing problems. | You can create dashboards in Grafana for real-time metric monitoring across clusters, nodes, and pods. |
Establish data retention policies | Balance storage costs with data accessibility to keep relevant data for troubleshooting and analysis. | Configure Prometheus or other monitoring solutions, such as Thanos, with appropriate data retention settings. Archive older data when necessary. |
Integrate logging alongside metrics | Metrics alone may not provide a complete picture when debugging complex issues. Logs and traces help identify and troubleshoot specific events and microservice interactions. | Implement centralized logging (e.g., Loki) to supplement metrics and give more detailed insights across services and applications. |
Environment setup
In this article, we use MiniKube, a local distribution of Kubernetes, for demonstrations and install our monitoring stack using Helm.
Prerequisites for the demos: Docker (used as the MiniKube driver), MiniKube, kubectl, and Helm.
Once we have everything installed, we use a two-node MiniKube setup. Assuming you have MiniKube configured, you can run the following to set up a two-node environment using Docker as the driver.
minikube start --nodes=2 --cpus=2 --memory=4g --driver=docker
Verify that MiniKube is running using the following commands.
$ minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
minikube-m02
type: Worker
host: Running
kubelet: Running
$ kubectl get nodes
NAME           STATUS   ROLES           AGE   VERSION
minikube       Ready    control-plane   28d   v1.31.0
minikube-m02   Ready    <none>          28d   v1.31.0
You can use the version sub-command to verify that Helm is installed.
$ helm version
version.BuildInfo{Version:"v3.16.4", GitCommit:"7877b45b63f95635153b29a42c0c2f4273ec45ca",
GitTreeState:"dirty", GoVersion:"go1.23.4"}
Now MiniKube and Helm are installed and verified.

Identify and define key metrics
Key metrics offer the most insight into the health and performance of your Kubernetes environment. They are the metrics that matter. They often include node metrics such as CPU, memory, and disk I/O usage, and Kubernetes-specific metrics such as pod restarts, availability, and etcd performance.
Defining key metrics can be very specific to an application or organization. If you are just starting, Google has outlined four golden signals that can guide defining key metrics:
- Latency measures the time it takes to process requests—the system’s responsiveness to incoming requests. In Kubernetes, this could be measured by how long it takes for the API to respond to requests or for pods to spin up.
- Traffic measures the volume of requests entering the system. In Kubernetes, this could mean measuring traffic entering the cluster or between pods.
- Errors in Kubernetes environments could include 5xx errors from the application, pods failing to start up, or the application continuously restarting, which would indicate instability.
- Saturation indicates resource usage near capacity, such as disk I/O or CPU/memory utilization being near limits. In Kubernetes, one example would be to see if the disk IOPS (the measurement of how many reads/writes per second a disk can do) is sufficient for the environment. This is especially critical for the disk where the etcd database runs.
These golden signals provide a comprehensive framework for monitoring your Kubernetes environment.
Install Prometheus
Let’s look at a practical example by installing the Prometheus stack and examining saturation metrics, particularly CPU usage. The Prometheus stack includes Prometheus for metrics collection, Alertmanager for alerting, and Grafana for visualization—all essential tools for implementing effective Kubernetes monitoring.
Start by adding the Prometheus Helm repository.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Next, install the Prometheus stack using Helm.
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
Now verify that the pods were created. You should see a display similar to the following. It may take a few minutes for all the pods to show their status as “Running.”
% kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 3m4s
prometheus-grafana-5cc9c65675-fmfpk 3/3 Running 0 3m33s
prometheus-kube-prometheus-operator-6654795697-47g9d 1/1 Running 0 3m33s
prometheus-kube-state-metrics-587d44d5fd-tck8t 1/1 Running 0 3m33s
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 3m3s
prometheus-prometheus-node-exporter-5426b 1/1 Running 0 3m33s
prometheus-prometheus-node-exporter-x7m2s 1/1 Running 0 3m33s
By default, the Prometheus server listens on port 9090. For our setup, we will port-forward port 9090 to access the Prometheus UI locally. This command will allow you to go to http://localhost:9090 and access the UI.
$ kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090
Keep this terminal window open as long as you want to access the UI. Port forwarding will end as soon as you close the terminal window.
Remember that this configuration is just for testing. You would expose this service via an ingress point in a production environment.
Example: CPU usage query
Our next step is to query data from the nodes. Navigate to http://localhost:9090 and copy/paste the following query into the text input box. This query measures the average CPU usage per instance over 5 minutes.
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Let’s break down this query piece by piece, so we know what we are looking at:
- node_cpu_seconds_total{mode="idle"} measures the total time in seconds the CPU spends idle.
- rate(...[5m]) calculates the per-second rate of change over a 5-minute window.
- avg by (instance) calculates the average per instance (node).
- ... * 100 converts the number to a percentage.
- 100 - ... subtracts the previous number (the percentage idle) from 100%, giving the percentage used.
This example shows how to track CPU data over time. To get more information, navigate to the Graph or Explain tabs from this window.
Prometheus metrics for the four golden signals
The next step is to build on this demo by creating Prometheus queries that align with the four golden signals. Here are some relevant Prometheus metrics for each signal (example queries built from them follow the list):
- Latency:
  - http_request_duration_seconds_bucket
  - apiserver_request_duration_seconds_bucket
  - istio_request_duration_milliseconds
- Traffic:
  - http_requests_total[5m]
  - container_network_receive_bytes_total
  - apiserver_request_total
- Errors:
  - http_requests_total{status=~"5.."}[5m]
  - http_requests_total[5m]
  - kube_pod_container_status_restarts_total
- Saturation:
  - node_memory_MemTotal_bytes
  - node_memory_MemAvailable_bytes
  - kubelet_volume_stats_used_bytes
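As a starting point, here are example queries built from these metrics. They are sketches to adapt: the error-rate query assumes your application exports http_requests_total with a status label, which not every workload does.
# Latency: 99th-percentile API server request duration over the last 5 minutes
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))
# Traffic: API server requests per second
sum(rate(apiserver_request_total[5m]))
# Errors: fraction of application requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation: how full each persistent volume is (0-1)
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes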
Prometheus collects an overwhelming number of metrics. Visiting http://localhost:9090/api/v1/targets/metadata returns JSON-formatted metadata about all of them, which illustrates just how many there are; you wouldn't browse this output directly to find the metrics you need.
To learn more about the metrics, you can begin typing in the expression box and have it autocomplete based on available metrics, or you can review the exporter’s documentation.
Each exporter (an application that exposes Prometheus metrics) has its own documentation, including Node Exporter, Kube-State-Metrics, cAdvisor (used by the kubelet), and etcd. Once you know the metrics you're interested in, you can use PromQL (Prometheus Query Language) to write queries, create dashboards, and set up alerts based on this data.
Plan how to handle alerts
Once you’ve collected your key metrics, the next step is to decide how to act on them, and this is where alerts come in. The key idea here is that clarity is everything. Each alert should identify four things:
- What is happening
- What action to take
- Who should get the alert
- How should they be notified
The first question to ask is: “Is this alert relevant?” Without answering this question, teams can be flooded with so much noise that important alerts are missed, a phenomenon known as alert fatigue that leads to slow response times and missed incidents.
Another way to look at an alert's relevance is to consider what would happen if the alert condition were missed or persisted for an extended period. Would it go unnoticed, lead to performance degradation, or cause an outright outage?
Alert categorization
To help prioritize alerts, they are often categorized into severity levels, usually SEV1 (highest) to SEV4 (lowest). While the criteria for each level may differ between teams and organizations, the main idea is that SEV1 alerts are critical issues needing immediate action, while SEV4 issues are the lowest priority and may not require immediate action.
The table below describes these categories and provides examples of each. Exact criteria vary between organizations, so treat the examples as illustrative.
Severity | Criteria | Examples | Kubernetes-Specific Examples |
SEV1 | Critical issue causing an outage or data loss; requires immediate response | Production service completely unavailable | API server unreachable; etcd cluster loses quorum |
SEV2 | Major degradation affecting many users; needs urgent attention | Significant slowdowns or partial loss of functionality | A node goes NotReady; pods crash-loop and reduce service capacity |
SEV3 | Minor issue with limited impact; can be handled during business hours | A non-critical feature fails intermittently | A single pod restarts occasionally; resource usage trends toward limits |
SEV4 | Lowest priority or informational; no immediate action required | Cosmetic issues or routine notices | A completed Job is not cleaned up; a certificate expires in 30 days |
You can start by defining alert thresholds, such as when memory exceeds 90% for more than five minutes or when a pod restarts more than three times in five minutes. This process will evolve over time as you learn what normal usage is for your environment.
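As a rough sketch, those two starting thresholds could be written as PromQL conditions like the following; the restart query assumes kube-state-metrics is running, which the kube-prometheus-stack installs by default.
# Node memory usage above 90%
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
# A container restarting more than three times in the last five minutes
increase(kube_pod_container_status_restarts_total[5m]) > 3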
Once an alert fires, the next step is to decide what to do. Every alert should have clear and actionable steps. If there is no action to take, then there is no reason for it to be an alert.
An example of clear steps would be logging on to the server and running a series of predetermined commands to identify which processes use the most memory. Based on what is discovered, processes could be killed, resources scaled up, etc. For Kubernetes specifically, a useful alert would be if a pod restarts multiple times over a short period. This means a pod is trying to restart but has a persistent issue preventing it from running. The response could be to review logs, increase resource limits, or even cordon off the node for deeper investigation.
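For the pod-restart scenario, the predetermined steps might look like this minimal sketch; the pod, namespace, and node names are placeholders.
# Check recent events and the last container state for the pod
kubectl describe pod <pod-name> -n <namespace>
# Review logs from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous
# If the node itself is suspect, stop new pods from scheduling there while investigating
kubectl cordon <node-name>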
The next thing to identify is who should get the alert and how they should be notified. Each alert should have an owner: a team or individual directly responsible for resolving it. For example, while a manager may be accountable for the system's overall health, sending alerts directly to the engineering team produces a quicker response than sending them to the manager, who would only forward them to the engineers anyway.
Finally, consider how the owner should be notified. There are many tools for informing a team or individuals; the key is to decide what works for your situation. The notification could be an email, Slack message, PagerDuty notification, etc.
Example: Pod restarts
Let’s use pod restarts as an example. We can create an alert for pod restarts with a threshold of three times in five minutes. This alert could be configured to send a Slack message to an engineering team and provide instructions for which cluster or pod to log into and what commands to run to troubleshoot.
These practices can be implemented in the Prometheus stack we deployed earlier for Kubernetes. Alerting, as we've described it here, is split across two components: a PrometheusRule defines the alert condition and when it should fire, and Prometheus forwards firing alerts to Alertmanager, which manages who receives them and how.
Let’s see this in action.
Verify that AlertManager is running
If you configured the Prometheus stack using the Helm charts, this should already be set up. The following command confirms that the AlertManager pods are running.
$ kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 36m
Port-forward the Alertmanager UI
Run the following command in another terminal window.
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager -n monitoring 9093
Visit http://localhost:9093 to open the Alertmanager UI and verify that it is accessible.
Create a PrometheusRule
Let's create a PrometheusRule from the command line. Save the following YAML to a file called high-cpu-alert.yaml. It defines a rule that fires when CPU usage exceeds 1%. (We use 1% in our demo to ensure that an alert fires.)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: high-cpu-alert
  namespace: monitoring # Replace if you installed Prometheus in a different namespace
  labels:
    release: prometheus # This must match your Helm release name
spec:
  groups:
    - name: node.rules
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 1
          for: 2m
          labels:
            severity: warning
            severity_level: sev2
          annotations:
            summary: "High CPU usage on instance {{ $labels.instance }}"
            description: "CPU usage is above 1% for more than 2 minutes on {{ $labels.instance }}. Run top to investigate."
One thing to note in this PrometheusRule is that we apply two labels: severity: warning and severity_level: sev2. These labels matter because alerts can be filtered and routed based on them. This is where you might tie in the SEV1-SEV4 categorizations, along with anything else that makes sense for your environment.
Apply and Verify the PrometheusRule
$ kubectl apply -f high-cpu-alert.yaml
$ kubectl get prometheusrules -n monitoring high-cpu-alert -o yaml
Visit http://localhost:9093 and type severity_level="sev2" into the filter field. The alerts may take a couple of minutes to appear, so refresh if nothing shows up at first. You should then see only the alerts matching that label.
Configure receivers
Now that alerts are firing, it’s time to configure how they’re delivered. Receivers in Alertmanager handle this. A receiver defines where alerts should go, whether that is an email address, a Slack channel, or a custom webhook. In this section, we focus on setting up an email receiver so critical alerts are delivered straight to your inbox, enabling you to take action quickly and effectively.
Since we installed the Prometheus stack using Helm, we will configure Alertmanager by updating the deployment through a custom-values.yaml file. This file defines the email receiver settings. Once configured, Helm will apply these custom values during a redeployment.
Start by saving the following file as custom-values.yaml. Replace the SMTP settings with those from your email provider; this could be Gmail or another SMTP provider. Finally, replace the sev2-team@example.com address with the email where the alerts should go.
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'example-password'
      smtp_require_tls: true
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default-email'
      routes:
        - matchers:
            - severity_level="sev2"
          receiver: 'sev2-email'
    receivers:
      - name: 'default-email' # catch-all receiver referenced by the top-level route
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true
      - name: 'sev2-email'
        email_configs:
          - to: 'sev2-team@example.com'
            send_resolved: true
            headers:
              Subject: '[ALERT] {{ .CommonLabels.alertname }} - {{ .Status }}'
            text: |
              Alert: {{ .CommonLabels.alertname }}
              Status: {{ .Status }}
              Severity: {{ .CommonLabels.severity_level }}
              Summary: {{ .CommonAnnotations.summary }}
              Description: {{ .CommonAnnotations.description }}
              Affected instance(s):
              {{ range .Alerts }}
              - {{ .Labels.instance }} ({{ .Labels.job }})
              {{ end }}
    inhibit_rules:
      - source_matchers:
          - 'severity_level="sev1"' # a firing SEV1 alert suppresses matching SEV2 alerts
        target_matchers:
          - 'severity_level="sev2"'
        equal:
          - 'alertname'
The table below breaks down the major sections of the configuration.
Section | Purpose |
global | Defines global settings like SMTP server details used across all alerts. |
route | Routing tree for alerts, including grouping, timing, and default receiver. |
routes | Nested under route, defines sub-routes to match and direct specific alerts to different receivers. |
receivers | Defines where alerts are sent: email, Slack, webhook, etc. |
inhibit_rules | Suppresses alerts based on the presence of higher-severity alerts with matching labels. |
Upgrade the Prometheus Helm install
Use Helm’s upgrade operation and the custom-values.yaml file to apply the new configuration.
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
-n monitoring -f custom-values.yaml
Restart the pods
Load the new configuration by restarting the AlertManager stateful set.
kubectl rollout restart statefulset \
  alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring
With all this in place, Alertmanager will now route alerts to your inbox.
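Email is just one option. If your team prefers Slack, you could add a receiver like the following sketch to the receivers list in custom-values.yaml and point the sev2 route at it instead; the webhook URL and channel name are placeholders for a Slack incoming webhook you would create yourself.
      - name: 'sev2-slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/T000/B000/XXXX' # placeholder incoming-webhook URL
            channel: '#k8s-alerts'
            send_resolved: true
            title: '[ALERT] {{ .CommonLabels.alertname }} - {{ .Status }}'
            text: '{{ .CommonAnnotations.summary }}'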
Monitor trends with dashboards
Alerts are firing and being delivered, so next, we’ll focus on visualizing trends by building a dashboard. Visualizing metrics over time is one of the best ways to learn and understand your systems’ health and behavior. While alerts identify specific issues, dashboards give you the bigger picture. They highlight gradual changes, larger patterns, or slow-burning problems that alerting alone might not make evident.
As mentioned before, setting the right alert threshold takes time and requires historical context to evolve. Dashboards make this process easier. When metrics trend over hours, days, or weeks, it’s much easier to recognize that something is starting to shift out of the norm. For example, if memory usage on a node increases steadily, that might not trigger an alert immediately, but it’s still a trend worth investigating before it becomes critical.
Dashboards are highly customizable. Different teams can create views tailored to the metrics they care about most, allowing everyone to have quick access to the most relevant and actionable information.
In our Kubernetes setup, we use Grafana to create these dashboards. Grafana should already be installed as part of the Prometheus stack if you’ve followed the steps above.
Example: Creating a custom usage monitoring dashboard
Now we’ll walk through creating a custom dashboard to graph CPU and memory usage for each node in the cluster.
Configure port forwarding to access the UI
As before, we need to configure port forwarding to access the Grafana UI.
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80 &
Visit http://localhost:3000 to view the Grafana UI.
For the kube-prometheus stack, use the default username and password of admin and prom-operator.
Create the first dashboard
From our configuration, Prometheus should be configured as a data source, so we can immediately create our first dashboard.
Add a visualization: as this is a new dashboard, click on Add visualization and select Prometheus as your data source.
In the left-hand panel, where it prompts for a PromQL query, enter the following query to graph CPU usage. This measures the amount of time the CPU is idle and subtracts it from 100, giving us CPU active time.
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Once you enter the query, click Run Queries.
In the right-hand panel, you can give it a title such as “CPU active by node.” Further down, you can adjust the visual thresholds and show them on the graph.
You can repeat these steps and graph memory usage for each node using the following query.
(1 - (avg by (instance) (node_memory_MemAvailable_bytes) /
avg by (instance) (node_memory_MemTotal_bytes))) * 100
With both visualizations in place, you have a simple dashboard showing CPU and memory usage for each node.
Grafana alone offers much to learn and explore. These two queries can serve as a starting point.
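Dashboards can also be managed as code. As a rough sketch, the kube-prometheus-stack's Grafana sidecar (enabled by default in recent chart versions) loads dashboards from ConfigMaps carrying the grafana_dashboard label; the minimal dashboard JSON below is a hypothetical example that reuses the CPU query.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-cpu-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # label the Grafana sidecar watches for
data:
  node-cpu.json: |
    {
      "title": "CPU active by node (provisioned)",
      "schemaVersion": 39,
      "panels": [
        {
          "type": "timeseries",
          "title": "CPU active by node",
          "gridPos": { "x": 0, "y": 0, "w": 24, "h": 8 },
          "targets": [
            { "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)" }
          ]
        }
      ]
    }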
Establish data retention policies
As you continue to collect data, the amount of space you need also increases. Every second, Prometheus scrapes metrics from nodes, pods, and applications, generating millions of time series data points over time. Without a plan or policy for data retention, this data can quickly outgrow your storage, degrade performance, and complicate troubleshooting efforts.
Establishing a clear data retention policy helps strike a balance between insight and efficiency. On the one hand, you need enough historical data to analyze trends, investigate incidents, and generate meaningful reports. On the other, retaining too much high-resolution data for too long drives up storage costs, especially in cloud environments. It can also negatively impact Prometheus’ startup time and query performance.
Retention policies also help enforce compliance with GDPR, HIPAA, and SOC 2 regulations, which may require strict data handling and deletion timelines. Beyond compliance, good policies help reduce the risk of long-term data corruption and allow teams to forecast and control infrastructure needs more accurately.
Another benefit of retention policies is the opportunity to optimize how data is stored. For example, Prometheus recording rules can be used to downsample metrics, transforming high-frequency data into lower-resolution summaries (e.g., 5-minute averages from 1-minute samples). This approach retains long-term insight while reducing the need for so much storage.
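As a sketch of that idea, a recording rule like the following precomputes the 5-minute CPU average used earlier, so dashboards and long-range queries can read a single, cheaper series instead of recomputing it from raw samples.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: downsample-rules
  namespace: monitoring
  labels:
    release: prometheus # must match your Helm release name, as before
spec:
  groups:
    - name: downsample.rules
      rules:
        - record: instance:node_cpu_used_percent:avg5m
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)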
In the next section, we’ll walk through how to define and apply these retention settings in your Kubernetes monitoring stack.
We can use the steps from earlier to update our custom values for Prometheus using Helm. Add the following to the custom-values.yaml file we have been working with, keeping the Alertmanager settings in place so a later upgrade doesn't drop them.
prometheus:
  prometheusSpec:
    retention: 3d
    retentionSize: 1GB
As before, go ahead and update the values, redeploy the Helm chart for kube-prometheus-stack, and restart the rollout for the pods.
$ helm upgrade prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring -f custom-values.yaml
$ kubectl rollout restart statefulset prometheus-prometheus-kube-prometheus-prometheus -n monitoring
In a larger production environment, or when you need to scale your long-term data retention policies, consider looking at tools like Thanos. Thanos extends Prometheus by enabling long-term storage and metric downsampling. It can also grow with your environment, with high-availability configurations for larger clusters.
Integrate logging alongside metrics
To round out our best practices, application logging is the last piece to put in place.
Metrics provide a high-level view of what is happening in your cluster, e.g., CPU usage, memory usage, or pod restarts. Logs help uncover the "why": they provide the context needed to understand what your applications were doing when an issue occurred.
For example, metrics might show a pod failing to start, but logs will show whether it is due to a misconfiguration or a runtime error. Together, metrics and logs offer a complete picture.
This is where Loki, a log aggregation system developed by Grafana Labs, fits in. Loki is designed to work seamlessly with Prometheus and Grafana. It uses the same label-based approach and indexes only metadata (labels), not the full log content, making it efficient and cost-effective for Kubernetes environments.
Setting up Loki and integrating with Grafana
In this section, we’ll walk through setting up Loki in your cluster and integrating it into Grafana, so you can search logs, correlate them with metrics, and gain even deeper insight.
Install the Loki Stack (Loki + Promtail)
We will add a second Helm repo, this time for Grafana.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
You should now have two Helm repos.
helm repo list
NAME URL
prometheus-community https://prometheus-community.github.io/helm-charts
grafana https://grafana.github.io/helm-charts
Install loki-stack. This will install Loki (the logging backend) and Promtail, the log collector.
helm upgrade --install loki grafana/loki-stack \
--namespace monitoring \
--create-namespace \
--set loki.enabled=true \
--set promtail.enabled=true
Verify that the Loki pods are running
Run the following command to confirm that the pods are running. They may take a few minutes to come up.
kubectl get pods -n monitoring | grep loki
loki-0 1/1 Running 0 53m
loki-promtail-kg7k2 1/1 Running 0 53m
loki-promtail-tmtg9 1/1 Running 0 53m
Configure Loki as a data source and redeploy Prometheus
Add the following content to the custom-values.yaml file we have been working on.
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      url: http://loki:3100
      access: proxy
      isDefault: false
      jsonData:
        maxLines: 1000
Run the following command to redeploy the Prometheus stack with the updated values.
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
-n monitoring \
-f custom-values.yaml
Example: Viewing logs collected by Promtail
Logging into our Grafana instance from before, we should now see Loki as a preconfigured data source. For our demo, we will look at logs collected from the kube-system namespace that contain the word "error".
From the left-hand panel, click on Explore:
Enter the following query into the query field and click on Run Query:
{namespace="kube-system"} |= "error"
Logs collected from the nodes will be populated under the logs section. From here, you can begin to explore with the query language and visualization options.
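As one example of where to go next, LogQL can also turn logs into metrics. The query below, a small sketch, charts the per-second rate of error lines in kube-system over five-minute windows, which you could graph or even alert on.
sum by (namespace) (rate({namespace="kube-system"} |= "error" [5m]))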
In our setup, Loki connects what happened with why it happened. Because the tools are integrated, you can explore logs right in Grafana without switching contexts, and having logs and metrics side by side enables faster, more informed troubleshooting.
Bringing it All Together
As you can see, monitoring a Kubernetes environment is not a small or quick task. The right strategy and tools make it more manageable and effective. By stacking open-source tools like Prometheus, Grafana, and Loki, you can build a powerful monitoring stack that meets and grows with your cluster’s needs.
In this article, we started by identifying key metrics using Google’s four golden signals as a guide. From there, we defined clear actions for when metrics exceed thresholds. Alerting via Slack, email, or triggering an automated response ensures that your monitoring is informative and actionable.
Dashboards offer visibility, and Loki offers context. Together, they enable faster debugging and clearer insights.
Kubernetes offers flexibility, and so does the monitoring stack. While Prometheus is the default for scraping metrics, other tools like Vector and Thanos can provide different approaches depending on your environment.
Following these best practices and adapting the examples here will give you a framework for building, extending, and evolving a monitoring setup that grows with your cluster and your team.
