Kubernetes continues to be the most popular platform for orchestrating containerized applications and managing microservices at scale. By encapsulating each microservice within a container and deploying it as a pod, Kubernetes ensures isolation, resource efficiency, and fault tolerance. These pods are then distributed across worker nodes in your cluster while the control plane manages the overall orchestration.
However, as simple as it may sound, harnessing the true power of Kubernetes requires more than just understanding its basic building blocks. Achieving optimal performance, security, and reliability means understanding the platform’s core functionalities and applying proven best practices.
In this article, we discuss the core aspects of managing Kubernetes and how adopting best practices can help you build, deploy, and manage efficient applications.
Summary of key Kubernetes best practices
Area | Best practice
Resource management | Segregate different environments into namespaces and configure resource limits. |
Pod health check probes | Configure health probes to check pods’ readiness for traffic and auto-restart failed pods. |
Config maps and secrets | Decouple configurations from application code and store secrets with a higher level of security. |
Network, application, and cluster security | Improve your security posture by configuring network and role-based access control, encrypting sensitive data on the wire and at rest. |
CIS Kubernetes benchmarks | Validate the security configurations with the industry standards provided by CIS using tools like kube-bench. |
Centralized monitoring | Detect, diagnose, and resolve problems with centralized logging of your distributed architecture. |
CI/CD and GitOps | Reduce the risk of human error, accelerate development, and ensure consistency with automation. |
Static code analysis | Scan the CI/CD code and K8s manifests for exposure of credentials or misconfigured permissions using automated tools like Terrascan. |
Backups and disaster recovery | Maintain appropriate backups and implement a disaster recovery solution. In case of loss of data or infrastructure, you can return to your production state with minimal loss of time and data. |
Namespace management
Kubernetes (K8s) provides a mechanism called namespaces for logically segregating resources or environments within a single cluster. This allows you to partition your cluster into separate environments—such as development, staging, and production—or organize it by teams. Namespaces also support resource quotas and access policies, enabling you to manage resource usage and minimize the risk of interference between environments.
Resource names must be unique within a namespace but can be reused in other namespaces, which prevents naming conflicts between teams and environments sharing a cluster. Red Hat OpenShift builds on this concept with OpenShift Projects, which extend namespaces with additional metadata and context to enhance multitenancy. A project is essentially a namespace associated with additional attributes such as quotas, limits, and network policies.
Consider leveraging projects for fine-grained role-based access control (RBAC) across multiple namespaces. For complex deployments with multiple teams and environments, this means you can achieve greater control and finer isolation of resources.
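As an illustrative sketch, the manifest below creates a dedicated namespace for a staging environment; the namespace name and label are hypothetical examples.

apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    environment: staging

Once created, the namespace can be targeted by resource quotas, network policies, and RBAC rules, as discussed in the following sections.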
Resource requests and limits
Resource management is important in any cloud environment to ensure that your applications run smoothly while keeping costs under control. It is recommended to define restrictions for K8s resources to optimize the utilization of the nodes and prevent one application from starving other pods of resources.
K8s allows you to define CPU and memory quotas, either per container or at the namespace level. There are two kinds of settings: requests and limits. Requests are used by the K8s scheduler to place containers on nodes and represent the minimum guaranteed resources for a container. Limits are the maximum allowed resource usage, which containers cannot exceed.
Example
The following manifest file creates an NGINX pod with a request of 0.5 CPU and 256 MiB of memory and limits of 1 CPU and 512 MiB.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  namespace: webserver-namespace
spec:
  containers:
    - name: nginx-container
      image: nginx:latest
      resources:
        requests:
          memory: "256Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"
          cpu: "1"
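Requests and limits can also be capped for an entire namespace with a ResourceQuota. The following is a minimal sketch, assuming the webserver-namespace from the example above; the totals are placeholders to adjust for your own capacity planning.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: webserver-quota
  namespace: webserver-namespace
spec:
  hard:
    requests.cpu: "4"        # Total CPU requests allowed across the namespace
    requests.memory: 8Gi     # Total memory requests allowed across the namespace
    limits.cpu: "8"          # Total CPU limits allowed across the namespace
    limits.memory: 16Gi      # Total memory limits allowed across the namespace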
Pod health check probes
K8s supports configuring probes to check the health status of containers. It is recommended that these health probes be used so that the K8s cluster can detect when pods are ready to receive traffic after creation and can restart non-responsive containers.
There are three types of probes available for containers.
Liveness probe
K8s can monitor container health to verify that containers are still alive. Over a long running time, a container can get stuck for a variety of reasons and become unable to serve the application. K8s checks the pod’s health periodically via a liveness probe and restarts the container if the health check fails.
Readiness probe
When you launch a container, it might not become ready for service immediately due to initial time-consuming tasks. Until these initial tasks are complete, the container cannot accept application traffic. K8s uses a readiness probe to determine whether the container can serve traffic as an application endpoint. Readiness is monitored periodically, and the pod is removed from the service endpoints if the health check fails.
Startup probe
This is helpful for containers that have a long startup time. If defined, it delays the liveness and readiness probes until the startup probe check is successful.
Example
This example creates a web application container with health probes to monitor its status using the following criteria:
- Startup probe: To ensure that the pod starts correctly, an HTTP request will be sent to /startup on port 8080, beginning 20 seconds after the container starts.
- Readiness probe: Once the pod is started, an HTTP request will be sent to /ready to check that it is ready to receive traffic.
- Liveness probe: Throughout the pod’s lifecycle, periodic HTTP requests will be sent to /healthz to confirm that the pod is still functioning.
Configure the following manifest. If the startup or liveness probe fails consecutively beyond the defined failure threshold, Kubernetes restarts the container to maintain application availability; if the readiness probe fails, the pod is removed from the service endpoints until it recovers.
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp-container
      image: my-application:latest
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /startup
          port: 8080
        initialDelaySeconds: 20
        periodSeconds: 10
        failureThreshold: 5
Using ConfigMaps and secrets
ConfigMaps
In Kubernetes, it is recommended to store configuration data separately from application logic. K8s stores configuration data as key-value pairs in ConfigMaps. This allows flexible management of configurations without changing application code or rebuilding and redeploying container images.
Example
Here’s how to create a ConfigMap to provide database host and port configuration details as key-value pairs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_host: dbhost.example.com
  database_port: "3306"
Now apply the ConfigMap:
kubectl apply -f configmap.yaml
The pod below consumes the database host and port from the ConfigMap, injecting them as environment variables.
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
    - name: app-container
      image: my-app:latest
      env:
        - name: DATABASE_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: database_host
        - name: DATABASE_PORT
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: database_port
Secrets
In K8s, secrets are used to store sensitive information like passwords, API tokens, and SSH keys. This keeps sensitive information separate from the application code so that it doesn’t get exposed. K8s also optionally supports encryption at rest for secrets.
Example
Secrets support two types of maps: data and stringData. The data map requires values to be encoded in base64 format, while stringData accepts plain strings and encodes them for you.
apiVersion: v1
kind: Secret
metadata:
  name: app-credentials
type: Opaque
data:
  username: bXl1c2VybmFtZQ==
  password: bXlzZWN1cmVwYXNz
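The base64 values in the data map above can be generated with standard shell tools; the plaintext values below are illustrative placeholders.

# Encode each value before adding it to the data map
echo -n 'myusername' | base64     # bXl1c2VybmFtZQ==
echo -n 'mysecurepass' | base64   # bXlzZWN1cmVwYXNz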
Apply the secret above with kubectl, then consume it in a pod via environment variables:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: mycontainer
      image: myimage
      env:
        - name: USERNAME
          valueFrom:
            secretKeyRef:
              name: app-credentials
              key: username
        - name: PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-credentials
              key: password
Networking and security
In live production deployments, controlling traffic flow across the K8s cluster and between pods and the external networks is highly recommended. You should only allow IPs and ports that are required for the application to function and block everything else to reduce the attack surface.
Network access control
To apply network access control, you require a K8s network plugin (like Calico or Cilium) that supports network policies.
Example
Create a policy to deny all ingress and egress traffic from pods:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: default
spec:
  podSelector: {}   # Selects all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
Create a policy to allow ingress traffic from pods labeled frontend to talk to pods labeled backend:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-ingress
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
In addition to applying network policies, you also need to make sure the communication over the network is encrypted to secure it from prying eyes. You should use TLS for your service connections and also encrypt the ingress and egress traffic.
Create a TLS secret:
kubectl create secret tls my-tls-secret \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key \
  --namespace=default
Create an ingress resource that uses TLS for communication:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  tls:
    - hosts:
        - yourdomain.com
      secretName: my-tls-secret
  rules:
    - host: yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
Role-based access control (RBAC)
In any infrastructure operations, it is highly recommended that each user or component be restricted to the minimum level of required permissions. In K8s, RBAC is used to define and control access to resources within the cluster. You can define which user or application has what type of access to K8s resources like pods or services. You can define a role that associates permissions within a specific namespace, and you can create a ClusterRole that defines cluster-wide permissions.
Example
Create a role that allows read-only access to pods within a specific namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: prod-env
  name: pod-reader
rules:
  - apiGroups: [""]   # "" indicates the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
Create a ClusterRole that grants read-only access to pods across all namespaces in the cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-pod-reader
rules:
  - apiGroups: [""]   # "" indicates the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
Bind the read-only role to a specific user for a specific namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
  - kind: User
    name: [email protected]
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Assign a cluster-wide read-only role to a specific user:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: read-pods-cluster-wide
subjects:
  - kind: User
    name: [email protected]
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-pod-reader
  apiGroup: rbac.authorization.k8s.io
Security context
K8s provides security contexts that you can use to restrict container privileges, kernel capabilities, and filesystem access. You can use this to limit privilege escalation and force containers to run as specific non-root users. You can also define specific ownership and permissions for mounted filesystems.
Example
Create a container that runs with specific non-root user and group IDs, disallows privilege escalation, ensures that mounted volumes are owned by a specific group, and mounts the root filesystem as read-only.
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
    fsGroup: 3000
  containers:
    - name: secure-container
      image: my-secure-app:latest
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
OpenShift security context
Red Hat OpenShift enforces a high level of security by limiting pods to the fewest privileges possible. The platform prevents containers from using protected functions like running as root or sharing the host filesystem. The OpenShift restricted security context constraints (SCC) ensure that containers run with arbitrarily assigned user IDs, root filesystems are mounted as read-only, and the container network interface is isolated from the host system.
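If a workload genuinely needs more privileges than the restricted SCC allows, a cluster administrator can grant a different SCC to that workload’s service account. A hypothetical example, assuming a service account named my-serviceaccount in a project named my-project:

# Allow the workload's service account to run with any user ID
oc adm policy add-scc-to-user anyuid -z my-serviceaccount -n my-project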
Application security
When using container images in K8s, you should always use trusted sources, such as official repositories on Docker Hub or your own private registry. Alternatively, you can download images from verified publishers like NGINX, VMware, or HashiCorp that publish high-quality, secure images.
You should use image versions that include the latest security patches. For better consistency across application environments, pin a specific image version and avoid the latest tag.
Example
Create an NGINX deployment using the official Docker Hub repository for NGINX and a specific mainline release version.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          # Image from the official Docker Hub NGINX repo
          image: nginx:1.27.2   # Specific version tag
          ports:
            - containerPort: 80
CIS Kubernetes benchmarks
The Center for Internet Security (CIS) provides security guidelines and benchmarks for operating systems, applications, and cloud operations. These controls are the recommended security practices, created using real-world experiences and consensus review of a global community of subject matter experts.
Adopting the recommendations of the CIS Benchmark for Kubernetes is an essential step in verifying the security of the K8s environment. The benchmark provides an extensive list of checks and recommended practices for API server configuration, etcd configuration, kubelet and node security, network policies, and pod security.
Although running tests manually is one option, the CIS-recommended checks can be automated using open-source tools like kube-bench. kube-bench runs the checks documented in the CIS Kubernetes benchmark and provides a pass/fail report for each individual configuration check.
A sample CIS check run using Kube-Bench (Source)
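As an illustrative example, assuming kube-bench is installed on a cluster node, the node-level checks from the CIS benchmark can be run as shown below; the available targets depend on your cluster role and kube-bench version.

# Run the CIS checks that apply to worker nodes
kube-bench run --targets node

# Run all applicable checks and save a JSON report
kube-bench run --json > kube-bench-report.json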
Deployments and replica sets
In a K8s live production environment, you need to ensure the availability of pods and apply update strategies to manage the life cycle of your applications. Using deployments, you can define rolling updates and rollbacks, allowing you to modify applications without downtime. With replica sets, you can maintain a stable set of pod replicas at any time. You can define a sufficient number of replicas required for the efficient running of your application and maintain availability in case of node failure.
Example
Here, we will create an NGINX deployment with a rolling update strategy and three replicas. During updates, the deployment ensures that at most one pod is unavailable and creates at most one additional pod above the desired replica count to reduce downtime:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3                  # Number of replicas
  selector:
    matchLabels:
      app: nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # Pods unavailable during update
      maxSurge: 1              # Additional pods above number of replicas
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27.2  # NGINX image
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 10
Logging and monitoring
K8s is a complex platform with lots of moving components. The system can generate hundreds of thousands of logs, events, and metrics. Centralized observability and monitoring solutions are highly recommended for proper visibility, diagnosis, and troubleshooting.
You can use a log collection tool like Fluentd to collect logs from K8s, transform them into your required format, and ship them to the Elasticsearch backend. Elasticsearch is a real-time distributed and scalable search engine that can index and search through a large volume of data. In addition to the above, Kibana can be deployed alongside Elasticsearch for visualization and dashboards.
For collecting metrics, you can use Prometheus along with Grafana for visualizing and alerting based on these metrics.
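One common way to deploy Prometheus and Grafana together is the community kube-prometheus-stack Helm chart. The commands below are a sketch that assumes Helm is installed; the release name and namespace are your choice.

# Add the community chart repository (one time)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus, Alertmanager, and Grafana into a monitoring namespace
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace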
CI/CD and GitOps
In large and complex infrastructure operations, it is highly recommended that automation tools and version control systems be used. With automation, you can reduce the risks of human errors, improve operational efficiency, and ensure consistent deployments across your infrastructure. With version control systems, you can enforce best practices of change management, tracking, auditing, approval, and rollback.
You can use version control platforms like GitHub or GitLab to manage your YAML manifests. To execute your pipeline deployments, you can use Jenkins as a general-purpose automation tool with a rich set of plugins to integrate with different platforms. Another option is ArgoCD, which has native integrations with Git and Kubernetes.
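As a sketch of the GitOps model, an ArgoCD Application resource points the cluster at a Git repository and keeps the deployed state in sync with it. The repository URL, path, and namespaces below are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/k8s-manifests.git  # Placeholder repository
    targetRevision: main
    path: apps/my-app            # Directory containing the YAML manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true                # Remove resources that were deleted from Git
      selfHeal: true             # Revert manual drift in the cluster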
OpenShift Pipelines
Red Hat OpenShift Pipelines is a cloud-native CI/CD system built for decentralized teams working on microservices architectures. Organizations already using OpenShift can leverage Pipelines because it is natively integrated into Kubernetes. This makes pipelines easier to manage, as they are handled using the same Kubernetes APIs and tools without the need for external CI/CD tooling.
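OpenShift Pipelines is built on the Tekton project, so pipelines are defined as Kubernetes resources. Below is a minimal sketch with a single inline task that only echoes a message; a real pipeline would reference build and deploy tasks.

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: sample-pipeline
spec:
  tasks:
    - name: run-tests
      taskSpec:
        steps:
          - name: test
            image: registry.access.redhat.com/ubi9/ubi-minimal   # Placeholder image
            script: |
              echo "Running unit tests..."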
Static code analysis
When you have a large team working on multiple aspects of your application and deploying various infrastructure components using infrastructure as code (IaC), it is highly recommended to use automated security scanning tools to perform code analysis. With multiple engineers of different skill levels, there are bound to be mistakes, like exposing secrets in plaintext, misconfiguring permissions, and missing RBAC controls.
You can use open-source tools like Terrascan to perform automated code analysis. Terrascan includes over 500 policies to scan your code against standards like CIS Benchmarks. You can integrate Terrascan with your CI/CD process to automate the code analysis process.
Example
Consider the following deployment manifest for Apache:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache-deployment
  labels:
    role: webserver
spec:
  replicas: 2
  selector:
    matchLabels:
      role: webserver
  template:
    metadata:
      labels:
        role: webserver
    spec:
      containers:
        - name: frontend
          image: httpd
          ports:
            - containerPort: 80
Scanning this with Terrascan would list the policy violations detected in the code:
$ terrascan scan -i k8s -f apache-deployment.yaml

Violation Details -

  Description : CPU Limits Not Set in config file.
  File        : apache-deployment.yaml
  Line        : 28
  Severity    : MEDIUM
  ------------------------------------------------------------------
  Description : No readiness probe will affect automatic recovery in case of unexpected errors
  File        : apache-deployment.yaml
  Line        : 28
  Severity    : LOW
  ------------------------------------------------------------------
  Description : Apply Security Context to Your Pods and Containers
  File        : apache-deployment.yaml
  Line        : 28
  Severity    : MEDIUM
  ------------------------------------------------------------------
  Description : Image without digest affects the integrity principle of image security
  File        : apache-deployment.yaml
  Line        : 28
  Severity    : MEDIUM
  ------------------------------------------------------------------
  Description : Prefer using secrets as files over secrets as environment variables
  File        : apache-deployment.yaml
  Line        : 28
  Severity    : HIGH
  ------------------------------------------------------------------
Disaster recovery and backups
For business-critical applications, it is essential to have a proper disaster recovery plan. Failure is always a possibility, whether due to a system fault, human error, or malicious activity. As a best practice, you should always have backups of your critical data and components that you can quickly restore after any type of failure.
Backup of etcd
Etcd is a distributed key-value store that holds all K8s objects and cluster states. For production environments, it is important to deploy etcd as a multi-node cluster with a minimum of three nodes (five recommended). You should also take regular backups of the etcd state to recover from any type of loss or corruption of data.
Etcd supports built-in snapshots, as shown below. Snapshots should be stored offsite, separate from the main production environment:
ubuntu@node1:~$ sudo etcdctl --endpoints https://node1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/member-node1.pem \
  --key=/etc/ssl/etcd/ssl/member-node1-key.pem \
  snapshot save etcd-backup.db
...
Snapshot saved at etcd-backup.db
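To recover from a saved snapshot, etcdctl can restore it into a new data directory, which is then used to bring the etcd member back up. The paths below are illustrative and depend on your installation.

# Restore the snapshot into a fresh data directory (illustrative paths)
sudo etcdctl snapshot restore etcd-backup.db \
  --data-dir /var/lib/etcd-from-backup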
Backup of persistent volumes
The etcd snapshots only save the configuration state of the K8s cluster, not the application data. Applications like databases use persistent volumes to store data. You can take snapshots of the storage volumes if the infrastructure supports backups (like AWS EBS snapshots or Azure Disk snapshots).
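If your cluster uses a CSI storage driver with snapshot support, volume backups can also be taken through the Kubernetes VolumeSnapshot API. A minimal sketch, assuming a VolumeSnapshotClass named csi-snapclass and a PVC named db-data already exist:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snapshot
  namespace: default
spec:
  volumeSnapshotClassName: csi-snapclass    # Assumed snapshot class provided by your CSI driver
  source:
    persistentVolumeClaimName: db-data      # Assumed PVC holding the database data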
Trilio for Kubernetes
The two backup options discussed above have some limitations. The etcd snapshot only saves the configuration state of the K8s cluster. The volume snapshots depend on the storage infrastructure and require a manual recovery process to bring the K8s cluster and applications back up.
The third option is to use Trilio for Kubernetes (T4K). With T4K, you can fully orchestrate your backup and recovery process, even if a complete K8s cluster fails. T4K can protect metadata, operators, container image registries, virtual machines, and persistent volumes. You can perform partial disaster recovery (such as metadata only) or complete disaster recovery on any K8s cluster, whether on-premises or in the cloud.
Last thoughts
Working with a production environment has its own set of challenges. You have to deal with infrastructure failures, malicious activity, human error, and cost optimization and efficiency concerns, all while serving unpredictable workloads. You must design and operate your infrastructure to preempt as many avenues of failure as possible. By following K8s operational best practices, you can mitigate failures and provide the best possible response in a disaster situation.