Kubernetes offers numerous features to ensure high availability for applications running in its cluster environment. One critical aspect of maintaining high availability is managing pod disruptions effectively.

In Kubernetes terminology, pod disruption refers to the temporary unavailability or termination of a Kubernetes pod, which can occur voluntarily or involuntarily.

Understanding Pod Disruption

Pods do not disappear until someone (a user or controller) destroys them, or there is an unavoidable hardware or system software error.

Pod disruptions can be classified into voluntary and involuntary disruptions. Voluntary disruptions are intentional and controlled, while involuntary disruptions are unexpected and beyond one’s control.

Voluntary disruptions

  • Draining a node for a Kubernetes upgrade or reboot

  • Draining a node from the cluster to scale the cluster down (e.g. via the Cluster Autoscaler)

  • Updating a deployment’s pod template causing a restart

  • Deleting a pod (e.g. by accident)

These actions can be taken by the cluster administrator or by the application owner.

Involuntary disruptions

  • Node outage due to hardware or hypervisor failure or kernel panic

  • Node disappears from the cluster due to cluster network partition

  • Eviction of a pod due to the node being out-of-resources

All of these conditions should be familiar to most users, except perhaps the out-of-resources condition, which Kubernetes handles through node-pressure eviction.

Best Practices for High Availability

Let’s delve into common issue scenarios and the corresponding best-practice solutions to prevent application downtime:

1. Deploy Multiple Pods (Replicas)

Issue: Deploying only one pod can lead to application downtime if it becomes unavailable, such as during a pod restart.

Solution: Set the replicas value to 2 or greater to ensure redundancy and high availability. Optionally, use an autoscaler (e.g. the HorizontalPodAutoscaler or KEDA’s ScaledObject) to scale out when needed.
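
As a minimal sketch, a Deployment with two replicas might look like this (the name, labels, and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # placeholder name
spec:
  replicas: 2                 # at least two pods for redundancy
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.0   # placeholder image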


2. Spread Pods Across Availability Zones

Issue: Deploying pods on a single node or within one availability zone can lead to application downtime if that node or zone becomes unavailable.

Solution: Use topologySpreadConstraints or pod anti-affinity to distribute pods across availability zones, and set the replicas value to 3 or greater so that, with three zones, at least one pod can run in each availability zone.
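
For example, a sketch of a topology spread constraint in the pod template (the app label is a placeholder) could look like:

topologySpreadConstraints:
  - maxSkew: 1                                  # zones may differ by at most one pod
    topologyKey: topology.kubernetes.io/zone    # spread across availability zones
    whenUnsatisfiable: DoNotSchedule            # hold scheduling rather than skew
    labelSelector:
      matchLabels:
        app: my-app                             # placeholder label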


3. Avoid Simultaneous Pod Disruptions

Kubernetes offers features to help you run highly available applications even when you introduce frequent voluntary disruptions.

Issue: Simultaneously draining all nodes hosting application pods can cause application downtime during voluntary disruptions.

Solution: Implement a PodDisruptionBudget (PDB) resource to ensure a minimum number of pods remains operational during voluntary disruptions. Do not use a PDB when you only have one replica, as this will block Kubernetes and node upgrades.
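
A minimal PDB sketch for the placeholder app above, keeping at least one pod available at all times:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # placeholder name
spec:
  minAvailable: 1             # at least one pod must stay up during voluntary disruptions
  selector:
    matchLabels:
      app: my-app             # placeholder label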


4. Correctly Sized Resource Requests and Limits

Requests specify the minimum CPU and memory requirements for a container to operate, with the kube-scheduler utilizing resource requests to determine suitable nodes. Limits establish the maximum CPU and memory usage permitted for a container, preventing excessive resource consumption and safeguarding cluster stability and performance.

Issue: Undersized or improperly configured resource requests and limits can lead to resource contention, out-of-memory issues, or pod restarts.

Solution: Set memory requests and limits according to nominal usage, set CPU requests according to nominal usage, and avoid setting CPU limits, which can cause unnecessary throttling.
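
A container resources sketch following this advice (the values are illustrative, not recommendations):

resources:
  requests:
    cpu: "250m"        # nominal CPU usage
    memory: "256Mi"    # nominal memory usage
  limits:
    memory: "256Mi"    # cap memory to protect the node
    # no CPU limit, so the container can burst without being throttled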


5. Always Use Health Checks (Liveness and Readiness Probes)

Kubernetes probes, specifically the liveness and readiness probes, are essential for effective health monitoring of pods. The liveness probe determines whether the application inside the container is still alive (Kubernetes restarts the container if it fails), while the readiness probe determines whether the pod is ready to accept traffic.

Issue: Not defining liveness and readiness probes can result in Kubernetes not detecting crashed or unhealthy application states, leading to downtime or errors.

Solution: Always set liveness and readiness probes to ensure Kubernetes can accurately determine the health and readiness of pods.
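
A sketch of both probes on a container (the endpoints and port are assumptions about the application):

livenessProbe:
  httpGet:
    path: /healthz          # hypothetical liveness endpoint
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready            # hypothetical readiness endpoint
    port: 8080
  periodSeconds: 5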


6. Use Readiness Probe During Rolling Updates

The rolling deployment is the default deployment strategy in Kubernetes. It replaces pods of the previous version of your application with pods of the new version, one by one, without any cluster downtime.

Issue: Rolling updates, while ensuring minimal downtime, can still result in downtime until new pods are ready to handle requests.

Solution: Always use readiness probes during rolling updates to ensure new pods are fully ready before removing old ones, thus minimizing downtime.
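
One way to make this explicit, assuming readiness probes are already defined, is a rolling update strategy that never removes an old pod before a new one is ready:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1           # bring up one extra pod at a time
    maxUnavailable: 0     # never drop below the desired replica count

With maxUnavailable set to 0, Kubernetes only terminates an old pod once a replacement reports Ready.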


7. Graceful Termination

To ensure graceful pod termination, the application needs to handle the SIGTERM signal properly. This involves stopping new traffic, closing database connections, and completing any ongoing operations before shutting down. Kubernetes waits 30 seconds by default after sending SIGTERM, but this grace period can be extended for applications that need a longer shutdown.

Issue: When a pod is shut down without handling the termination signal, in-flight requests can fail, leading to client-facing errors.

Solution: To ensure a graceful shutdown of the application, handle the SIGTERM signal and consider extending the wait timeout if necessary.
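
A pod spec sketch that extends the grace period (the 60-second value and the preStop sleep are assumptions; the application itself must still trap SIGTERM):

spec:
  terminationGracePeriodSeconds: 60            # extend the 30-second default
  containers:
    - name: my-app                             # placeholder name
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]   # brief pause so load balancers stop sending traffic before SIGTERM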


Summary

In summary, ensuring high availability in Kubernetes applications involves:

  • Have at least two replicas, optionally use autoscaling

  • Add health checks (probes)

  • Add a PodDisruptionBudget

  • Use pod anti-affinity or topology spread constraints

  • Allocate sufficient resources

  • Handle SIGTERM in the application

By following these best practices, Kubernetes users can significantly enhance the availability and reliability of their applications in cluster environments. Don’t forget to monitor application workloads to detect downtime or unexpected errors.