Kubernetes Pod Eviction and Preemption: How do they work?
In Kubernetes best practices, setting resource quotas and limits per namespace addresses part of the problem of cluster overload.
As an anecdote, one day I ran into the following situation: for non-technical reasons, I couldn't enforce quotas or limits of any sort in various namespaces.
So I dug into the mechanics of pod scheduling and eviction to see how I could give certain pods higher priority than others.
In this article, we will walk through the logic Kubernetes applies for eviction (killing pods when a node runs out of resources) and pod preemption (removing lower-priority pods so that higher-priority pods can be scheduled), in order to improve your resilience.
Kubernetes pod eviction
When your cluster is saturated, you can of course enable auto-scaling for your nodes and pods, but this is not instantaneous, especially for nodes. It can be even more challenging in on-premises environments.
When your cluster reaches saturation, Kubernetes must make decisions to ensure that priority pods are not evicted from nodes under pressure. If you see pods with an Evicted status, this should pique your interest.
Kubernetes assigns a Quality of Service (QoS) class to each pod based on its requests and limits:
Guaranteed: the safest QoS, giving you the strongest assurance of not being evicted. Every container in the pod must have requests equal to its limits for both CPU and memory.
Burstable: assigned when at least one container has a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria.
BestEffort: the first to be evicted when the node is under pressure. If you set no requests or limits at all, this QoS is assigned to your pod.
Setting requests and limits on your workloads is therefore essential to anticipate their behavior during peak loads.
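To make this concrete, here is a minimal sketch of a pod manifest (the pod name is illustrative) that lands in the Guaranteed class because its container's requests equal its limits for both CPU and memory. Drop the limits and it becomes Burstable; drop requests and limits entirely and it becomes BestEffort.
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo            # illustrative name
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 100m          # requests == limits for CPU and memory => Guaranteed
          memory: 128Mi
You can confirm the assigned class with kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'.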
Kubernetes pod preemption
We've just seen what happens with eviction, which concerns pods that are already running. To recap, Quality of Service (QoS) only protects pods during eviction, not during scheduling. If a pod fails and needs to be rescheduled while the cluster is at 100% capacity, what comes into play is what we call pod preemption.
In other words, it's a priority that Kubernetes takes into account during scheduling. By default, two PriorityClasses are used by system components:
$ kubectl get priorityclass
NAME VALUE GLOBAL-DEFAULT AGE
system-cluster-critical 2000000000 false 7m56s
system-node-critical 2000001000 false 7m56s
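If you want to inspect one of these classes in more detail (its value, whether it is the global default, and its preemption policy), kubectl describe works as usual:
kubectl describe priorityclass system-node-critical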
Let's take a closer look at how this works on GKE (Google Kubernetes Engine):
kubectl get pods --all-namespaces -o=custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName --sort-by='.metadata.creationTimestamp' | column -t
NAME PRIORITY_CLASS
kube-dns-7d5998784c-kw98d system-cluster-critical
kube-dns-autoscaler-9f89698b6-shzj8 system-cluster-critical
event-exporter-gke-857959888b-kwpwm <none>
konnectivity-agent-67c68ff74f-kwfhl system-cluster-critical
konnectivity-agent-autoscaler-bd45744cc-49bbf system-cluster-critical
l7-default-backend-6dc845c45d-4jt95 <none>
gke-metrics-agent-hj942 system-node-critical
fluentbit-gke-qmmbl system-node-critical
pdcsi-node-x5fbb system-node-critical
pdcsi-node-6n2tl system-node-critical
gke-metrics-agent-gxm2s system-node-critical
fluentbit-gke-4bp9n system-node-critical
fluentbit-gke-pd4dh system-node-critical
gke-metrics-agent-wpqft system-node-critical
pdcsi-node-k2z6v system-node-critical
kube-proxy-gke-cluster-1-default-pool-aaab76fd-85qj system-node-critical
konnectivity-agent-67c68ff74f-5bxbk system-cluster-critical
kube-dns-7d5998784c-9qfnm system-cluster-critical
konnectivity-agent-67c68ff74f-hj2b4 system-cluster-critical
metrics-server-v0.5.2-6bf845b67f-gslvk system-cluster-critical
kube-proxy-gke-cluster-1-default-pool-aaab76fd-p2tk system-node-critical
kube-proxy-gke-cluster-1-default-pool-aaab76fd-7mzj system-node-critical
We can see that all the pods except two have a high PriorityClass and can therefore still be rescheduled even under pressure.
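For your own workloads, referencing a PriorityClass is a single field in the pod spec. Here is a minimal sketch (the Deployment name is illustrative, and it assumes a PriorityClass named high-priority-class exists; we create exactly such a class later in this article):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app                            # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      priorityClassName: high-priority-class   # assumes this PriorityClass exists
      containers:
        - name: app
          image: nginx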
Use case: Kyverno
Let's imagine that we want to install Kyverno. Here are the installation commands:
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno -n kyverno --create-namespace --set replicaCount=3
By default, these pods don't have any PriorityClass, but I consider this application very important: it should stay up and running even under load. We can see that Kubernetes assigns neither a Guaranteed QoS nor a PriorityClass:
vincn_ledan@cloudshell:~ (medium-article-377208)$ kubectl get pods -n kyverno -o custom-columns="NAME:.metadata.name, QOS:.status.qosClass, REQUESTS:.spec.containers[*].resources.requests, LIMITS:.spec.containers[*].resources.limits, PRIORITY_CLASS:.spec.priorityClassName"
NAME QOS REQUESTS LIMITS PRIORITY_CLASS
kyverno-5f69449d46-9gcfc Burstable map[cpu:100m memory:128Mi] map[memory:384Mi] <none>
kyverno-5f69449d46-c96w6 Burstable map[cpu:100m memory:128Mi] map[memory:384Mi] <none>
kyverno-5f69449d46-f8dgb Burstable map[cpu:100m memory:128Mi] map[memory:384Mi] <none>
kyverno-cleanup-controller-8d8cbd588-8tlrh Burstable map[cpu:100m memory:64Mi] map[memory:128Mi] <none>
kyverno-cleanup-controller-8d8cbd588-gcl6b Burstable map[cpu:100m memory:64Mi] map[memory:128Mi] <none>
kyverno-cleanup-controller-8d8cbd588-ghtf8 Burstable map[cpu:100m memory:64Mi] map[memory:128Mi] <none>
Now let's make sure these pods get the Guaranteed QoS class as well as a dedicated PriorityClass. We will assign the value 999999999, which is below the system PriorityClasses and just under the maximum value of one billion allowed for user-defined classes.
Let's create the priority.yaml file and apply it:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-class
value: 999999999
globalDefault: false
description: "high priority"
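Apply it and check that the class is registered:
kubectl apply -f priority.yaml
kubectl get priorityclass high-priority-class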
Let's reinstall Kyverno (removing the previous release first) with the correct parameters:
helm install kyverno kyverno/kyverno \
--namespace kyverno \
--create-namespace \
--set replicaCount=3 \
--set test.resources.limits.cpu=100m \
--set test.resources.limits.memory=128Mi \
--set test.resources.requests.cpu=100m \
--set test.resources.requests.memory=128Mi \
--set resources.requests.cpu=100m \
--set resources.requests.memory=128Mi \
--set resources.limits.cpu=100m \
--set resources.limits.memory=128Mi \
--set initResources.limits.memory=64Mi \
--set initResources.limits.cpu=10m \
--set initResources.requests.memory=64Mi \
--set initResources.requests.cpu=10m \
--set cleanupController.resources.limits.memory=64Mi \
--set cleanupController.resources.limits.cpu=100m \
--set cleanupController.resources.requests.memory=64Mi \
--set cleanupController.resources.requests.cpu=100m \
--set priorityClassName=high-priority-class
Now we can see that the pods have the correct QoS and the correct PriorityClass.
vincn_ledan@cloudshell:~ (medium-article)$ kubectl get pods -n kyverno -o custom-columns="NAME:.metadata.name, QOS:.status.qosClass, REQUESTS:.spec.containers[*].resources.requests, LIMITS:.spec.containers[*].resources.limits, PRIORITY_CLASS:.spec.priorityClassName"
NAME QOS REQUESTS LIMITS PRIORITY_CLASS
kyverno-6fd6db997c-ptc8n Guaranteed map[cpu:100m memory:256Mi] map[cpu:100m memory:256Mi] high-priority-class
kyverno-6fd6db997c-q7hwj Guaranteed map[cpu:100m memory:256Mi] map[cpu:100m memory:256Mi] high-priority-class
kyverno-6fd6db997c-zwj5m Guaranteed map[cpu:100m memory:256Mi] map[cpu:100m memory:256Mi] high-priority-class
First, let's stress the cluster to saturate its resources.
kubectl apply -f https://raw.githubusercontent.com/giantswarm/kube-stresscheck/master/examples/cluster.yaml
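While the stress workload ramps up, you can watch the pressure building; the first command assumes metrics-server is available (it is by default on GKE, as seen in the earlier pod list):
kubectl top nodes
kubectl get pods -A --field-selector=status.phase=Pending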
Then, deploy a simple nginx pod. Notice that it cannot be scheduled because no resources are available:
kubectl create deploy nginxtest --image=nginx
kubectl get pods
default nginxtest-699448454d-8kmpb 0/1 Pending 0 11s
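To confirm why the pod stays Pending, describe it and look at the Events section; you should find a FailedScheduling event reporting insufficient resources (the pod name below is from my run, yours will differ):
kubectl describe pod nginxtest-699448454d-8kmpb | grep -A 5 Events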
Now, let's delete one of the Kyverno pods; its replacement should be scheduled ahead of the nginx pod thanks to its higher priority.
kubectl delete pods kyverno-6fd6db997c-q7hwj -n kyverno --force
kubectl get pods -n kyverno
NAME READY STATUS RESTARTS AGE
kyverno-6fd6db997c-ptc8n 1/1 Running 0 16m
kyverno-6fd6db997c-xc226 1/1 Running 0 9m32s
kyverno-6fd6db997c-zwj5m 1/1 Running 0 16m
Advice
It is important to classify your applications so you can assign them the right QoS class and PriorityClass, and thus better control the cluster's behavior under pressure.
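As a starting point for that classification, an audit command along the lines of the ones used earlier in this article gives you a quick overview of which workloads are protected and which are not:
kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass,PRIORITY_CLASS:.spec.priorityClassName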