In your Kubernetes cluster, a node can die or reboot.
Tools like Kubernetes are designed to be highly available and robust, and to recover automatically in such scenarios, and Kubernetes accomplishes this very well.
However, you might notice that when a node goes down, the pods of the broken node keep running for some time and still receive requests, and those requests fail.
That time can be reduced, because in my opinion the default is too high. There are a bunch of parameters to tweak in the Kubelet and in the Controller Manager.
This is the workflow of what happens when a node goes down:
1- The Kubelet posts its node status to the masters, controlled by --node-status-update-frequency=10s
2- A node dies
3- The kube-controller-manager is the one monitoring the nodes: every --node-monitor-period=5s it checks, in the masters, the node status reported by the Kubelet.
4- The kube-controller-manager will see that the node is unresponsive, and will wait the grace period --node-monitor-grace-period=40s before considering the node unhealthy. According to the documentation, this parameter must be N times node-status-update-frequency, where N is the number of retries allowed for the Kubelet to post the node status. N is a constant in the code equal to 5; check the variable nodeStatusUpdateRetry in https://github.com/kubernetes/kubernetes/blob/e54ebe5ebd39181685923429c573a0b9e7cd6fd6/pkg/kubelet/kubelet.go
Note that the default values don’t fulfill what the documentation says, because:
node-status-update-frequency x N != node-monitor-grace-period (10 x 5 != 40)
But from what I can understand, 5 post attempts of 10s each are done within 40s: the first one at second zero, the second one at second 10, and so on until the fifth and last one at second 40.
So the real equation would be:
node-status-update-frequency x (N-1) = node-monitor-grace-period
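Plugging in the defaults: 10s x (5 - 1) = 40s, which matches node-monitor-grace-period.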
5- Once the node is marked as unhealthy, the kube-controller-manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout; by default it is 5m, which in my opinion is too high, because even though the node is already marked as unhealthy the kube-controller-manager won't remove its pods yet, so they will still be reachable through their Services and requests will fail.
6- kube-proxy has a watch on the API, so the very moment the pods are evicted it will notice and update the node's iptables rules, removing the endpoints from the Services so the failing pods won't be reachable anymore.
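If you want to see this happen during a test, one option (my-service is just a placeholder name here) is to watch the Service's endpoints while the node is down:

kubectl get endpoints my-service --watch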
These values can be tweaked so that you get fewer failed requests when a node goes down.
I've set these in my cluster (a sketch of how to pass them follows the list):
kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
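One way to pass them, assuming the components run directly as binaries or systemd units (all other flags omitted), is roughly:

kubelet --node-status-update-frequency=4s ...
kube-controller-manager --node-monitor-period=2s --node-monitor-grace-period=16s --pod-eviction-timeout=30s ...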
The results are quite good: we've gone from a node-down detection and eviction time of 5m40s to 46s.
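Roughly speaking, that total is node-monitor-grace-period plus pod-eviction-timeout: 40s + 5m0s = 5m40s with the defaults, and 16s + 30s = 46s with the values above.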
[…] Improving Kubernetes reliability: quicker detection of a Node down […]
nice!
what happens if all kubelets disconnect at once?
this means you will lose all your pods really fast.
Actually, you won't, but only because you are lucky. If all the Kubelets are down, the scheduler doesn't have anywhere to move the Pods to, so the Pods will keep running even though the Kubelets are down; all nodes will be marked as NotReady, but the Pods will still be running.
However, this is a situation you don't want, and one that should never happen given the high availability Kubernetes has.
nice work
You are right regarding pod eviction, but once nodes are marked as NotReady, kube-proxy will remove all their endpoints.
Interesting, could you please point me to the doc that says this?
Assuming the unhealthy node loses connectivity to the apiserver (network partition), how long will the kubelet on that worker wait before stopping the pods it's running?
I'm not sure about the answer to this; I would need to run some tests.
But what I do know is that if this happens, from Kubernetes you will probably see the Pods in Unknown state; however, the Pods on the unhealthy node will remain alive.
I don't know for how long; right now I'm not sure how fast the Kubelet realizes it can't reach the apiserver.
Which files in the Kubernetes cluster do I need to change for these values to take effect?
That depends on how your cluster is deployed, but you need to set those values in the Kubelet and controller manager configuration. Where are those files? It depends on your type of deployment.
Take into account that this post is from 2 years ago and many things have changed in Kubernetes since then; these parameters and values may not apply anymore.
I have a container which listens on a RabbitMQ bus and forwards the status of the cluster to a UI. How can I listen for a node failure event so that I can pass a message to RabbitMQ saying whether the node has failed or not? The other option I have is to poll regularly for the status using kubectl get nodes.
Normally that should be done through a monitoring platform that watches the Kubernetes cluster from the outside and tells you, for example, when Pods or Nodes are failing.
However, if you want to check if a node is healthy from within Kubernetes you can use this: https://kubernetes.io/docs/tasks/debug-application-cluster/monitor-node-health/
And from your container you can run HTTP queries against the API server, which will tell you the state of basically everything, including the Node state.
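As a rough sketch of the polling approach, assuming the pod's ServiceAccount is allowed to list nodes (RBAC) and using the standard in-cluster token and CA paths:

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc/api/v1/nodes

Each node in the response has status.conditions, and the Ready condition tells you whether the node is healthy; your container could parse that and publish the result to RabbitMQ.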
[…] for the pods to be rescheduled on the healthy nodes, even after the configuration changes mentioned here. Need to look into this a bit more. Here is a video demoing the same in a small cluster. We can […]
It is OK that we are able to evict the pods with the given parameters. But why don't the liveness and readiness probes detect that the application is not in a healthy state?
I understand that when you say applications, you mean the Pods. If a node goes down, the Pods die, so at that stage there are no liveness or readiness probes running. What this post is trying to improve is the time it takes to reschedule the pods from one node to another.
[…] With the configuration above, the time from when a node goes down to when failed pods are rebalanced to other nodes is about 45 seconds. Much better. More info about this here. […]
Hello,
I deployed a Kubernetes cluster using kubeadm. May I know how I can set up these parameters:
kubelet: node-status-update-frequency
controller-manager: node-monitor-period
controller-manager: node-monitor-grace-period
controller-manager: pod-eviction-timeout
Thank you.
At the time of writing this post kubeadm didn't exist and the Kubernetes version was much older, so I don't know the exact equivalent right now. But at the end of the day the same components are running in the cluster, whether deployed manually or through kubeadm, so it is just a matter of finding where they are (probably running as containers) and updating the parameters that are passed to them.
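For what it's worth, on recent kubeadm clusters (treat this as a sketch, not something verified against the versions discussed in this post) the Kubelet reads /var/lib/kubelet/config.yaml and the controller manager runs as a static pod defined in /etc/kubernetes/manifests/kube-controller-manager.yaml, so the changes would look roughly like this:

# /var/lib/kubelet/config.yaml (KubeletConfiguration)
nodeStatusUpdateFrequency: 4s

# /etc/kubernetes/manifests/kube-controller-manager.yaml, added to the container's command
- --node-monitor-period=2s
- --node-monitor-grace-period=16s
- --pod-eviction-timeout=30s

Some of these flags may have been deprecated or replaced in newer releases, so check the documentation for your version.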
Hello,
I have a question. How can I set node-monitor-grace-period=3s or less? I want to implement the following behaviour.
I wrote a YAML file which includes:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 1
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 1
When a worker node goes down, the pod should restart within 4 seconds on another worker node. The restart condition is that the node's state is NotReady.
What do I need to modify? Which parameters should be changed? Do I need to modify the nodeStatusUpdateFrequency parameter in the kubelet?
My Kubernetes version is v1.14.2. I used kubeadm.
I am not good at English.
[…] generous for our use case (for example here). Drastically reducing the timeouts as explained here, here and here — in some cases from 60 seconds (!) to only 4 — was enough to reduce the request stall […]