In a Kubernetes cluster, a node can die or reboot at any time.
Tools like Kubernetes are designed to be highly available, robust, and able to recover automatically in such scenarios, and Kubernetes accomplishes this very well.
However, you might notice that when a node goes down, the pods of the broken node are still considered running for some time, so they keep receiving requests, and those requests fail.
That time can be reduced, because in my opinion the default is too high. There are a bunch of parameters to tweak in the Kubelet and in the Controller Manager.
This is the workflow of what happens when a node gets down:
1- The Kubelet posts its status to the masters using --node-status-update-frequency=10s
2- A node dies
3- The kube controller manager is the one monitoring the nodes: every --node-monitor-period=5s it checks, in the masters, the node status reported by the Kubelet.
4- The kube controller manager will see the node is unresponsive and waits for the grace period --node-monitor-grace-period=40s before it considers the node unhealthy. According to the documentation, this parameter must be N times node-status-update-frequency, N being the number of retries allowed for the Kubelet to post node status. N is a constant in the code equal to 5, check var nodeStatusUpdateRetry in https://github.com/kubernetes/kubernetes/blob/e54ebe5ebd39181685923429c573a0b9e7cd6fd6/pkg/kubelet/kubelet.go
Note that the default values don’t fulfill what the documentation says, because:
node-status-update-frequency x N != node-monitor-grace-period (10 x 5 != 40)
But as far as I can tell, 5 post attempts of 10s each fit in 40s: the first one at second zero, the second one at second 10, and so on, until the fifth and last one at second 40.
So the real equation would be:
node-status-update-frequency x (N-1) = node-monitor-grace-period (10 x 4 = 40)
The short sketch after the workflow below walks through this arithmetic.
More info:
https://github.com/kubernetes/kubernetes/blob/3d1b1a77e4aca2db25d465243cad753b913f39c4/pkg/controller/node/nodecontroller.go
5- Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout. By default it's 5m, which in my opinion is too high, because even though the node is already marked as unhealthy, the kube controller manager won't remove its pods yet, so they will still be reachable through their service and requests will fail.
6- Kube proxy has a watcher on the API, so the very first moment the pods are evicted it will notice and update the node's iptables, removing the endpoints from the services so the failing pods won't be accessible anymore.
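To make the timing above concrete, here is a minimal back-of-the-envelope sketch in plain Python. This is not Kubernetes code; the values are simply the default flag values quoted in this post, and failure_window is an illustrative helper, not anything the cluster exposes.

```python
# Back-of-the-envelope model of the timings above, using the default flag
# values quoted in this post (not anything read from a live cluster).

NODE_STATUS_UPDATE_FREQUENCY = 10   # kubelet --node-status-update-frequency, in seconds
NODE_STATUS_UPDATE_RETRY = 5        # N, the nodeStatusUpdateRetry constant in the kubelet code
NODE_MONITOR_GRACE_PERIOD = 40      # controller manager --node-monitor-grace-period
POD_EVICTION_TIMEOUT = 5 * 60       # controller manager --pod-eviction-timeout

# The N status posts happen at seconds 0, 10, 20, 30 and 40, so the last one lands
# (N - 1) * frequency after the first, which is why the grace period matches
# frequency * (N - 1) rather than frequency * N.
assert NODE_STATUS_UPDATE_FREQUENCY * (NODE_STATUS_UPDATE_RETRY - 1) == NODE_MONITOR_GRACE_PERIOD

def failure_window(grace_period, eviction_timeout):
    """Rough worst case: seconds from the node dying until its pods are evicted."""
    return grace_period + eviction_timeout

print(failure_window(NODE_MONITOR_GRACE_PERIOD, POD_EVICTION_TIMEOUT))  # 340 seconds = 5m40s
```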
These values can be tweaked so you get fewer failed requests when a node goes down.
I've set these in my cluster:
kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
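Where exactly these flags go depends on how the cluster was deployed (systemd units, static pod manifests, kubeadm, and so on), so the snippet below is only an illustration of which component takes which flag; the "..." stands for whatever other arguments your deployment already passes:

```
kubelet ... \
  --node-status-update-frequency=4s

kube-controller-manager ... \
  --node-monitor-period=2s \
  --node-monitor-grace-period=16s \
  --pod-eviction-timeout=30s
```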
The results are quite good: we've moved from a node-down detection (and pod eviction) time of 5m40s to 46s.
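Those numbers match the back-of-the-envelope helper from earlier; plugging in the tuned values (failure_window is the illustrative function from the sketch above, not anything Kubernetes exposes):

```python
# 16s grace period + 30s eviction timeout with the tuned flags
print(failure_window(grace_period=16, eviction_timeout=30))  # 46 seconds, down from 340 (5m40s)
```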