Improving Kubernetes reliability: quicker detection of a Node down

In a Kubernetes cluster, a node can die or reboot at any time.

Tools like Kubernetes are designed to be highly available, robust, and able to recover automatically in such scenarios, and Kubernetes accomplishes this very well.

But you might notice that when a node goes down, the pods of the broken node are still considered running for some time, and they still receive requests; those requests will fail.

That time can be reduced, because in my opinion the default is too high. There are a bunch of parameters to tweak in the kubelet and in the controller manager.

This is the workflow of what happens when a node goes down:

1- The kubelet posts its status to the masters every --node-status-update-frequency=10s

2- A node dies

3- The kube controller manager is the one monitoring the nodes: every --node-monitor-period=5s it checks, in the masters, the node status reported by the kubelet.

4- The kube controller manager will see the node is unresponsive, and gives it a grace period of --node-monitor-grace-period=40s before considering it unhealthy. This parameter should be N times node-status-update-frequency, where N is the number of attempts allowed for the kubelet to post the node status. N is a constant in the code equal to 5; check the variable nodeStatusUpdateRetry in https://github.com/kubernetes/kubernetes/blob/e54ebe5ebd39181685923429c573a0b9e7cd6fd6/pkg/kubelet/kubelet.go

Note that the default values don’t fulfill what the documentation says, because:

node-status-update-frequency x N != node-monitor-grace-period   (10 x 5 != 40)

But as far as I can understand, 5 posting attempts of 10s each are done in 40s: the first one at second zero, the second one at second 10, and so on, until the fifth and last one at second 40.

So the real equation would be:

node-status-update-frequency x (N-1) = node-monitor-grace-period   (10 x 4 = 40)

More info:

https://github.com/kubernetes/kubernetes/blob/3d1b1a77e4aca2db25d465243cad753b913f39c4/pkg/controller/node/nodecontroller.go
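The arithmetic above can be sanity-checked with a few lines of Python (the value 5 mirrors the nodeStatusUpdateRetry constant from the linked kubelet source; everything else is just the flag defaults):

```python
# Timing of kubelet status posts vs. the controller manager's grace period.
node_status_update_frequency = 10  # seconds (--node-status-update-frequency)
node_monitor_grace_period = 40     # seconds (--node-monitor-grace-period)
node_status_update_retry = 5       # N, the nodeStatusUpdateRetry constant

# The relation as documented does not hold with the default values:
print(node_status_update_frequency * node_status_update_retry
      == node_monitor_grace_period)        # False: 10 * 5 = 50, not 40

# Counting the first post at second zero, 5 attempts span (N - 1) intervals:
attempt_times = [i * node_status_update_frequency
                 for i in range(node_status_update_retry)]
print(attempt_times)                       # [0, 10, 20, 30, 40]
print(attempt_times[-1] == node_monitor_grace_period)  # True
```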

5- Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s

This is a very important timeout; by default it's 5m, which in my opinion is too high, because even though the node is already marked as unhealthy the kube controller manager won't remove the pods, so they will still be reachable through their service and requests to them will fail.

6- Kube-proxy has a watcher on the API, so the very moment the pods are evicted kube-proxy will notice and update the iptables rules of the node, removing the endpoints from the services so the failing pods won't be accessible anymore.
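You can observe this whole sequence from the outside by watching the relevant objects while a node goes down; a minimal sketch, assuming a service called my-service in the default namespace (both names are placeholders for your own objects):

```shell
# Watch node readiness transitions (Ready -> NotReady) in one terminal:
kubectl get nodes --watch

# Watch the endpoints of a service in another terminal; the addresses of
# pods on the dead node disappear once the pods are evicted.
kubectl get endpoints my-service --namespace default --watch
```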

These values can be tweaked so you get fewer failed requests when a node goes down.

I've set these values in my cluster:

kubelet: node-status-update-frequency=4s (from 10s)

controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
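Where exactly these flags go depends on how your cluster was deployed; as a sketch, for a kubeadm-style setup it could look like this (the file locations and the KUBELET_EXTRA_ARGS variable are assumptions, adjust to your own deployment):

```shell
# kubelet: e.g. appended to the kubelet service arguments
# (on kubeadm deployments, often /etc/default/kubelet or a systemd drop-in)
KUBELET_EXTRA_ARGS="--node-status-update-frequency=4s"

# kube-controller-manager: e.g. added to the command in its static pod
# manifest (often /etc/kubernetes/manifests/kube-controller-manager.yaml):
#   --node-monitor-period=2s
#   --node-monitor-grace-period=16s
#   --pod-eviction-timeout=30s
```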

The results are quite good: we've moved from a node-down detection time of 5m40s to 46s.
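Those figures can be reproduced with simple arithmetic; a rough sketch, under the simplifying assumption that detection time ≈ node-monitor-grace-period + pod-eviction-timeout (the node-monitor-period only adds a few seconds of jitter on top):

```python
# Approximate worst case from node death to pod eviction:
# grace period (node marked unhealthy) + pod eviction timeout.
def detection_seconds(grace_period_s, pod_eviction_timeout_s):
    return grace_period_s + pod_eviction_timeout_s

default = detection_seconds(40, 5 * 60)   # defaults: 40s grace + 5m eviction
tuned = detection_seconds(16, 30)         # tuned:    16s grace + 30s eviction

print(default)  # 340 seconds, i.e. 5m40s, matching the figure above
print(tuned)    # 46 seconds
```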

 


14 thoughts on “Improving Kubernetes reliability: quicker detection of a Node down”

    • Actually, you won’t, but only because you are lucky. If all the kubelets are down, the scheduler doesn’t have any place to move the Pods to, so the Pods will keep running even though the kubelets are down; all nodes will be marked as NotReady, but the Pods will still be running.

      However, this is a situation you don’t want, and it should never happen given the HA that Kubernetes provides.

  1. You are right regarding pod eviction, but once nodes are marked as not ready, kube-proxy will remove all the endpoints.

  2. Assuming the unhealthy node loses connectivity to the apiserver (network partition), how long will the kubelet on that worker wait before stopping the pods it’s running?

    • I’m not sure about the answer to this; I would need to run some tests.
      But what I know is that if this happens, from Kubernetes you will probably see the Pods in Unknown state, while the Pods on the unhealthy node will remain alive.
      I don’t know for how long; right now I’m not sure how fast the kubelet realizes it can’t reach the apiserver.

    • That depends on your deployment, but you need to set those values in the kubelet and controller manager configuration. Where are those files? It depends on your type of deployment.

      Take into account that this post is from 2 years ago and many things have changed in Kubernetes since then, so maybe these parameters and values don’t apply anymore.

  3. I have a container which is listening on a RabbitMQ bus and forwards the status of the cluster to a UI. How can I listen for a node failure event so I can pass the message to RabbitMQ when a node fails? The other option I have is to do regular polling to get the status using kubectl get nodes.

  4. Normally that should be done through a monitoring platform that monitors the Kubernetes cluster from the outside and tells you, for example, when Pods or Nodes are failing.
    However, if you want to check whether a node is healthy from within Kubernetes, you can use this: https://kubernetes.io/docs/tasks/debug-application-cluster/monitor-node-health/

    And from your container you can run HTTP queries against the API server, which will tell you the state of basically everything, including the Node state.
