Heads up! Running mongos in containers

Mongos is the MongoDB query router (https://docs.mongodb.com/manual/reference/program/mongos/), and the general recommendation is to run a mongos process locally, alongside the service that will use it.

For traditional applications that’s totally fine, but when you run mongos in containers you need to be aware of two things:

1- The dynamic nature of containers makes the usage of mongos a bit inefficient

2- Mongos is not cgroups-aware

Let me go into detail:


Pykube now supports Google Cloud Platform clusters (OAuth2)

I’ve been contributing to the Pykube project (https://github.com/kelproject/pykube/) recently, adding support for Google Cloud Platform clusters.

Kubernetes supports multiple authentication methods, and Pykube already supported Bearer token, Basic Auth, and X.509 client certificates.

For our use case, where we manage four Kubernetes clusters (two bare-metal in our datacenter and two in GCP) and want to automate them all, we needed this feature.

GCP uses Bearer tokens for authentication, but those tokens are generated by Google and expire after one hour. So getting the token with kubectl and then using Pykube with Bearer token auth was not enough, because of that expiration.

Instead, Pykube now supports full OAuth2 authentication: it fetches the token from GCP if it is not set or has expired, the same way kubectl does.

Both user and service GCP accounts work with this library, but you need to set up your gcloud credentials to make it work.
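To illustrate the idea, here is a simplified sketch of that refresh logic in plain Python. This is not Pykube’s actual code; `fetch_fresh_token` is a hypothetical stand-in for the real call to Google’s credentials endpoint, and `config` stands in for the kubeconfig user entry:

```python
from datetime import datetime, timedelta, timezone

def get_token(config, fetch_fresh_token):
    """Return a valid bearer token, refreshing it when needed.

    Reuse the cached token unless it is missing or expired;
    otherwise ask GCP for a new one and cache it, the same way
    kubectl keeps its token fresh.
    """
    token = config.get("access-token")
    expiry = config.get("expiry")
    now = datetime.now(timezone.utc)
    if token is None or expiry is None or expiry <= now:
        token, expiry = fetch_fresh_token()
        config["access-token"] = token
        config["expiry"] = expiry
    return token

# Usage: a fake fetcher standing in for Google's token endpoint.
def fake_fetch():
    return "new-token", datetime.now(timezone.utc) + timedelta(hours=1)

config = {"access-token": "old-token",
          "expiry": datetime.now(timezone.utc) - timedelta(minutes=5)}
print(get_token(config, fake_fetch))  # token expired -> "new-token"
```

The point is simply that the expiry check happens on every request, so a token that Google invalidated after an hour gets replaced transparently.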

How to set it up?


Improving Kubernetes reliability: quicker detection of a Node down

In your Kubernetes cluster a node can die or reboot.

Tools like Kubernetes are designed to be highly available, robust, and able to recover automatically in such scenarios, and Kubernetes accomplishes this very well.

But you might notice that when a node goes down, the pods of the broken node keep running for some time and still receive requests, and those requests will fail.

That time can be reduced, because in my opinion the default is too high. There are a bunch of parameters to tweak in the Kubelet and in the Controller Manager.

This is the workflow of what happens when a node goes down:

1- The Kubelet posts its status to the masters using --node-status-update-frequency=10s

2- A node dies

3- The kube controller manager monitors the nodes: with --node-monitor-period=5s it checks, in the masters, the node status reported by the Kubelet.

4- The kube controller manager will see that the node is unresponsive, and it has a grace period of --node-monitor-grace-period=40s before it considers the node unhealthy. This parameter must be N times node-status-update-frequency, where N is the number of retries allowed for the Kubelet to post the node status. N is a constant in the code equal to 5; check var nodeStatusUpdateRetry in https://github.com/kubernetes/kubernetes/blob/e54ebe5ebd39181685923429c573a0b9e7cd6fd6/pkg/kubelet/kubelet.go

Note that the default values don’t fulfill what the documentation says, because:

node-status-update-frequency x N != node-monitor-grace-period   (10 x 5 != 40)

But as I understand it, 5 post attempts of 10s each are done in 40s: the first one at second zero, the second at second 10, and so on, until the fifth and last one at second 40.

So the real equation would be:

node-status-update-frequency x (N-1) = node-monitor-grace-period   (10 x 4 = 40)
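The timing argument can be sanity-checked with a few lines of Python; the constants mirror the flag values described in this post:

```python
N = 5           # nodeStatusUpdateRetry constant in the Kubelet code
frequency = 10  # --node-status-update-frequency, in seconds

# The Kubelet's 5 status posts land at these offsets, the first at second 0:
post_times = [i * frequency for i in range(N)]
print(post_times)           # [0, 10, 20, 30, 40]

# The last post happens at frequency x (N - 1),
# which matches the 40s node-monitor-grace-period:
print(frequency * (N - 1))  # 40
```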

5- Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s

This is a very important timeout. By default it’s 5m, which in my opinion is too high: even though the node is already marked as unhealthy, the kube controller manager won’t remove the pods yet, so they will still be reachable through their service and requests to them will fail.

6- Kube proxy has a watcher on the API, so the very moment the pods are evicted it will notice and update the node’s iptables rules, removing the endpoints from the services so the failing pods are no longer accessible.

These values can be tweaked so you get fewer failed requests when a node goes down.

I’ve set these in my cluster:

kubelet: node-status-update-frequency=4s (from 10s)

controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)

The results are quite good: we’ve moved from a node-down detection time of 5m40s to 46s.
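Those two figures follow from adding the grace period to the eviction timeout; a quick check:

```python
def detection_time(grace_period_s, eviction_timeout_s):
    # Worst-case time from a node dying until its pods are evicted:
    # the node-monitor grace period plus the pod eviction timeout.
    return grace_period_s + eviction_timeout_s

default = detection_time(40, 5 * 60)  # defaults: 40s + 5m = 340s (5m40s)
tuned = detection_time(16, 30)        # tuned values: 16s + 30s = 46s
print(default, tuned)                 # 340 46
```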