Heads up! Running mongos in containers

Mongos is the MongoDB query router (https://docs.mongodb.com/manual/reference/program/mongos/), and all the recommendations tell you to run a mongos process locally alongside the service that is going to use it.

For traditional applications that’s totally fine, but when you run them in containers you need to be aware of two things.

1- The dynamic nature of containers makes the usage of mongos a bit inefficient

2- Mongos is not cgroups aware

Let me go into details:

Continue reading


Where to set readahead: LVM, RAID devices, device-mapper, block devices?

You want to set readahead to tune the performance of your disk reads, and you find that your server has several layers of devices: block devices, RAID devices, then LVM with device-mapper, etc.

You can set the readahead at any of these levels, so which one is the right one?

I came across this Server Fault question: https://serverfault.com/questions/418352/readahead-settings-for-lvm-device-mapper-software-raid-and-block-devices-wha

I decided to run some tests to verify what wojciechz was saying, and he is right. Let me show you:

My setup is a server with RAID10 and LVM, with a /db partition mounted on the logical volume:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid10 sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0]
3906525184 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]

# pvdisplay
--- Physical volume ---
PV Name /dev/md127
VG Name vg1
PV Size 3.64 TiB / not usable 0
Allocatable yes
PE Size 4.00 MiB
Total PE 953741
Free PE 489746
Allocated PE 463995
PV UUID KH4RjS-lgAN-2OdI-hiYQ-HuR1-naDM-nSmc5S

# mount | grep db
/dev/mapper/vg1-db on /db type ext4 (rw,noatime,nodiratime,discard,stripe=512,data=ordered)
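On a stack like this, readahead can be checked and set at every level with blockdev. A quick sketch (the device names are the ones from my setup; yours will differ):

```shell
# Readahead is reported in 512-byte sectors
blockdev --getra /dev/sdc             # one of the underlying disks
blockdev --getra /dev/md127           # the RAID10 device
blockdev --getra /dev/mapper/vg1-db   # the LVM logical volume (device-mapper)

# Or print all devices at once
blockdev --report

# Set readahead on the topmost (device-mapper) device,
# e.g. 4096 sectors = 2 MiB
blockdev --setra 4096 /dev/mapper/vg1-db
```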

Continue reading

Pykube now supports Google Cloud Platform clusters (OAuth2)

I’ve been contributing to the Pykube project (https://github.com/kelproject/pykube/) recently, adding support for Google Cloud Platform clusters.

Kubernetes supports multiple authentication methods, and Pykube already supported Bearer Token, Basic Auth and X509 client certificates.

For our use case, where we manage 4 Kubernetes clusters (2 bare-metal in our datacenter and 2 in GCP) and want to automate them all, we needed this feature to be available.

GCP uses Bearer tokens to authenticate, but those tokens are generated by Google and they expire after one hour. So getting the token with kubectl and then using Pykube with Bearer token auth was not enough, due to the token’s expiration.

Instead, Pykube now supports full OAuth2 authentication that fetches the token from GCP if it is not set or if it has expired, the same as kubectl does.

Both user and service GCP accounts work with this library, but you need to set up your gcloud credentials to make it work.

How to set it up?
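A minimal sketch of the gcloud side (the cluster name, zone and key file below are placeholders, not from the post):

```shell
# Authenticate gcloud with a user account...
gcloud auth login

# ...or with a service account (key.json is a placeholder path)
gcloud auth activate-service-account --key-file=key.json

# Write the cluster credentials to ~/.kube/config, which Pykube reads
gcloud container clusters get-credentials my-cluster --zone us-central1-a
```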

Continue reading

Improving Kubernetes reliability: quicker detection of a Node down

In your Kubernetes cluster a node can die or reboot.

Tools like Kubernetes are highly available, designed to be robust and to auto-recover in such scenarios, and Kubernetes accomplishes this very well.

But you might notice that when a node goes down, the pods of the broken node keep running for some time and they still receive requests, and those requests will fail.

That time can be reduced, because in my opinion the default is too high. There are a bunch of parameters to tweak in the Kubelet and in the Controller Manager.

This is the workflow of what happens when a node gets down:

1- The Kubelet posts its status to the masters using --node-status-update-frequency=10s

2- A node dies

3- The kube controller manager is the one monitoring the nodes: using --node-monitor-period=5s, it checks, in the masters, the node status reported by the Kubelet.

4- The kube controller manager will see the node is unresponsive and has the grace period --node-monitor-grace-period=40s until it considers the node unhealthy. This parameter must be N times node-status-update-frequency, N being the number of retries allowed for the Kubelet to post the node status. N is a constant in the code equal to 5; check the var nodeStatusUpdateRetry in https://github.com/kubernetes/kubernetes/blob/e54ebe5ebd39181685923429c573a0b9e7cd6fd6/pkg/kubelet/kubelet.go

Note that the default values don’t fulfill what the documentation says, because:

node-status-update-frequency x N != node-monitor-grace-period   (10 x 5 != 40)

But from what I can understand, 5 post attempts of 10s each are done in 40s: the first one at second zero, the second one at second 10, and so on until the fifth and last one at second 40.

So the real equation would be:

node-status-update-frequency x (N-1) = node-monitor-grace-period   (10 x 4 = 40)
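The same arithmetic as a quick sanity check (a sketch; nodeStatusUpdateRetry is the constant N from the code):

```shell
freq=10     # --node-status-update-frequency, in seconds
retries=5   # nodeStatusUpdateRetry constant in the Kubelet code
# The first post happens at second 0, so the Nth post lands at freq x (N-1)
last_post=$(( freq * (retries - 1) ))
echo "last status post at second ${last_post}"   # matches the 40s grace period
```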

More info:

https://github.com/kubernetes/kubernetes/blob/3d1b1a77e4aca2db25d465243cad753b913f39c4/pkg/controller/node/nodecontroller.go

5- Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s

This is a very important timeout. By default it’s 5m, which in my opinion is too high: although the node is already marked as unhealthy, the kube controller manager won’t remove its pods yet, so they will still be reachable through their service and requests to them will fail.

6- Kube proxy has a watcher over the API, so the very moment the pods are evicted the proxy will notice and update the iptables rules of the node, removing the endpoints from the services so the failing pods won’t be accessible anymore.
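If you want to see what kube-proxy has programmed, you can list the nat table chains it maintains (a sketch; run it on any node, the chain name prefixes are the ones kube-proxy creates):

```shell
# Service-level rules
iptables -t nat -L KUBE-SERVICES -n

# Endpoint (pod) level chains; evicted pods disappear from here
iptables -t nat -L -n | grep KUBE-SEP
```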

These values can be tweaked so you will get less failed requests if a node gets down.

I’ve set these in my cluster:

kubelet: node-status-update-frequency=4s (from 10s)

controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)

The results are quite good: we’ve moved from a node-down detection time of 5m40s to 46s.
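The back-of-the-envelope arithmetic behind those two numbers (ignoring the up-to-one-period reporting delays):

```shell
# Worst case from node death to pod eviction is roughly
# node-monitor-grace-period + pod-eviction-timeout
default=$(( 40 + 5*60 ))   # 40s grace + 5m eviction = 340s (5m40s)
tuned=$(( 16 + 30 ))       # 16s grace + 30s eviction = 46s
echo "default: ${default}s  tuned: ${tuned}s"
```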

 

Apache mpm_event module running out of slots when reload

I’m finally back, sorry for my long absence, but I moved to San Francisco a year ago to work at ThousandEyes and my life is quite busy at the moment.

So, it seems that the new Apache version 2.4 stopped considering the mpm_event module experimental and changed it to stable. But I don’t think it is as stable as it should be, at least on Ubuntu Trusty; I haven’t tried other distros.

The mpm_event module is basically an improvement of mpm_worker: it changes how requests are handled by the Apache threads, creating a main thread that listens for all of them and delegates the actual work to other threads, freeing the main thread to attend to other requests.

There is a bug, easy to reproduce, that makes your Apache server run out of slots to attend requests. If you execute an Apache reload (like the logrotate conf that the Ubuntu Apache package has…), some slots become G in the Apache scoreboard, meaning “Gracefully finishing”, which is expected; the problem is that some of them never come back as available slots.

To reproduce it, open two consoles, in the first one:

$ while true ; do service apache2 reload; done

And in the second one:

$ watch -n 1 "apache2ctl status | tail -n30"

You will need some requests hitting your server, because I suspect that the slots that hang in “Gracefully finishing” are the ones that had an open connection.

Continue reading

Speaker at Codemotion@Madrid 2013

The Codemotion conference will be held on October 18/19 in Madrid, and I will be one of the speakers, talking about the release, integration and development process at Tuenti.

Check this out:

http://codemotion.es/talk/19-october/101

In a nutshell, the talk will briefly summarize a blog post series I’m writing for our company’s developer blog about the development, integration and release workflows: how they were in the past and how they evolved to now be fast, reliable and require almost no human intervention.

These are the blog posts (3 so far; the 4th will come very soon):

The Tuenti release and development process blog post series

It’s time to announce a blog post series I’m publishing on the Tuenti developers blog.

It’s about the release and development process: I’ll try to explain how we work internally at our company, from the moment a developer starts programming until the code reaches the production servers, passing through the development environment, Jenkins, continuous integration and delivery, and some internal tools we’ve developed to automate and ease the process.

This first part of the series is just an introduction to what will come later and shows the differences between how Tuenti was in the past (approximately 4 years ago) and how it is now.

In a nutshell, before, there were many manual and error-prone tasks that ended up causing bugs on the site; now everything is fast, reliable and automatic, with no manual intervention at all.

http://corporate.tuenti.com/es/dev/blog/the-tuenti-release-and-development-process-once-upon-a-time

I will announce the forthcoming posts here.

Enjoy!