Solving MongoDB connection losses on Windows Azure
Focusmatic uses Windows Azure Cloud Virtual Machines (VMs) to host its test and production infrastructure.
All in all we are pretty happy with it, for all the usual reasons cloud services are good: resource elasticity, service isolation, unlimited storage, data resilience, quick deployments, etc. On top of these, Azure has a really sharp and well-thought-out management interface.
We deploy each of our services in a dedicated, right-sized VM. This avoids nasty competition for local resources, which leads to poor individual performance or even to random, unexplained crashes.
Nevertheless, these services have to communicate through various remote channels. And by default, VMs don’t see each other through the datacenter’s network. They can be reached only by their public DNS names: <VM name>.cloudapp.net. All remote calls therefore go out to the outside world, come back into the Azure network and face a gatekeeper: the Load Balancer.
Azure’s load balancer manages communication between your Windows Azure application, which runs in a specific data center, and the external internet.
That’s where we ran into some pretty nasty trouble with MongoDB. Services relying on it started to lose connections for unexplained reasons:
- In Java, we got the following driver exception: com.mongodb.MongoException$Network: can’t call something
- In Node.js, it looked like the DB disappeared from time to time
However, when we restarted the services everything was always back to normal.
After some investigation we finally found out it was because the load balancer has a one-minute idle timeout on connections. Every time a connection stays idle for more than one minute, the load balancer automatically disconnects it.
We found two ways to solve this issue:
- You can move all your machines into an Azure Virtual Network: the machines then reach each other through internal IP addresses, so all remote calls stay inside the datacenter. It also has the advantage of no longer exposing public endpoints. If you want to learn more you can check this article.
- If you want to keep the same network configuration and your MongoDB instance runs on Linux, you can set the tcp_keepalive_time parameter to something below 60 seconds, for example as root: `echo 45 > /proc/sys/net/ipv4/tcp_keepalive_time`. You can also check the following link.
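For the second option, it can help to check what a machine is currently using before changing anything (the Linux default is 7200 seconds, far above the timeout). Here is a minimal sketch, assuming a Linux host. The names `LB_IDLE_TIMEOUT` and `keepalive_below_timeout` are ours for illustration, and the 60-second value is the timeout we observed, not something read from Azure.

```shell
#!/bin/sh
# Idle timeout we observed on the load balancer, in seconds (assumption from our tests).
LB_IDLE_TIMEOUT=60
KEEPALIVE_FILE=/proc/sys/net/ipv4/tcp_keepalive_time

# keepalive_below_timeout SECONDS
# Succeeds when the first keepalive probe would be sent before the
# load balancer considers the connection idle.
keepalive_below_timeout() {
    [ "$1" -lt "$LB_IDLE_TIMEOUT" ]
}

if [ -r "$KEEPALIVE_FILE" ]; then
    current=$(cat "$KEEPALIVE_FILE")
    if keepalive_below_timeout "$current"; then
        echo "tcp_keepalive_time is ${current}s: below the ${LB_IDLE_TIMEOUT}s timeout"
    else
        echo "tcp_keepalive_time is ${current}s: idle connections may be dropped"
        # Immediate fix, lost at reboot (run as root):
        echo "  sysctl -w net.ipv4.tcp_keepalive_time=45"
        # To survive a reboot, add this line to /etc/sysctl.conf:
        echo "  net.ipv4.tcp_keepalive_time = 45"
    fi
else
    echo "$KEEPALIVE_FILE not readable: this check only applies to Linux"
fi
```

Keep in mind that the kernel value only applies to sockets that have TCP keepalive enabled, and that a running process may need a restart before its new connections pick up the change.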
If you have any questions, feel free to contact the Focusmatic dev team at email@example.com.