Nasty ballooning kills your SQL performance

Nasty ballooning kills your SQL performance

This post might sound a bit like a game of Cluedo, but it happened on a production SQL Server…

But now you want to prove it
This post might sound a bit like a game of Cluedo, but it happened on a production SQL Server…
The issue showed itself on a hosted production VM with 64GB of Ram, where SQL could allocate memory from 32GB to 55GB. The agreements with the hosting partner clearly mentioned that while the CPU could be overcommitted, memory couldn’t.
All my alerts suddenly started to show that the machine was using all the available and became completely unresponsive (I couldn’t even open the Event Viewer to unload the logs).
Remote connection with management studio gave me this worrisome error:

140711b

But ping responses proved to be ok (less than 3 ms) so I didn’t suspect memory, page life expectancy gave a hint that something was certainly not OK

The Microsoft Performance dashboard showed worrying sings, so I RDP’d onto the server, and saw the horror of any SQL DBA this SQL Server was enduring Swapping Hell… whilst it should never even have to use its swap file… Ok something or someone murdered my SQL Server… let’s start the CSI investigation…

140711c

In the Recource Monitor we see that SQL is “only” getting ±20GB and is struggling to run. Swap file is overloaded and the server is running like a snail with arthritis.

140711d

So I suspected VMWare ballooning was the cruel pit here, but as this was especially agreed upon that this machine wouldn’t use memory ballooning I had to prove that it was happening Unfortunately as this is a hosted environment I could not check the ESX console to see the ballooning parameters, so I had to look for another way to find the smoking gun.
Luckily I found good documentation on the VMWare site explaining that there is a command file in the VMWare tools folder that just could to the trick, so I tried this on our murder(ed) victim and Hurray!  It gave me the proof I needed

140711e

So the murder weapon was known… VM Ware’s ballooning was to blame…

Cluedo summary
– Murder victim: Production Server
– Killed in VMWare ESX environment
– By the VM Ware ballooning service

Now I can go to the hosting partner and ask them to fix this, could become interesting because in the contracts, we especially agreed on not turning on ballooning on production SQL Servers 😉
On a side note, normally VMWare should have released memory when the (SQL) Server started to ask more resources, but somehow it didn’t, so I also send them the following link to a known issue where the Balloon driver does not releases its memory when using default parameters:
Balloon driver retains hold on memory causing virtual machine guest operating system performance issues (1003470)
I’m interested how this is going to go on 😉

On the other hand, the VMWareToolBoxCMD stat seems to be extremely useful, in many situations, so I pasted the most interesting ones from the complete list here: