esx
Our VMware ESX server does a great job for us.
Running on IBM X3650 hardware with 24 GB of RAM and 2×4 cores, it can run up to 25 virtual machines simultaneously, each configured with roughly 1.5 GB of RAM.
After reaching the 25 running VMs mark, we started noticing increasing sluggishness whenever additional VMs were powered on.
Of course, we did the trivial stuff: making sure that all screen savers are disabled, that antivirus agents are not scheduled to run at the same point in time, and that all of the VMs are running the latest VMware Tools agent.
It was time to dig deeper and find out where the bottleneck is.
Someone told me that the reliability of the performance indicators shown by the graphical VI console is questionable, and that it's recommended to use the terminal utilities instead. So, I SSHed into the service console VM and ran the top utility. I immediately understood that what I was actually doing was surveying the service console VM's processes, rather than the overall ESX hypervisor activity. A quick dig made me realize that the hypervisor is visible through the esxtop command, which is also executed from within the service console VM.
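For the record, here is roughly what that looks like from the shell (the hostname is just a placeholder for our ESX box):

    # SSH into the service console (hostname is a placeholder)
    ssh root@our-esx-host

    # top only surveys the service console VM's own processes...
    top

    # ...while esxtop exposes the hypervisor's view of CPU, memory, disk and network
    esxtop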
Even for those of you who know your way around the output of top and Linux's sysstat package, the data shown by esxtop is rather cryptic.
This great esxtop tutorial was a big help in understanding the esxtop output.
I started more than 30 machines to reproduce the problem, and quickly went through the list of usual suspects: CPU, memory and I/O (the esxtop views I used are summarized right after this list):
  • CPU
    I verified that it's not a CPU problem, since the "CPU load average" was around 0.2, and PCPU utilization was much the same.
  • Memory
    Then I switched to the memory display and verified that it's not a physical memory issue: the machine was in the "high" state, which is a good sign, and there were almost 17 GB of ursvd (unreserved) memory in the VMKMEM/MB line.
    SWAP (~3 GB) seemed OK.
    VMware's ballooning and memory sharing work wonders in broad daylight.
  • I/O
    I didn't see any queues forming, and read/write rates seemed pretty low.
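For reference, this is a sketch of the esxtop views I cycled through for the checks above; the single-key view switches are standard esxtop screens, and the capture filename is just an example:

    # Inside esxtop, single-key commands switch between resource views:
    #   c - CPU view:    "CPU load average" at the top, per-core PCPU utilization
    #   m - memory view: state (high/soft/hard/low), VMKMEM/MB line with ursvd,
    #                    SWAP /MB totals, MEMCTL (ballooning) and PSHARE (sharing) lines
    #   d - disk view:   per-adapter queue stats and read/write rates

    # A batch capture keeps a record of the same counters for later inspection
    # (-b = batch mode, -d = delay between samples in seconds, -n = sample count);
    # the resulting CSV can be opened in a spreadsheet or Windows perfmon:
    esxtop -b -d 10 -n 30 > /tmp/esxtop-usual-suspects.csv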
So, the 25 VMs performance limit will remain a mystery until I have proper time to analyze it more thoroughly, or better yet, until I find someone from IT to do it for me.