In a previous posting I provided links to references for enabling the Lock Pages in Memory option on 64-bit Windows servers. I recently learned this is not compatible with servers running in a virtualized environment. After we enabled it without problems on two servers that use physical hardware, our server admin agreed that we should probably implement it on all of our 64-bit SQL Servers, following the Microsoft recommended practice, so we made the change at the end of last week.
Everything was OK until Sunday just before noon, when I received a series of alerts on my Treo that connection tests to one of our primary SQL Servers were failing. Though it is not recommended or supported, we run SQL Server on VMware 3.5 for disaster recovery, and we had never had trouble with it. This server happened to be one of the virtual servers, so I logged into work and found that the server was completely dead: no ping response, no RDP access. I jumped onto the VM client console and found it blue screened. Not good, but at least I could restart it and start looking for what might have caused this. However, there were no logs, no dumps, nothing.
Unfortunately, I have learned that sometimes in IT the unexplainable happens and then doesn't recur. A phone call to our server admin and some digging by him didn't turn anything up either, so we were stuck with no answer, left to wait and see if it happened again. The performance charts for the server didn't hold any clue to what could have occurred. It was moving along like normal, then dead.
Monday morning the server was online and running just fine. Somewhere around 10:30am one of the other server administrators did a VMotion of the server to move it from one host to another. This has been standard practice for the server team for a few years, to do load balancing or maintenance on the virtual hosts, and there had never been problems with it before. Generally the memory cache flushes out and has to re-grow when they do this, so it is not often that they move a SQL Server from one host to another. This time, however, the server immediately lost all but 1/4 of its normal memory, and it would not recover. Instead it began paging everything to the virtual swap memory on disk, until it finally crashed completely. Since I monitor for high memory pressure, I never received an alert about the problem, something I have since corrected, and the server limped along for just over 2 hours before it died completely. Interestingly enough, not one complaint about poor performance was received during this time span.
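The monitoring gap described above, where a sudden loss of memory raised no alert, could be closed with a check for memory falling well below its normal level. The Python below is only a minimal sketch of that idea, not the check actually put in place: the function names, the sample values, and the 50% threshold are all assumptions made for illustration, and how the memory figures are collected from the server is left out.

```python
from statistics import median


def baseline_from_samples(samples_mb):
    """Use the median of recent samples as the 'normal' memory level (hypothetical choice)."""
    return median(samples_mb)


def memory_drop_alert(baseline_mb, current_mb, min_fraction=0.5):
    """Return True when current memory has fallen below min_fraction of the baseline."""
    if baseline_mb <= 0:
        raise ValueError("baseline_mb must be positive")
    return current_mb < baseline_mb * min_fraction


# Made-up sample values in MB, standing in for readings taken during normal operation.
samples = [8100, 8192, 8150, 8200, 8175]
baseline = baseline_from_samples(samples)

# A reading at roughly a quarter of normal trips the alert; a near-normal reading does not.
print(memory_drop_alert(baseline, 2048))
print(memory_drop_alert(baseline, 8000))
```

A fractional threshold is used rather than a fixed number so the same check can run against servers of different sizes. The right fraction would depend on how much the cache normally shrinks after a routine move.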
The first thing we did was disable the last change made, which was locking pages in memory, and restart the server. Then it was a waiting game to see if the problem repeated before we could get a scheduled outage at night to take the server offline and really look it over. Later that night we couldn't repeat the problem by moving the server, so I guess problem solved? To find out for sure, we decided to move a developer server with the option set and see what happened. Since there had been a delay between the move and the crash on a high-load server, and the developer servers have little load on them, I wrote a process to generate load against the server and enabled it. It took just over 16 hours, but the server crashed just like the production server.
If you run SQL Server on VMware 3.5 and you use VMotion to move servers, do not enable Lock Pages in Memory.