Here at Basefarm we run a large number of 24×7 business critical systems for our customers. So when one of our windows customers recently encountered an unexplained SQL Server crash on one such system, it was something that got our highest attention.
In my long experience with SQL Server (I’ve been working with it since 1997) it’s fairly rare to get a real crash, especially in the more recent versions. People often refer to crashes or incidents or outages, but it’s very rare that the actual windows process terminates unexpectedly. A classic example of this confusion is an unexpected cluster failover. Such an event may manifest itself as an outage, and obviously the process itself has stopped on one node and started on another, but in the majority of cases this is because something has happened within the database engine which makes the cluster resource monitor implement a failover. (more of that another time maybe) But in this case, this was not the scenario, the sqlservr.exe process just died suddenly with no warning.
In this case the only evidence and symptoms we had were 3 files in the log directory of the SQL instance. (Our monitoring detected no performance problems leading up to the crash for example.) If you get unexplained behavior in SQL Server it’s always a good idea to check out the log directory for these files. Here are the one we had:
These are the standard output from SQL Server in the later versions where critical errors are encountered. The files represent a memory dump (.mdmp), a textual representation of the memory dump (.txt), and a small extract from the SQL Server log file at the time of the crash (.log).
The number suffix is just a a unique naming convention in case SQL needs to create multiple dumps. In this case we had the pleasant view that this was the first dump file this SQL instance had ever created (0001) which in itself is a good sign of reliability. Over the years I’ve seen machines where they ran out of disk space because thousands of these files were created, and no-one ever noticed or bothered to look at them, so if you’re not checking your SQL machines for such evidence now, you should be!
The best and quickest thing to do here is to open the 2 text files and see what you can see, for example are there any interesting errors in the error log leading up to the crash. In this case though we had nothing, the error log was completely clean, only successful backup messages were being logged, so no clues there.
Then we open the text representation of the dump and we see this
This file is generated by Microsoft SQL Server
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.
Computer type is Intel(R) Xeon(R) CPU E5430 @ 2.66GHz.
Bios Version is HP - 2
DL 380 G5 v1.4 6-fan Woodcrest
8 X64 level 8664, 14 Mhz processor (s).
Windows NT 6.0 Build 6002 CSD Service Pack 2.
MemoryLoad = 89%
Total Physical = 49149 MB
Available Physical = 5092 MB
Total Page File = 98509 MB
Available Page File = 54094 MB
Total Virtual = 8388607 MB
Available Virtual = 8338574 MB
**Dump thread - spid = 0, EC = 0x0000000000000000
***Stack Dump being sent to XXXXXXXXXXXXXXXXXXXX\MSSQL\LOG\SQLDump0001.txt
* BEGIN STACK DUMP:
* 03/21/11 12:55:22 spid 2308
* ex_terminator - Last chance exception handling
I’ve highlighted a couple of the key points here. The first is the exact build number you’re running, the second is that a memory usage summary is printed out. This isn’t going to allow you to do any detailed analysis but at least proves that there is physical and virtual memory available on the box.
The third is the key point which is why this dump was generated. In this case it’s
“ex_terminator – Last chance exception handling“
This as I said before is a very uncommon error and pretty much unhelpful to us, as all it tells us is that SQL has written this dump out as part of its last chance exception handling when terminating. Often when you read these text representations of memory dumps, you actually see a useful message here, and sometimes even the SQL code which was running which caused the exception. An example of the better type of error you can get get is here on my old MSDN blog:
In this case we’re out of luck though, so our only option is to send the dump to Microsoft for analysis, as it says in the top of text file.
The bottom line is that there is a fix for this particular problem in CU3 for SQL 2008 SP2. (http://support.microsoft.com/kb/2498535) Applying this patch ought to stop the problem occurring in future. If you have a Microsoft support contract and are experiencing this problem though, I would strongly recommend that you get them to analyse the dump file for you to make sure it’s the exact same problem though, as it could theoretically be something else.
As you might have gauged from the above MSDN blog, you can often do rudimentary reviews of SQL Server dump yourself with a debugger if you want to know what’s in them. There’s a limit to how far you can take this due to the fact that SQL Server only releases public symbols on the Microsoft symbol server. Normally in scenarios such as this I would decode the dump myself and get out a thread stack meaning that people reading this post could check that they have the exact same problem. However unluckily in this case the correct public symbols haven’t been released onto the Microsoft symbol server so I was unable to get the stack out.
We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.