Configuring Windows Server 2008 R2 Features

At Basefarm we frequently need to ensure that many Windows servers are identical in terms of the roles and features they have installed. Adding features can be done in a number of ways: most often through the graphical user interface (Server Manager), or with System Center or similar tools for large operations. I will show you how this can be done more easily from the command line. This method doesn't require anything beyond Windows Server 2008 R2 (or later) and PowerShell.

The Server Manager module

The Server Manager module (introduced with Windows Server 2008 R2) has three very useful commands:

  • Add-WindowsFeature
  • Get-WindowsFeature
  • Remove-WindowsFeature

Using these is simple. Start a PowerShell session with administrative privileges (Run As…). Then check that the Servermanager module is available on your server:

PS C:\> Get-Module -ListAvailable

This shows that the Server Manager module is available on our server but that it is not yet loaded into the PowerShell session. To load it (and make its commands available):

PS C:\> Import-Module Servermanager

Now the commands of the Server Manager module are available to you. Check which commands are exposed by the module:

PS C:\> Get-Command -Module Servermanager

Ok, we’re all set. Let’s use these commands!

HOWTO: Document what is installed

To see what is installed on a server use:

PS C:\> Get-WindowsFeature

Oops, that's a lot of text flying by on the screen! As you can probably guess, only the lines marked with [X] are installed, so we need to filter the list to show only what is actually installed. Try this instead:

PS C:\> Get-WindowsFeature | ? { $_.Installed }

A nice clean list showing which features are installed on the server ;-) and perfect for documenting your server(s).

HOWTO: Clone installed features to another server

As shown above it's easy to list what is installed. But just having this list on the screen doesn't help much; we need to store it in a structured way so that we can use it on another server to install the same features. PowerShell makes this very simple. We use the Export-Clixml cmdlet to save the information in a structured XML file:

PS C:\> Get-WindowsFeature | ? { $_.Installed } | Export-Clixml .\features.xml

The output from the Get-WindowsFeature cmdlet is saved in a structured way in the XML file features.xml. This file can now be shared with other servers and used as input for the Add-WindowsFeature cmdlet!

HOWTO: Add features from another server (using XML file)

Start PowerShell with administrative privileges.  Now try this:

PS C:\> Import-Module Servermanager
PS C:\> Import-Clixml .\features.xml

Now you have the same list of installed features on the new server. But… this is simply a list in memory and on screen. The features haven’t been added yet. In order to do that we need to pipe the information into the Add-WindowsFeature cmdlet.

Before I show you how to do that there is one important thing I need to explain. When we exported the list of installed features we included all features that were marked as installed. As you saw in the output this resulted in a tree like structure where “[X] Web Server (IIS)” was on the top followed by “[X] Web Server” and so on.

That looks fine, but if we use this as input for the Add-WindowsFeature cmdlet we will end up with more than we asked for. The reason is that when a top-level feature such as “Web Server (IIS)” is chosen, everything underneath it will also be installed. And in order to keep our servers as lean as possible we do not want this! We need to go back and filter the output of Get-WindowsFeature a little more. Try this instead of what I showed you earlier:

PS C:\> Get-WindowsFeature | ? {$_.Installed -AND $_.SubFeatures.Count -eq 0 }

Now the output only contains the leaf features, so to speak: features with no sub-features of their own. This works fine as input for the next server we want to make identical. Save the new list to a file:

PS C:\> Get-WindowsFeature | ? {$_.Installed -AND $_.SubFeatures.Count -eq 0 } | Export-Clixml .\features.xml

Now we can finally install these features in the new server:

PS C:\> Import-Clixml .\features.xml | Add-WindowsFeature

Et voilà! The two servers now have the same Windows features installed.

As always with PowerShell, if your environment has PowerShell remoting enabled these commands can be executed on any number of servers from a single command line. A Power(full)Shell that is!
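As a rough illustration of that idea (a sketch only: the server names below are placeholders, PowerShell remoting must already be enabled, and you should test against a non-production server first), pushing the exported feature list out to several servers at once could look something like this:

# Sketch only: server names are placeholders and remoting must already be enabled.
# Read the exported feature names once, locally, then install them on each remote server.
$featureNames = Import-Clixml .\features.xml | ForEach-Object { $_.Name }
$servers = 'SRV01', 'SRV02', 'SRV03'

Invoke-Command -ComputerName $servers -ScriptBlock {
    param($names)
    Import-Module Servermanager
    Add-WindowsFeature -Name $names
} -ArgumentList (,$featureNames)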

Summary

This became a longer post than I intended, simply because I wanted to explain the details of filtering the export. Here's a quick summary of the commands you use to export what is installed:

PS C:\> Import-Module Servermanager
PS C:\> Get-WindowsFeature | ? {$_.Installed -AND $_.SubFeatures.Count -eq 0 } | Export-Clixml .\filename.xml

Copy the file ‘filename.xml’ to a network share or other location where the next server can reach it, then do this on the other server:

PS C:\> Import-Module Servermanager
PS C:\> Import-Clixml .\filename.xml | Add-WindowsFeature

All features are installed on the new server without having to click around in the graphical Server Manager! To quickly verify what is installed, use:

PS C:\> Get-WindowsFeature | ? { $_.Installed }

I hope I have shown you that PowerShell is much better than giving your arms RSI by handling feature installations with the mouse!

SQL Server setup fails due to partitioned network warnings from cluster service

I was building a new SQL Server 2008 R2 failover cluster recently and encountered a problem that I hadn't seen before (which is rare, as I've seen A LOT of cluster setup problems in my time!). This time it was strange as it was an error before setup actually ran; it occurred while I was going through the dialogue boxes to configure setup.

The scenario was this:

1. The cluster was fully built and validated at a Windows level, and all resources were up and OK
2. I was about to run SQL Setup when I noticed the network binding order was wrong
3. I changed this and then decided to reboot both nodes, as I always do before a cluster setup
4. The nodes came back online OK and all resources came up as well
5. I ran setup, but when I got to the cluster network configuration dialog box there were no networks to select from, so I couldn't go forward.

My first thought was that I must have done something dumb when changing the network binding order, but checks on the network adapters showed that they were all up. I then went back through a few other things and noticed that the cause of the error was actually that the cluster service was having trouble connecting to one of the networks. There were two types of error / warning in the cluster logs and the system event logs:

Error

Cluster network ‘xxxxx’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Warning

Cluster network interface ‘xxxxx – xxxxx’ for cluster node ‘xxxxx’ on network ‘xxxxx’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

I had to engage the help of some network specialists as I couldn't get to the bottom of this on my own. The networks actually appeared up, and we could connect to them and use them independently outside of the cluster, but the cluster was convinced that they were partitioned. To cut a long story short, after checking many things we realised that one of the networks was actually a teamed network implemented using BASP virtual adapters, and this network team was not coming up fast enough after the node rebooted, before the cluster service tried to bind it in as a resource.

The fix was simple: we set the cluster service to delayed start and then everything was fine. We didn't need to make any configuration changes beyond this. Once the cluster service was happy that the network was OK, SQL Server setup was able to continue just fine.
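If you prefer to script that change rather than clicking through the Services console, something along these lines should do it (a sketch only; ClusSvc is the usual service name for the cluster service, and note the required space after start=):

REM set the cluster service to delayed automatic start
sc.exe config ClusSvc start= delayed-auto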

Good luck with your cluster builds!

Tracing select statements on specific objects in SQL Server without using Profiler

A developer of an application asked me an interesting question the other day. He wanted to know (for reasons not worth going into here) whether his application issued select statements against a specific table in his database. The database was in production and under heavy load, so although we could have run a server-side SQL Profiler trace and then read through the results, that could be a time-consuming process generating an extremely large amount of data (and a fairly heavy load on the server). We also wanted to run this monitoring for a number of days, so we needed something more lightweight.

I thought about this for a while and realised that the best way to achieve it (assuming you are running SQL Server 2008 or later) would be through the new SQL Server Audit functionality. This uses the extended events framework as the basis for its tracing and therefore falls into the lightweight category.

Here’s an example of what I wrote, converted into simple test objects which you can try yourself. This example requires a table called dbo.audit_test to be present in a database named audit_db for you to test against.
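If you don't already have such objects, a minimal setup might look like the sketch below (the column definitions are just an assumption for testing; any table with the right name will do):

-- Optional setup for the example that follows; drop these objects again when you are done testing.
CREATE DATABASE audit_db ;
GO
USE audit_db ;
GO
CREATE TABLE dbo.audit_test
(
id INT IDENTITY(1,1) PRIMARY KEY,
some_data VARCHAR(100) NULL
) ;
GO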


USE master ;
GO
-- Create the server audit.
CREATE SERVER AUDIT test_server_audit
TO FILE ( FILEPATH =
'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA' ) ;
GO

-- Enable the server audit.
ALTER SERVER AUDIT test_server_audit
WITH (STATE = ON) ;
GO
-- Move to the target database.
USE audit_db ;
GO

-- Create the database audit specification.
CREATE DATABASE AUDIT SPECIFICATION audit_test_table
FOR SERVER AUDIT test_server_audit
ADD (SELECT, INSERT, UPDATE
ON dbo.audit_test BY dbo, guest, public, db_datareader)
WITH (STATE = ON) ;
GO

/*
do some work here
which will trigger the audit to record something
*/

/* uncomment these statements to turn off the audit at either DB or server level

-- turn off the database audit
use audit_db
go
ALTER DATABASE AUDIT SPECIFICATION audit_test_table
WITH (STATE = OFF)

use master
go
-- turn off the server audit
ALTER SERVER AUDIT test_server_audit
WITH (STATE = OFF) ;
GO

*/

Here are the key things to note about the above example:

1. This one actually traces three types of table access: SELECT, INSERT and UPDATE
2. It traces specific users and groups; change these as relevant to your environment
3. It writes the output to the default DATA directory of a default 2008 R2 install; change this as you see fit
4. You need to watch the potential file space this will take up, as it can be very verbose on big systems
5. Watching the file space used in real time will not work, as it holds most of the data in memory and flushes when you stop the trace

Once you have the output you need (and you have turned off the audit – don’t forget!) you simply run something like this to view the data (you’ll need to locate the exact file name created each time you turn the audit on or off).


SELECT COUNT(*), statement
FROM sys.fn_get_audit_file ('C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\test_server_audit_7E707DDD-03F3-4FFA-B24B-BB0DDBF4D5F3_0_129714455341990000.sqlaudit',default,default)
group by statement
GO

As you can see, the above does a simple count and aggregate of the results, but there are many columns in the output which you can write TSQL against (although since it's backed by a file, access might be slow if you have large files or slow disks!).
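For example, a slightly richer query over the same audit output (the wildcard path is a placeholder for wherever your .sqlaudit files ended up) might pull out a few of the more useful columns like this:

-- Sketch: who ran what against the audited table, and when
SELECT event_time, session_server_principal_name, database_name, object_name, statement, succeeded
FROM sys.fn_get_audit_file ('C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\*.sqlaudit', default, default)
ORDER BY event_time
GO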

I found this to be a most effective technique and it didn’t impact the performance of the server whatsoever.

You can read more about SQL Server Audit at MSDN.

SQL Server 2008 R2 setup fails due to invalid credentials

A colleague was trying to install a SQL Server 2008 R2 standalone instance today and he kept hitting a weird error when passing the credentials for the service accounts. He was running the setup as a local administrator on the server in question and he was trying to add a domain user as the service account.

The server in question was joined to the appropriate domain already and it had been checked that all the appropriate firewall ports were open. We knew that the user account he was using could query the domain structure as it was able to browse the domain to select the user account from the setup screen.

The errors we got for the two accounts were (the text is truncated on the form):

The credentials you provided for the SQL Server Agent service are invalid. To continue…

The specified credentials for the SQL Server service are not valid. To continue…

We took a methodical troubleshooting approach and tested some other domain accounts which we knew were valid and were running other SQL instances elsewhere. These failed as well, meaning that we must have been encountering either some unexpected behaviour within this setup session, or we were being blocked from talking to the domain controllers in some fashion. We again checked the firewalls and they were confirmed as OK.

Then we went into the setup logs. The summary log just showed that we had cancelled the installation. For example, you see stacks like this, which really tell you nothing:

2011-09-12 15:59:21 Slp: Exception type: Microsoft.SqlServer.Chainer.Infrastructure.CancelException
2011-09-12 15:59:21 Slp:     Message:
2011-09-12 15:59:21 Slp:         User has cancelled.

However, if you look in the detail.txt log file, located one directory lower, you can scroll up from the bottom and find the actual cause of the problem in the page validation routines. It looks like this (I've removed the timestamps for better readability and obviously blanked all the identifying information):

SQLEngine: –InputValidator: Engine : Attempting to get account sid for account DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account from sid S-1-5-21-999999999-999999999-999999999-9999
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN
Slp: Sco: Attempting to get sid for user account DOMAIN
Slp: Sco: GetSidForAccount normalized accountName DOMAIN parameter to DOMAIN
SQLEngine: –InputValidator: Engine : Service Acccount Specified, Validating Password
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to validate credentials for user account DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN
Slp: Sco: Attempting to get sid for user account DOMAIN
Slp: Sco: GetSidForAccount normalized accountName DOMAIN parameter to DOMAIN
Slp: Sco: Attempting to see if user DOMAIN\account exists
Slp: Sco.User.OpenRoot – Attempting to get root DirectoryEntry for domain/computer ‘DOMAIN’
Slp: Sco: Attempting to check if user account DOMAIN\account exists
Slp: Sco: Attempting to look up AD entry for user DOMAIN\account
Slp: Sco.User.OpenRoot – root DirectoryEntry object already opened for this computer for this object
Slp: Sco.User.LookupADEntry – Attempting to find user account DOMAIN\account
Slp: Sco: Attempting to check if container ‘WinNT://DOMAIN’ of user account exists
Slp: UserSecurity.ValidateCredentials — Exception caught and ignored, exception is Access is denied.
Slp: UserSecurity.ValidateCredentials — user validation failed

The problem is in the final lines of the log. As you can see, our account has some permissions on the domain: it successfully gets the SID and completes various other tasks. However, when it comes to the method

Attempting to check if container ‘WinNT://DOMAIN’ of user account exists

it fails… and moreover it swallows the exception, and then, to my amusement, actually records that it has swallowed the exception! To me this is really strange. I guess you could argue that the exception is security related and is therefore swallowed for security reasons, but then it records what the error is in the log file, so this seems rather unintuitive to me. The bottom line is that the account in question doesn't have the specific privileges on our domain that SQL setup wants here, so it fails and reports that the service account is invalid. In fact the service account is not invalid; the account used to look up the service account is. In my mind what should really happen here is that you should get a standard Windows AD credentials challenge, as the process has caught and handled an access denied error and could therefore present this information to the user. But hey, that's just my opinion.

At the end of the day we changed the setup to use a different account with higher privileges (by running setup as a different logged-on user) and everything worked just fine. The key here is that the error is misleading: it's the interactive account under which you are running setup that has the problem, not the service account you're trying to add.

SQL Server 2000 cannot start after windows update reboot

Last night we had an incident with one of our customers' old SQL Server 2000 instances. The machine in question had had its WSUS Windows Update run last night and had been forced to reboot afterwards. After this happened the SQL Server service refused to start and just got stuck in a cycle of permanent restarts. When we looked at the error logs we saw the following repeated symptom.

2011-09-12 11:33:00.59 server    Microsoft SQL Server  2000 – 8.00.2039 (Intel X86)
May  3 2005 23:18:38
Copyright (c) 1988-2003 Microsoft Corporation
Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

2011-09-12 11:33:00.59 server    Copyright (C) 1988-2002 Microsoft Corporation.
2011-09-12 11:33:00.59 server    All rights reserved.
2011-09-12 11:33:00.59 server    Server Process ID is 1284.
2011-09-12 11:33:00.59 server    Logging SQL Server messages in file ‘C:\Program Files (x86)\Microsoft SQL Server\MSSQL\log\ERRORLOG’.
2011-09-12 11:33:00.59 server    SQL Server is starting at priority class ‘normal'(4 CPUs detected).
2011-09-12 11:33:00.60 server    SQL Server configured for thread mode processing.
2011-09-12 11:33:00.62 server    Using dynamic lock allocation. [2500] Lock Blocks, [5000] Lock Owner Blocks.
2011-09-12 11:33:00.68 server    Attempting to initialize Distributed Transaction Coordinator.
2011-09-12 11:33:02.84 spid3     Starting up database ‘master’.
2011-09-12 11:33:02.85 server    Using ‘SSNETLIB.DLL’ version ‘8.0.2039’.
2011-09-12 11:33:02.85 spid5     Starting up database ‘model’.
2011-09-12 11:33:02.85 spid3     Server name is ‘XXXXXXXXXX’.
2011-09-12 11:33:02.85 spid8     Starting up database ‘msdb’.
2011-09-12 11:33:02.85 spid9     Starting up database ‘removedforsecurity’.
<SNIP for brevity>
2011-09-12 11:33:02.85 spid23    Starting up database ‘removedforsecurity’.
2011-09-12 11:33:02.85 spid5     Bypassing recovery for database ‘model’ because it is marked IN LOAD.
2011-09-12 11:33:02.85 server    SQL server listening on <removedforsecurity>: xxxx.
2011-09-12 11:33:02.85 server    SQL server listening on <removedforsecurity>: xxxx.
2011-09-12 11:33:02.87 spid20    Starting up database ‘removedforsecurity’.
2011-09-12 11:33:02.89 spid5     Database ‘model’ cannot be opened. It is in the middle of a restore.

The important line here is the last one. Model cannot be opened as it's marked IN LOAD, which in turn means that tempdb can't be created, which in turn means that the service cannot start.

Quite how model came to be marked IN LOAD we'll never know. I went back through the logs and there was nothing suspicious, and no-one had actually attempted to restore it. Circumstantially the evidence points to something being corrupted by Windows Update, as this is when the problem started, but we'll never be able to say retrospectively. The MDF and LDF files themselves were intact and seemingly OK, so it was time for a manual attempt to try and get the service back.

The way to recover this (and to troubleshoot it initially) is rather dirty. You have to start the server in single-user mode at the console and pass it a couple of trace flags to get into the system catalogues. You should only ever do this as a last resort, and this was a last resort: it was either do this, or restore the system state of the entire machine. (I couldn't restore the SQL Server database as the instance wouldn't actually start.) Starting the service in console / single-user mode is common enough for SQL machines that won't start, as you get stuck in a chicken-and-egg scenario of needing to start the instance to see why it won't start! Update the system tables at your peril, however. In later versions of SQL Server (2005 upwards) the system tables are obscured behind views in a separate protected database to make this harder to do, although you can still do it.

The following KB article has a good description of how to do this.

http://support.microsoft.com/kb/822852

The symptoms described there are the same even though our cause was different. That said, the solution is also the same, in that you have to manually update the status of the database in sysdatabases and then restart the server. There's no guarantee of success here, as it could have been that the MDF file was actually corrupted, but luckily for me it did work and I was saved from a full system restore of the server.
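For context, the kind of statements involved look roughly like the sketch below. Treat it purely as an illustration of the technique, not a recipe, and follow the KB article above for the authoritative procedure: it assumes the instance has been started from the console in single-user mode with trace flag 3608 (for example sqlservr.exe -c -m -T3608), and the status bit values are specific to SQL Server 2000.

-- Illustration only: getting this wrong can make things worse, so follow the KB article.
EXEC sp_configure 'allow updates', 1
RECONFIGURE WITH OVERRIDE
GO

-- 32 (0x20) is the 'loading' (IN LOAD) bit in sysdatabases.status on SQL Server 2000
UPDATE master.dbo.sysdatabases
SET status = status & ~32
WHERE name = 'model'
GO

EXEC sp_configure 'allow updates', 0
RECONFIGURE WITH OVERRIDE
GO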

SQL Server TSM Restore appears to hang or takes much longer than expected

I've written previously about the dangers of VLF fragmentation, but the problems I had actually experienced before were always related to log-based operations, i.e. the recovery phase after a crash, or database mirroring. Last week, however, I saw a different issue when doing a full restore from IBM Tivoli Storage Manager (TSM).

At the start I can say the same thing that I always say when writing about this subject:

Pre-grow your log files (and your data files) in advance. Avoid repeated online auto-growth if at all possible.

That said, here’s an example of what happens when you don’t, and more importantly how to fix it.

The Symptoms

You’re trying to do a full database restore via TSM (GUI or command line)

The restore takes much longer than you expected

The progress bar in the TSM GUI says that the restore is 100% complete, and yet it is still running

If you run a script similar to this one, it also says 100% complete, but the restore still runs:

SELECT r.session_id, r.command, CONVERT(NUMERIC(6,2), r.percent_complete) AS [Percent Complete],
CONVERT(VARCHAR(20), DATEADD(ms, r.estimated_completion_time, GETDATE()), 20) AS [ETA Completion Time],
CONVERT(NUMERIC(10,2), r.total_elapsed_time/1000.0/60.0) AS [Elapsed Min],
CONVERT(NUMERIC(10,2), r.estimated_completion_time/1000.0/60.0) AS [ETA Min],
CONVERT(NUMERIC(10,2), r.estimated_completion_time/1000.0/60.0/60.0) AS [ETA Hours],
CONVERT(VARCHAR(1000), (SELECT SUBSTRING(text, r.statement_start_offset/2,
CASE WHEN r.statement_end_offset = -1 THEN 1000 ELSE (r.statement_end_offset - r.statement_start_offset)/2 END)
FROM sys.dm_exec_sql_text(sql_handle)))
FROM sys.dm_exec_requests r
WHERE command IN ('RESTORE DATABASE', 'BACKUP DATABASE')

It can be several hours (or even days) in this state.

In the error log all you see are rows indicating that the restore has started

Starting up database ‘xxx’.

The explanation

It's most likely that your database log file has become logically fragmented into many virtual log files (VLFs). “Many” means different things on different systems, but more than 1000 can definitely be a problem. In the case I encountered last week it was 17000, which made a 25-minute restore take 3 hours longer than expected.

If you're unfamiliar with the principles of VLFs you should read the following:

Transaction Log Physical Architecture and also Factors That Can Delay Log Truncation.

If you want to check any of your critical databases now to see whether you have this fragmentation, you can run the following:

DBCC LOGINFO ('xxx')

This is one of those commands that is officially undocumented, but that everyone actually uses! It has been raised on the Connect site to have it moved into a DMV in the future:

https://connect.microsoft.com/SQLServer/feedback/details/322149/msft-mso-document-dbcc-loginfo-or-create-new-dmvs-to-view-vlf-info

I’ve run it innumerable times on production databases though (normally when fixing problems like this).

If the number of rows returned is greater than 1000 you might have a problem. To test whether you actually do, all you need is to attempt to restore a backup of the database (you can do this on a test server) and see if you experience an unreasonable delay. If you do, then I would recommend you try to fix it.
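If you would rather have a single number to look at than the raw DBCC output, a quick way to count the VLFs for the current database is the sketch below (the column list matches SQL Server 2008 / 2008 R2; later versions add columns, so adjust the temp table accordingly):

-- Count the VLFs in the current database's log (result set shape matches SQL 2008 / 2008 R2)
CREATE TABLE #loginfo
(
FileId INT,
FileSize BIGINT,
StartOffset BIGINT,
FSeqNo INT,
[Status] INT,
Parity TINYINT,
CreateLSN NUMERIC(25,0)
)

INSERT INTO #loginfo
EXEC ('DBCC LOGINFO')

SELECT COUNT(*) AS vlf_count FROM #loginfo

DROP TABLE #loginfo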

The solution

You need to truncate and shrink the log back to a point where the fragmentation does not occur. The great people at SQL Skills have a very full article on this here:

http://www.sqlskills.com/blogs/kimberly/post/8-Steps-to-better-Transaction-Log-throughput.aspx

The (very easy) script to fix it is right at the end, but if you’re in this position and you’ve come this far, I’d recommend that you read the above article in full to understand why you got where you are.
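For orientation only, the general shape of the fix is something like the sketch below, with placeholder database, file and size names; the SQLskills article above has the full, correct procedure and the caveats that go with it.

-- Illustration only: back up the log, shrink the log file, then re-grow it to its intended size in one step
-- (for a database in SIMPLE recovery, a CHECKPOINT takes the place of the log backup)
BACKUP LOG [MyDatabase] TO DISK = N'X:\Backups\MyDatabase_log.trn'
GO
USE [MyDatabase]
GO
DBCC SHRINKFILE (N'MyDatabase_log', 1)
GO
ALTER DATABASE [MyDatabase]
MODIFY FILE (NAME = N'MyDatabase_log', SIZE = 8GB)
GO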

Until the next time….

Graham

We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.

SQL Server deadlocked schedulers leads to unexpected cluster failover–Part 2

In Part 1, I looked at how one could identify that a large blocking chain was causing resource exhaustion in SQL Server, leading to a cluster failover. Since we were working retrospectively, we lacked the data to analyse what condition was actually causing the blocking chain to begin with. This post addresses that area of the troubleshooting.

The problem had occurred a number of times in this environment, and no-one had ever managed to log into the machine to analyse it before it failed over, so I wanted to put something in place proactively to ensure we caught the condition as it happened. For reference, had I been able to log into the machine I would simply have run the SQL 2008 Perfstats scripts as a starting point; I tend to keep these handy on my critical production machines, just in case I ever need them in a hurry. I just run the script for a couple of minutes through SSMS and you often get a good look at the problems. You can download the scripts from CodePlex, as they are held as part of the SQLNexus project.

However, in this case I had the opportunity to capture exactly what I wanted, as I was planning to set everything up in advance. That made it a perfect opportunity to use SQLDIAG to collect all the data I wanted in one simple place. If you don't know what SQLDIAG is, then read these first:

Performance Problems in SQL Server – What data should I collect?

http://msdn.microsoft.com/en-us/library/ms162833.aspx

The good thing nowadays (since the above were written) is that there are a couple of automated ways to get things going, as opposed to manually editing XML files. Personally I do still sometimes do manual edits, depending on how many customisations of my own I want to put in, but we also have the option of using the BAT files supplied in the SQLNexus project, and I've recently noticed that a SQLDIAG configuration tool has been released.

The bottom line here is that I configured a SQLDIAG to collect the following:

1. Perfstats scripts

2. Perfmon of common SQL and OS counters

3. DBCC MEMORYSTATUS (I always like to include this as it has such a low overhead and can be incredibly useful if you get unexpected internal problems like compilation bottlenecks)

4. General base information (defaults from SQLDIAG)

5. Profiler – restricted to batch level events, as this is a heavily used system.

***IMPORTANT NOTE***

You need to show extreme care when running Profiler these days on some of the more high-end SQL machines. In the past two years I've seen a couple of enterprise SharePoint systems brought down by excessive SQL Profiler traces. If you've not run Profiler on your system before and you have very high throughput, make sure you first do a test run of a server-side trace at a very high level, capturing very few events, to check that your server and disks can handle the throughput.

***IMPORTANT NOTE***

Since I didn't know when the problem would occur next, I then needed to implement a routine to manage the SQLDIAG start and stop times and the file output. The BOL documentation above lists the switches and syntax you need to do this, with examples.
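To give a flavour of what that looks like (the configuration file name, output path and times below are placeholders; check the SQLdiag documentation linked above for the exact switch syntax before relying on this), a scheduled collection can be started along these lines:

REM illustrative only: /I = config file, /O = output folder, /B = begin time, /E = end time (relative +HH:MM:SS form)
sqldiag /I SQLDiag_Custom.XML /O D:\SqlDiagOutput /B +00:05:00 /E +04:00:00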

So I put all this in place and then waited for the problem to occur again, which it promptly did within a few days.

ANALYSIS:

Anyone who analyses performance problems with any frequency will probably tell you that there's no right or wrong way to go about this, but in this case, since I was pretty certain I was looking for a blocking problem, my first port of call was the perfstats scripts output. I suspected that I would only need the other data to back up my findings, but I had it there just in case.

Perfstats output is normally my first port of call in most performance cases anyway, as my first question is:

Is there a user batch running for a long time?

The answer to this question often removes whole areas of troubleshooting possibilities, so I like to answer it first. If the answer is yes, you can look at said batches and try to work out why; if the answer is no, then you need to look at different areas such as internals. In this case, very clearly from perfstats, there was a long-running batch. I'm limited in how much output I can show here for security reasons, but here's some sample output from the [headblockersummary] section of perfstats.

(This wide output doesn't display well on our fixed-width blog, so if you want to read it, cut and paste it into a decent text editor like Notepad++.)

— headblockersummary —
runtime                        head_blocker_session_id blocked_task_count tot_wait_duration_ms blocking_resource_wait_type              avg_wait_duration_ms max_wait_duration_ms max_blocking_chain_depth
------------------------------ ----------------------- ------------------ -------------------- ---------------------------------------- -------------------- -------------------- ------------------------
2011-02-20T14:36:25.090                            661                  1                84252 LOCK BLOCKING                                           84252                84252                        1
2011-02-20T14:36:25.090                             70                  1                 2899 LOCK BLOCKING                                            2899                 2899                        1

— headblockersummary —
runtime                        head_blocker_session_id blocked_task_count tot_wait_duration_ms blocking_resource_wait_type              avg_wait_duration_ms max_wait_duration_ms max_blocking_chain_depth
------------------------------ ----------------------- ------------------ -------------------- ---------------------------------------- -------------------- -------------------- ------------------------
2011-02-20T14:36:45.507                            347                  3               105468 LOCK BLOCKING                                           35156               105059                        3
2011-02-20T14:36:45.507                             70                  2                36169 LOCK BLOCKING                                           18084                23706                        2
2011-02-20T14:36:45.507                            347                 52                 1630 PAGELATCH_* WAITS                                          31                   98                        3
2011-02-20T14:36:45.507                            391                 20                  318 PAGELATCH_* WAITS                                          15                   60                        1
2011-02-20T14:36:45.507                            393                 12                  129 PAGELATCH_* WAITS                                          10                   17                        1
2011-02-20T14:36:45.507                            377                  1                   37 PAGELATCH_* WAITS                                          37                   37                        1

— headblockersummary —
runtime                        head_blocker_session_id blocked_task_count tot_wait_duration_ms blocking_resource_wait_type              avg_wait_duration_ms max_wait_duration_ms max_blocking_chain_depth
------------------------------ ----------------------- ------------------ -------------------- ---------------------------------------- -------------------- -------------------- ------------------------
2011-02-20T14:40:36.170                             70                449             52689330 LOCK BLOCKING                                          117348               228943                        2

 

Looking at the output above, we can see that SPID 70 starts causing some blocking at 14:36:25, but it's only blocking one session and there is other blocking anyway, so it appears transitory. This code is a trigger firing, and it needs to do a large update which is going to take a few minutes (I can't show the code or explain why that is; it's just the way this application works).

Twenty seconds later, SPID 70 is still blocking, but only directly blocking two sessions. Now there are about 80 other SPIDs blocked, which are effectively behind SPID 70 deeper down the blocking chain. However, these then clear, and so the blocking still seems transitory; if we look at the web server logs over this period, the web servers are still up and the majority of OLTP throughput is being served within normal limits.

But then, 4 minutes later, we see the same SPID 70, but this time it's directly blocking 449 other sessions, and it's at this point that we start to see alerts from our monitoring of the application servers. Over the next minute this blocking chain grows larger and larger until we hit the deadlocked schedulers, and shortly after that the cluster fails over.

So at first glance we have a simple problem here: SPID 70 has a long-running piece of code which blocks all users on the system. But upon further investigation this trigger fires regularly without blocking the entire system (we saw leading up to the failure that it was running for several minutes with only transitory blocking), so why did it suddenly block everyone?

The answer here is in what is being blocked (the blockees) rather than the head blocker. Upon looking further into the details of the blocked SPIDs it turns out that they are all running the exact same code from the exact same machine, which is a BizTalk host instance in this environment. It turns out that this host instance has no control over error handling or throttling: when one session gets blocked, it just opens another, and another, and so on. Because it happened to clash with the trigger running under SPID 70, it continued to open more and more connections until eventually we reached complete resource consumption and there were no sessions left for any users whatsoever. Then we get our failover.

So the solution here was twofold: firstly we could attempt to optimise the work of the trigger which was running for a long time, to reduce the potential for concurrency issues, and secondly we could attempt to control the BizTalk host more effectively to stop it opening so many connections concurrently. Without BizTalk being involved, although we would have experienced some minor blocking, the main application OLTP throughput would still have functioned; but BizTalk stole all the server connections, and then we had our crash.

The moral of this story is that in complex production situations you may need to make multiple data collections before you can get to the root of your problem (however deep your skills and knowledge are), and as in many situations it's about knowing what diagnostics and tools to utilise, as opposed to knowing the answers off the top of your head.

Until the next time.

Graham

We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.

SQL Server deadlocked schedulers leads to unexpected cluster failover–Part 1

Here at Basefarm we offer a number of different levels of SLA and support, which can be tailored to customers' individual needs. One of the highest levels that we offer is called Basefarm Application Management. At this level we take full responsibility for the operation of the infrastructure, but also for the customer-developed applications and code. The following post is an example of some troubleshooting I did for one such customer recently. It's a typical SQL Server problem in high-end production environments and one that you need to be able to handle if you want to keep your SQL machines running 24×7. (It also has some good geek value, as it allows us to employ several advanced diagnostic routines, including a Windows debugger.) The analysis here is quite long so I've split it into two posts.

The symptoms were as follows (they all appear within a couple of minutes of each other):

1. Monitoring starts to catch warnings of potential slow performance in front end web servers and middle tier applications

2. As we're trying to log on to the machines in question, the customer reports problems with the websites based on user feedback

3. At the same time as 2, monitoring declares all web servers to be unavailable

4. We connect to the web servers and middle tier servers and they appear fine in themselves.

5. We try to connect to the SQL Server instance, but we can’t.

6. We try to connect to SQL using the DAC (dedicated administrator connection), but this fails as well because the Windows cluster has just failed over.

7. All services come up without problem on the new node

All in all, the performance problem lasted only a few minutes (less than 10), and the actual complete web site outage was probably less than two minutes, thanks to the resilience of the Windows cluster and SQL Server. However, if you're aiming for very high uptime, small outages like this will really hurt you (and I take it as a personal affront if machines I'm responsible for run slowly and fail over!).

My initial plan when dealing with this was to log into the SQL machine (gut feeling told me this was where the problem lay) and run some diagnostics to prove or disprove that hypothesis. As it was, the cluster proved the hypothesis for me by failing over before I could connect. So this is the evidence I was left to work with:

SQL memory dump in the LOG directory of the SQL instance

SQL error log

The error log showed this (certain identifying information removed for anonymity)

2011-02-20 14:41:26.75 spid1061    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.75 spid1061    The client was unable to reuse a session with SPID 1061, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.75 spid1056    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.75 spid1056    The client was unable to reuse a session with SPID 1056, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.76 spid1063    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.76 spid1063    The client was unable to reuse a session with SPID 1063, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.76 spid1066    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.76 spid1066    The client was unable to reuse a session with SPID 1066, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.89 Server      **Dump thread – spid = 0, EC = 0x0000000000000000
2011-02-20 14:41:26.89 Server      ***Stack Dump being sent to M:\MSSQL10.MSSQLSERVER\MSSQL\LOG\SQLDump0013.txt
2011-02-20 14:41:26.89 Server      * *******************************************************************************
2011-02-20 14:41:26.89 Server      *
2011-02-20 14:41:26.89 Server      * BEGIN STACK DUMP:
2011-02-20 14:41:26.89 Server      *   02/20/11 14:41:26 spid 6072
2011-02-20 14:41:26.89 Server      *
2011-02-20 14:41:26.89 Server      * Deadlocked Schedulers
2011-02-20 14:41:26.89 Server      *
2011-02-20 14:41:26.89 Server      * *******************************************************************************
2011-02-20 14:41:26.89 Server      * ——————————————————————————-
2011-02-20 14:41:26.89 Server      * Short Stack Dump
2011-02-20 14:41:26.94 Server      Stack Signature for the dump is 0x0000000000000363

There are two key points here.

The 18056 error is the first evidence in the error log of anything being wrong in SQL. I don't find its description particularly helpful, but I do recognise it as something that often occurs as machines start to lock up in the SQL engine and / or begin to shut down or fail over. In this case it's more of what we in England would call a red herring, as there aren't actually any issues here with connection pooling.

Then we see the key information: there are deadlocked schedulers and we've produced a memory dump. This is the real root problem here. It means that every SQL scheduler is unable to process any work, which means that the server is effectively dead at that point in time.

At this point the Windows cluster will detect that the SQL Server is unavailable (as it can't log in to run its health routines) and it therefore fails the service over to another node (as defined in our cluster resource failover settings).

So this is good: we know roughly which type of problem we are encountering and where it lies (within the SQL engine), but we have no idea why. I did check the SQL default trace file to ensure that no nasty DBCC events or file growths had occurred here. They hadn't, which meant that I could form a hypothesis that it was probably a standard user SQL batch causing the issue.
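For reference, checking the default trace for that kind of event can be done with a query roughly like the one below (a sketch: the default trace must be enabled, and sys.traces points at the current rollover file, so older files may need to be read separately):

-- Look for auto-growth and DBCC events in the default trace
DECLARE @path NVARCHAR(260)
SELECT @path = path FROM sys.traces WHERE is_default = 1

SELECT te.name AS event_name, t.DatabaseName, t.StartTime, t.TextData
FROM sys.fn_trace_gettable(@path, DEFAULT) AS t
JOIN sys.trace_events AS te ON t.EventClass = te.trace_event_id
WHERE te.name LIKE '%Auto Grow%' OR te.name LIKE 'Audit DBCC%'
ORDER BY t.StartTime DESC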

At this point it's time to open the memory dump files and see if we can find anything of use. As usual with SQL, we have three files: the dump itself, a textual representation of the dump and an extract of the error log. The latter two only really give us what's shown above, so we need to open the dump itself to get anything useful. At this stage, if you don't like debugging, you might like to skip to Part 2 of this post (which covers collecting more information and diagnosing the root problem), but the dump is interesting (to me at least).

There's no time here to explain how to configure a Windows debugger, but it's well documented on MSDN. The setup I used is this:

A default installation of debugging tools for windows downloaded from here:

http://www.microsoft.com/whdc/devtools/debugging/installx86.Mspx

A symbol path set exactly as per the example “getting started” page here:

http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx

i.e. the symbol path used is:

SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

I open the dump and it takes me straight to thread 4, which has this stack, indicating that this is just the deadlock checking thread.

0:004> kL
Child-SP          RetAddr           Call Site
00000000`0795d598 00000000`779ec0b0 ntdll!ZwWaitForSingleObject+0xa
00000000`0795d5a0 00000000`01c21879 kernel32!WaitForSingleObjectEx+0x9c
00000000`0795d660 00000000`01c2123b sqlservr!CDmpDump::DumpInternal+0x4d9
00000000`0795d760 00000000`025fb0f0 sqlservr!CDmpDump::Dump+0x3b
00000000`0795d7b0 00000000`026b45be sqlservr!SQLDumperLibraryInvoke+0x1a0
00000000`0795d7e0 00000000`028020f0 sqlservr!CImageHelper::DoMiniDump+0x3ce
00000000`0795d9c0 00000000`0280c0eb sqlservr!stackTrace+0x710
00000000`0795eef0 00000000`025d3867 sqlservr!SQL_SOSDeadlockSchedulersCallback+0xab
00000000`0795ef30 00000000`01f2c2df sqlservr!SOS_OS::ExecuteDeadlockSchedulerCallbacks+0xb7
00000000`0795f1f0 00000000`00ec18ae sqlservr!_chkstk+0x3f17c2
00000000`0795f380 00000000`03585579 sqlservr!SchedulerMonitor::Run+0xfe
00000000`0795f4a0 00000000`00eb163b sqlservr!SchedulerMonitor::EntryPoint+0x9
00000000`0795f4d0 00000000`00eb12fa sqlservr!SOS_Task::Param::Execute+0x11b
00000000`0795f5f0 00000000`00eb0e35 sqlservr!SOS_Scheduler::RunTask+0xca
00000000`0795f680 00000000`01430e50 sqlservr!SOS_Scheduler::ProcessTasks+0x95
00000000`0795f6f0 00000000`014309a0 sqlservr!SchedulerManager::WorkerEntryPoint+0x110
00000000`0795f7b0 00000000`01449ab0 sqlservr!SystemThread::RunWorker+0x60
00000000`0795f7e0 00000000`0143013f sqlservr!SystemThreadDispatcher::ProcessWorker+0x12c
00000000`0795f870 00000000`75a337d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x12f
00000000`0795f900 000007ff`fffad488 msvcr80!endthreadex+0x47
00000000`0795f908 00000000`00ddc320 0x7ff`fffad488
00000000`0795f910 00000000`00000000 0xddc320

It's at this point that debugging can become a little frustrating. As an ex-Microsoft engineer I'm used to having source access and private symbols, but now that I'm out in the "real" world I only have public symbols like everyone else (note for example that I have no symbols for the bottom two rows in the stack; these are just the Windows thread start functions, so not a problem here, but for some reason I can't get this customer's Windows symbols from Microsoft). This may well be fine for CLR debugging, but our options are very limited for SQL. That said, we can simply run ~* kL to check the stack of every thread in the process. Since the SQL source is written in a beautifully structured way (most of the time!), the method and function names are usually very indicative of what is happening, and with a little common sense you may be able to get a clue to what's going on.

The output of every thread is obviously too much to show here, but here’s a simple summary.

There are a total of 674 threads in this dump, of which the first 70 or so are what I would classify as system threads. This isn't an official piece of terminology; it's just that whenever you look at a SQL dump, the first threads in numerical order are always common functions present in all SQL processes, such as the deadlock checker (thread 4 in our case), resource monitor, service brokers and so on, and they can basically be ignored in this case. What's interesting is that after thread 75 there are basically 600 threads with an identical stack, which looks like this:

76  Id: 1664.2210 Suspend: 0 Teb: 000007ff`ffbcc000 Unfrozen
Child-SP          RetAddr           Call Site
00000009`a480b678 00000000`77a342b7 ntdll!ZwSignalAndWaitForSingleObject+0xa
00000009`a480b680 00000000`00eb3d65 kernel32!SignalObjectAndWait+0x117
00000009`a480b740 00000000`00ea3edc sqlservr!SOS_Scheduler::SwitchContext+0x84e
00000009`a480b9e0 00000000`00ea3a58 sqlservr!SOS_Scheduler::SuspendNonPreemptive+0xbc
00000009`a480ba20 00000000`013a667d sqlservr!EventInternal<Spinlock<149,1,0> >::Wait+0x428
00000009`a480c010 00000000`013a6557 sqlservr!LockOwner::Sleep+0xbd
00000009`a480c080 00000000`00ecb584 sqlservr!lck_lockInternal+0xe59
00000009`a480ce00 00000000`00f47e53 sqlservr!RowsetLockCache::GetTableLock+0x194
00000009`a480cf20 00000000`00f6dc81 sqlservr!MDL::FLockObjectId+0xb3
00000009`a480cfb0 00000000`00f6dbf6 sqlservr!SMD::LockObjectAndCheckOnlineVersion+0x71
00000009`a480d030 00000000`00f6d420 sqlservr!CXStmtQuery::XretSchemaChanged+0x860
00000009`a480d160 00000000`00f6d2fa sqlservr!CXStmtDML::XretDMLExecute+0x100
00000009`a480d260 00000000`00f19616 sqlservr!CXStmtDML::XretExecute+0x4a
00000009`a480d2a0 00000000`00f1abbb sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x55a
00000009`a480d550 00000000`00f1a4d9 sqlservr!CMsqlExecContext::FExecute+0x58b
00000009`a480d6d0 00000000`03021534 sqlservr!CSQLSource::Execute+0x319
00000009`a480d800 00000000`03021c4a sqlservr!CStmtExecProc::XretLocalExec+0x1b4
00000009`a480d870 00000000`03021e39 sqlservr!CStmtExecProc::XretExecExecute+0x3da
00000009`a480df20 00000000`00f19616 sqlservr!CXStmtExecProc::XretExecute+0x19
00000009`a480df60 00000000`00f1abbb sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x55a
00000009`a480e210 00000000`00f1a4d9 sqlservr!CMsqlExecContext::FExecute+0x58b
00000009`a480e390 00000000`02aae08d sqlservr!CSQLSource::Execute+0x319
00000009`a480e4c0 00000000`02aba374 sqlservr!ExecuteSql+0x72d
00000009`a480ef40 00000000`03496361 sqlservr!CSpecProc::ExecuteSpecial+0x234
00000009`a480f060 00000000`00f1d10b sqlservr!CSpecProc::Execute+0x1f1
00000009`a480f1d0 00000000`00f17b4a sqlservr!process_request+0x370
00000009`a480f490 00000000`00eb163b sqlservr!process_commands+0x1ba
00000009`a480f690 00000000`00eb12fa sqlservr!SOS_Task::Param::Execute+0x11b
00000009`a480f7b0 00000000`00eb0e35 sqlservr!SOS_Scheduler::RunTask+0xca
00000009`a480f840 00000000`01430e50 sqlservr!SOS_Scheduler::ProcessTasks+0x95
00000009`a480f8b0 00000000`014309a0 sqlservr!SchedulerManager::WorkerEntryPoint+0x110
00000009`a480f970 00000000`01449ab0 sqlservr!SystemThread::RunWorker+0x60
00000009`a480f9a0 00000000`0143013f sqlservr!SystemThreadDispatcher::ProcessWorker+0x12c
00000009`a480fa30 00000000`75a337d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x12f
00000009`a480fac0 000007ff`ffbcd488 msvcr80!endthreadex+0x47
00000009`a480fac8 00000000`2bb43140 0x7ff`ffbcd488
00000009`a480fad0 00000000`00000000 0x2bb43140

This is a rare scenario, because if you took a dump of a working SQL Server (you can do this with SQLDumper, and it's a great technique employed in the famous Ken Henderson book) you'd see all the different threads running different classes and methods, as each one is processing its own workload.

The lock-related functions near the top of the stack should be self-explanatory. This thread is processing a worker which is trying to acquire a lock. The lock can't be acquired at this time, so it goes into sleep mode. This is a very common example of how standard lock blocking looks in memory dumps. Since nearly all of the threads in the dump are in this function, what we appear to have here is a huge lock blocking chain, which eventually leads to the actual failover condition, deadlocked schedulers: no schedulers at all are available to process work, as they are all stuck waiting for this block to clear. The other function worth noting whenever you are debugging SQL Server (well, versions 2000, 2005, 2008 and 2008 R2, as this could change in the future) is [sqlservr!process_request], which is the function that all standard TSQL batch executions go through. In simple terms, if you see this on the thread stack then your thread is processing TSQL statements.

So in summary, at this point we now know that our server failed over because a huge blocking chain ground the server to a complete halt. Eventually the blocking chain was so long that the cluster resource monitor could not log in. At this point it deemed the server to be down (it was actually up, but only in a theoretical sense, as it couldn't process new work) and the cluster failed it over. This also probably explains why it came back up without a problem, as the head blocker would have been killed by the failover and other work then resumed normally.

The obvious question at this point is what was causing the blocking. Now, if we had private symbols, we could simply explore the locals on that stack and dig out the memory sections which describe the actual TSQL being executed. Since only Microsoft engineers can do this, we have two options: either send the dump to MS and pay for them to look at it, or prepare advanced diagnostics using public tools and attempt to capture the problem again the next time it happens. Since I'd prefer not to pay Microsoft just to sort out blocking problems, we'll do the latter, and Part 2 of this post describes what we did.

Until the next time.

Graham

We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.

SQL Server 2008 crash: ex_terminator – Last chance exception handling

Here at Basefarm we run a large number of 24×7 business-critical systems for our customers. So when one of our Windows customers recently encountered an unexplained SQL Server crash on one such system, it got our highest attention.

In my long experience with SQL Server (I've been working with it since 1997) it's fairly rare to get a real crash, especially in the more recent versions. People often refer to crashes or incidents or outages, but it's very rare that the actual Windows process terminates unexpectedly. A classic example of this confusion is an unexpected cluster failover. Such an event may manifest itself as an outage, and obviously the process itself has stopped on one node and started on another, but in the majority of cases this is because something has happened within the database engine which makes the cluster resource monitor implement a failover (more on that another time, maybe). In this case that was not the scenario: the sqlservr.exe process just died suddenly with no warning.

In this case the only evidence and symptoms we had were three files in the LOG directory of the SQL instance. (Our monitoring detected no performance problems leading up to the crash, for example.) If you get unexplained behaviour in SQL Server it's always a good idea to check the LOG directory for these files. Here are the ones we had:

      SQLDump0001.txt
      SQLDump0001.mdmp
      SQLDump0001.log

These are the standard output from SQL Server in later versions when critical errors are encountered. The files are a memory dump (.mdmp), a textual representation of the memory dump (.txt), and a small extract from the SQL Server error log at the time of the crash (.log).

The number suffix is just a unique naming convention in case SQL needs to create multiple dumps. In this case it was pleasing to see that this was the first dump file this SQL instance had ever created (0001), which in itself is a good sign of reliability. Over the years I've seen machines run out of disk space because thousands of these files were created and no-one ever noticed or bothered to look at them, so if you're not checking your SQL machines for such evidence now, you should be!

The best and quickest thing to do here is to open the two text files and see what you can see; for example, are there any interesting errors in the error log leading up to the crash? In this case we had nothing: the error log was completely clean, with only successful backup messages being logged, so no clues there.

Then we open the text representation of the dump and we see this

=====================================================================
BugCheck Dump
=====================================================================

This file is generated by Microsoft SQL Server
version 10.0.2531.0
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.

Computer type is Intel(R) Xeon(R) CPU E5430 @ 2.66GHz.
Bios Version is HP - 2
DL 380 G5 v1.4 6-fan Woodcrest
8 X64 level 8664, 14 Mhz processor (s).
Windows NT 6.0 Build 6002 CSD Service Pack 2.

 

 

Memory
MemoryLoad = 89%
Total Physical = 49149 MB
Available Physical = 5092 MB
Total Page File = 98509 MB
Available Page File = 54094 MB
Total Virtual = 8388607 MB
Available Virtual = 8338574 MB
**Dump thread - spid = 0, EC = 0x0000000000000000
***Stack Dump being sent to XXXXXXXXXXXXXXXXXXXX\MSSQL\LOG\SQLDump0001.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 03/21/11 12:55:22 spid 2308
*
* ex_terminator - Last chance exception handling
*
*

There are a couple of key points here. The first is the exact build number you're running; the second is that a memory usage summary is printed out. This isn't going to allow you to do any detailed analysis, but it at least proves that there was physical and virtual memory available on the box.

The third is the key point which is why this dump was generated. In this case it’s

“ex_terminator – Last chance exception handling“

As I said before, this is a very uncommon error and pretty much unhelpful to us, as all it tells us is that SQL has written this dump out as part of its last chance exception handling when terminating. Often when you read these text representations of memory dumps you actually see a useful message here, and sometimes even the SQL code which was running that caused the exception. An example of the better type of error you can get is here on my old MSDN blog:

http://blogs.msdn.com/b/grahamk/archive/2008/09/11/non-yielding-iocp-listener-stack-dump-analysis.aspx

In this case we're out of luck though, so our only option is to send the dump to Microsoft for analysis, as it says at the top of the text file.

The bottom line is that there is a fix for this particular problem in CU3 for SQL Server 2008 SP2 (http://support.microsoft.com/kb/2498535). Applying this patch ought to stop the problem occurring in future. If you have a Microsoft support contract and are experiencing this problem, I would strongly recommend that you get them to analyse the dump file for you to make sure it's the exact same issue, as it could theoretically be something else.
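If you want to check which build an instance is currently on before deciding whether a cumulative update is relevant to you, a quick way (this works on any reasonably recent version) is:

-- Report the build, service pack level and edition of the current instance
SELECT SERVERPROPERTY('ProductVersion') AS build,
       SERVERPROPERTY('ProductLevel') AS service_pack_level,
       SERVERPROPERTY('Edition') AS edition
GO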

As you might have gauged from the MSDN blog above, you can often do a rudimentary review of a SQL Server dump yourself with a debugger if you want to know what's in it. There's a limit to how far you can take this, due to the fact that SQL Server only releases public symbols on the Microsoft symbol server. Normally in scenarios such as this I would decode the dump myself and get a thread stack out, so that people reading this post could check whether they have the exact same problem. Unluckily, in this case the correct public symbols haven't been released onto the Microsoft symbol server, so I was unable to get the stack out.

We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.