SQL Server deadlocked schedulers lead to unexpected cluster failover – Part 1

Here at Basefarm we offer a number of different levels of SLA and support, which can be tailored to customers’ individual needs. One of the highest levels we offer is called Basefarm Application Management. At this level we take full responsibility not only for the operation of the infrastructure but also for the customer-developed applications and code. The following post is an example of some troubleshooting I did for one such customer recently. It’s a typical SQL Server problem in high-end production environments, and one you need to be able to handle if you want to keep your SQL machines running 24×7. (It also has some good geek value, as it allows us to employ several advanced diagnostic routines, including a Windows debugger.) The analysis is quite long, so I’ve split it into two posts.

The symptoms were as follows (they all appeared within a couple of minutes of each other):

1. Monitoring starts to catch warnings of potential slow performance in front end web servers and middle tier applications

2. As we’re trying to log on to the machines in questions, customer reports problems with websites from user feedback

3. At the same time as 2, monitoring declares all web servers to be unavailable

4. We connect to the web servers and middle tier servers and they appear fine in themselves.

5. We try to connect to the SQL Server instance, but we can’t.

6. We try to connect to SQL using a DAC (dedicated administrator connection), but this fails as well, because the Windows cluster has just failed over.

7. All services come up without problem on the new node.

All in all the performance problem lasts only a few minutes (less than 10), and the actual complete website outage is probably less than 2 minutes, thanks to the resilience of the Windows cluster and SQL Server. However, if you’re aiming for very high uptime, small outages like this will really hurt you (and I take it as a personal affront if machines I’m responsible for run slowly and fail over!).

My initial plan when dealing with this was to log into the SQL machine (gut feeling told me this was where the problem lay) and run some diagnostics to prove or disprove this hypothesis. As it was, the cluster proved the hypothesis for me by failing over before I could connect. So this is the evidence I was left to work with:

SQL memory dump in the LOG directory of the SQL instance

SQL error log

The error log showed this (certain identifying information removed for anonymity):

2011-02-20 14:41:26.75 spid1061    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.75 spid1061    The client was unable to reuse a session with SPID 1061, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.75 spid1056    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.75 spid1056    The client was unable to reuse a session with SPID 1056, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.76 spid1063    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.76 spid1063    The client was unable to reuse a session with SPID 1063, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.76 spid1066    Error: 18056, Severity: 20, State: 29.
2011-02-20 14:41:26.76 spid1066    The client was unable to reuse a session with SPID 1066, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
2011-02-20 14:41:26.89 Server      **Dump thread – spid = 0, EC = 0x0000000000000000
2011-02-20 14:41:26.89 Server      ***Stack Dump being sent to M:\MSSQL10.MSSQLSERVER\MSSQL\LOG\SQLDump0013.txt
2011-02-20 14:41:26.89 Server      * *******************************************************************************
2011-02-20 14:41:26.89 Server      *
2011-02-20 14:41:26.89 Server      * BEGIN STACK DUMP:
2011-02-20 14:41:26.89 Server      *   02/20/11 14:41:26 spid 6072
2011-02-20 14:41:26.89 Server      *
2011-02-20 14:41:26.89 Server      * Deadlocked Schedulers
2011-02-20 14:41:26.89 Server      *
2011-02-20 14:41:26.89 Server      * *******************************************************************************
2011-02-20 14:41:26.89 Server      * -------------------------------------------------------------------------------
2011-02-20 14:41:26.89 Server      * Short Stack Dump
2011-02-20 14:41:26.94 Server      Stack Signature for the dump is 0x0000000000000363

There are two key points to note here.

The 18056 error is the first evidence in the error log of anything being wrong in SQL. I don’t find its description particularly helpful, but I do recognise it as something that often occurs as machines start to lock up in the SQL engine and / or begin to shut down or fail over. In this case it’s more of what we in England would call “a red herring”, as there actually aren’t any issues with connection pooling here.

Then we see the key information: there are deadlocked schedulers and a memory dump has been produced. This is the real root problem. It means that every SQL scheduler is unable to process any work, which means that the server is effectively dead at that point in time.
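
As an aside, scheduler starvation is sometimes visible from the DMVs while it is building up, provided you can still get a connection in (for example over the DAC) before the failover. Below is a minimal sketch of that kind of check, not something we were able to run during this incident since we never got connected:

-- Rough health check of the SOS schedulers. Schedulers whose workers are
-- all busy and whose work queues keep growing are showing the kind of
-- worker starvation that ends in a "Deadlocked Schedulers" dump.
SELECT scheduler_id,
       status,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count,
       pending_disk_io_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255   -- visible (user) schedulers only
ORDER BY scheduler_id;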

At this point the Windows cluster will detect that the SQL Server is unavailable (as it can’t log in to run its health routines) and it therefore fails the service over to another node (as defined in our cluster resource failover settings).

So far so good: we know roughly which type of problem we are encountering and where it lies (within the SQL engine), but we have no idea why. I did check the SQL default trace file to ensure that no nasty DBCC events or file growths had occurred. They hadn’t, which meant I could form a hypothesis that it was probably a standard user SQL batch causing the issue.
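
For reference, this default trace check is easy to reproduce. Here is a minimal sketch, assuming the default trace is enabled and you have permission to read it (the event names come from sys.trace_events):

-- Find the current default trace file and look for DBCC and auto-grow
-- events around the time of the incident.
DECLARE @path nvarchar(260);

SELECT @path = path
FROM sys.traces
WHERE is_default = 1;

SELECT t.StartTime,
       e.name AS EventName,
       t.DatabaseName,
       t.TextData,
       t.HostName,
       t.ApplicationName
FROM sys.fn_trace_gettable(@path, DEFAULT) AS t
JOIN sys.trace_events AS e
    ON t.EventClass = e.trace_event_id
WHERE e.name IN ('Audit DBCC Event',
                 'Data File Auto Grow',
                 'Log File Auto Grow')
ORDER BY t.StartTime DESC;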

At this point it’s time to open the memory dump files and see if we can find anything of use. As usual with SQL, we have 3 files: the dump itself, a textual representation of the dump and an extract of the error log. The latter 2 only really give us what’s shown above, so we need to open the dump itself to get anything good. At this stage, if you don’t like debugging, you might like to skip to part 2 of this post (which covers collecting more information and diagnosing the root problem), but the dump is interesting (to me at least).

There’s no time here to explain how to configure a Windows debugger, but it’s well documented on MSDN. The setup I used is this:

A default installation of Debugging Tools for Windows, downloaded from here:

http://www.microsoft.com/whdc/devtools/debugging/installx86.Mspx

A symbol path set exactly as per the example “getting started” page here:

http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx

i.e. the symbol path used is:

SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

I open the dump and it takes me straight to thread 4, which has this stack, indicating that this is just the deadlock checking thread.

0:004> kL
Child-SP          RetAddr           Call Site
00000000`0795d598 00000000`779ec0b0 ntdll!ZwWaitForSingleObject+0xa
00000000`0795d5a0 00000000`01c21879 kernel32!WaitForSingleObjectEx+0x9c
00000000`0795d660 00000000`01c2123b sqlservr!CDmpDump::DumpInternal+0x4d9
00000000`0795d760 00000000`025fb0f0 sqlservr!CDmpDump::Dump+0x3b
00000000`0795d7b0 00000000`026b45be sqlservr!SQLDumperLibraryInvoke+0x1a0
00000000`0795d7e0 00000000`028020f0 sqlservr!CImageHelper::DoMiniDump+0x3ce
00000000`0795d9c0 00000000`0280c0eb sqlservr!stackTrace+0x710
00000000`0795eef0 00000000`025d3867 sqlservr!SQL_SOSDeadlockSchedulersCallback+0xab
00000000`0795ef30 00000000`01f2c2df sqlservr!SOS_OS::ExecuteDeadlockSchedulerCallbacks+0xb7
00000000`0795f1f0 00000000`00ec18ae sqlservr!_chkstk+0x3f17c2
00000000`0795f380 00000000`03585579 sqlservr!SchedulerMonitor::Run+0xfe
00000000`0795f4a0 00000000`00eb163b sqlservr!SchedulerMonitor::EntryPoint+0x9
00000000`0795f4d0 00000000`00eb12fa sqlservr!SOS_Task::Param::Execute+0x11b
00000000`0795f5f0 00000000`00eb0e35 sqlservr!SOS_Scheduler::RunTask+0xca
00000000`0795f680 00000000`01430e50 sqlservr!SOS_Scheduler::ProcessTasks+0x95
00000000`0795f6f0 00000000`014309a0 sqlservr!SchedulerManager::WorkerEntryPoint+0x110
00000000`0795f7b0 00000000`01449ab0 sqlservr!SystemThread::RunWorker+0x60
00000000`0795f7e0 00000000`0143013f sqlservr!SystemThreadDispatcher::ProcessWorker+0x12c
00000000`0795f870 00000000`75a337d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x12f
00000000`0795f900 000007ff`fffad488 msvcr80!endthreadex+0x47
00000000`0795f908 00000000`00ddc320 0x7ff`fffad488
00000000`0795f910 00000000`00000000 0xddc320

It’s at this point that debugging can become a little frustrating. As an ex-Microsoft engineer I’m used to having source access and private symbols, but now that I’m out in the “real” world, I only have public symbols like everyone else (note, for example, that I have no symbols for the bottom 2 rows in the stack – these are just the Windows thread start functions, so not a problem here, but for some reason I can’t get this customer’s Windows symbols from Microsoft). This may well be fine for CLR debugging, but our options are very limited for SQL. That said, we can simply run ~* kL to check the stacks of all of the process threads. Since the SQL source is written in a beautifully structured way (most of the time!), the method and function names are usually very indicative of what is happening, and with a little common sense you may be able to get a clue as to what’s going on.

The output of every thread is obviously too much to show here, but here’s a simple summary.

There are a total of 674 threads in this dump, of which the first 70 or so are what I would classify as system threads. This isn’t an official piece of terminology; it’s just that whenever you look at a SQL dump, the first threads in numerical order are always common functions found within all SQL processes, such as the deadlock checker (thread 4 in our case), the resource monitor, service brokers and so on, and they can basically be ignored in this case. What’s interesting is that after thread 75 there are basically 600 threads with an identical stack, which looks like this:

76  Id: 1664.2210 Suspend: 0 Teb: 000007ff`ffbcc000 Unfrozen
Child-SP          RetAddr           Call Site
00000009`a480b678 00000000`77a342b7 ntdll!ZwSignalAndWaitForSingleObject+0xa
00000009`a480b680 00000000`00eb3d65 kernel32!SignalObjectAndWait+0x117
00000009`a480b740 00000000`00ea3edc sqlservr!SOS_Scheduler::SwitchContext+0x84e
00000009`a480b9e0 00000000`00ea3a58 sqlservr!SOS_Scheduler::SuspendNonPreemptive+0xbc
00000009`a480ba20 00000000`013a667d sqlservr!EventInternal<Spinlock<149,1,0> >::Wait+0x428
00000009`a480c010 00000000`013a6557 sqlservr!LockOwner::Sleep+0xbd
00000009`a480c080 00000000`00ecb584 sqlservr!lck_lockInternal+0xe59
00000009`a480ce00 00000000`00f47e53 sqlservr!RowsetLockCache::GetTableLock+0x194
00000009`a480cf20 00000000`00f6dc81 sqlservr!MDL::FLockObjectId+0xb3
00000009`a480cfb0 00000000`00f6dbf6 sqlservr!SMD::LockObjectAndCheckOnlineVersion+0x71
00000009`a480d030 00000000`00f6d420 sqlservr!CXStmtQuery::XretSchemaChanged+0x860
00000009`a480d160 00000000`00f6d2fa sqlservr!CXStmtDML::XretDMLExecute+0x100
00000009`a480d260 00000000`00f19616 sqlservr!CXStmtDML::XretExecute+0x4a
00000009`a480d2a0 00000000`00f1abbb sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x55a
00000009`a480d550 00000000`00f1a4d9 sqlservr!CMsqlExecContext::FExecute+0x58b
00000009`a480d6d0 00000000`03021534 sqlservr!CSQLSource::Execute+0x319
00000009`a480d800 00000000`03021c4a sqlservr!CStmtExecProc::XretLocalExec+0x1b4
00000009`a480d870 00000000`03021e39 sqlservr!CStmtExecProc::XretExecExecute+0x3da
00000009`a480df20 00000000`00f19616 sqlservr!CXStmtExecProc::XretExecute+0x19
00000009`a480df60 00000000`00f1abbb sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x55a
00000009`a480e210 00000000`00f1a4d9 sqlservr!CMsqlExecContext::FExecute+0x58b
00000009`a480e390 00000000`02aae08d sqlservr!CSQLSource::Execute+0x319
00000009`a480e4c0 00000000`02aba374 sqlservr!ExecuteSql+0x72d
00000009`a480ef40 00000000`03496361 sqlservr!CSpecProc::ExecuteSpecial+0x234
00000009`a480f060 00000000`00f1d10b sqlservr!CSpecProc::Execute+0x1f1
00000009`a480f1d0 00000000`00f17b4a sqlservr!process_request+0x370
00000009`a480f490 00000000`00eb163b sqlservr!process_commands+0x1ba
00000009`a480f690 00000000`00eb12fa sqlservr!SOS_Task::Param::Execute+0x11b
00000009`a480f7b0 00000000`00eb0e35 sqlservr!SOS_Scheduler::RunTask+0xca
00000009`a480f840 00000000`01430e50 sqlservr!SOS_Scheduler::ProcessTasks+0x95
00000009`a480f8b0 00000000`014309a0 sqlservr!SchedulerManager::WorkerEntryPoint+0x110
00000009`a480f970 00000000`01449ab0 sqlservr!SystemThread::RunWorker+0x60
00000009`a480f9a0 00000000`0143013f sqlservr!SystemThreadDispatcher::ProcessWorker+0x12c
00000009`a480fa30 00000000`75a337d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x12f
00000009`a480fac0 000007ff`ffbcd488 msvcr80!endthreadex+0x47
00000009`a480fac8 00000000`2bb43140 0x7ff`ffbcd488
00000009`a480fad0 00000000`00000000 0x2bb43140

This is a rare scenario: if you took a dump of a working SQL Server (you can do this with SQLDumper, and it’s a great technique employed in the famous Ken Henderson book), you’d see all the different threads running different classes and methods, as each one processes its own workload.

The lock-related functions near the top of the stack (LockOwner::Sleep and lck_lockInternal) should be self-explanatory. This thread is processing a worker which is trying to acquire a lock. The lock can’t be acquired at this time, so it goes into sleep mode. This is a very common example of how standard lock blocking looks in memory dumps. Since nearly all of the threads in the dump are in this function, what we appear to have here is a huge lock blocking chain, which eventually leads to the actual failover event: deadlocked schedulers, i.e. no schedulers at all are available to process work, as they are all stuck waiting for this block to clear. The other function worth noting whenever you are debugging SQL Server (well, versions 2000, 2005, 2008 and 2008 R2 at least, as this could change in future) is sqlservr!process_request, which is the function that all standard TSQL batch executions go through. In simple terms, if you see this on a thread’s stack then the thread is processing TSQL statements.
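
On a live server (as opposed to in a dump) the same picture normally shows up in the DMVs. The following is only a sketch of that kind of live check, assuming you can still get a query in; it is not the actual diagnostics we ended up deploying (those are covered in part 2):

-- Who is blocked, what they are waiting on, and who is blocking them.
-- A session that blocks others but has blocking_session_id = 0 itself
-- is the head of the chain.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.status,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;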

So in summary, at this point we now know that our server failed over because a huge blocking chain ground the server to a complete halt. Eventually the blocking chain was so long that the cluster resource monitor could not log in. At this point it deemed the server to be down (it was actually up, but only in a theoretical sense, as it couldn’t process new work) and the cluster failed it over. This also probably explains why it came back up without a problem, as the head blocker would have been killed by the failover and other work then resumed normally.

The obvious question at this point is what was causing the blocking. Now if we had private symbols, we could simply explore the locals on that stack and dig out the memory sections which describe the actual TSQL being executed. Since only Microsoft engineers themselves can do this, we have 2 options: either send the dump to MS and pay for them to look at it, or prepare advanced diagnostics to attempt to capture the problem again using public tools the next time it happens. Since I’d prefer not to pay Microsoft just to sort out blocking problems, we’ll do the latter, and part 2 of this post describes what we did.

Until the next time.

Graham

We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.

2 replies
  1. Sreejith Gopalakrishnan says:

    Excellent thread, I had a similar issue yesterday in production. Still analyzing the deadlocked schedulers. But the hard part is we had mirroring and it didn’t fail over! Do you know of any scenario during thread starvation where mirroring wouldn’t fail over? I can see in your post you mentioned the cluster should fail over, but we experienced no failover for mirroring in this situation.
    Please find the question I posted in the MSDN forum with all details:

    https://social.msdn.microsoft.com/Forums/sqlserver/en-US/db84760d-e169-4486-a355-541727408145/thread-starvation-database-mirroring-failed-to-failover?forum=sqldatabasemirroring

    If you are aware of any known issue with SQL Server 2008 R2 Standard 10.50.4000, please let me know.
