SQL Server deadlocked schedulers lead to unexpected cluster failover – Part 2

In Part 1, I looked at how to identify that a large blocking chain was causing resource exhaustion in SQL Server, leading to a cluster failover. Because we were working retrospectively, we lacked the data to analyse what condition had actually caused the blocking chain in the first place. This post addresses that part of the troubleshooting.

The problem had occurred a number of times in this environment, and no-one had ever managed to log into the machine to analyse it before it failed over, so I wanted to put something in place proactively to ensure we caught the condition as it happened. For reference, had I been able to log into the machine I would have simply run the SQL 2008 Perfstats scripts as a starting point; I tend to keep them handy on my critical production machines, just in case I ever need them in a hurry. You just run the script for a couple of minutes through SSMS and you often get a good view of the problem. You can download the scripts from CodePlex, where they are held as part of the SQLNexus project.
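Incidentally, if you just need a very quick look before you have the full scripts to hand, an ad-hoc check along these lines will usually tell you whether a long-running batch is blocking others (a minimal sketch against the standard DMVs, not the perfstats script itself; the session_id > 50 filter is just a common rule of thumb for skipping system sessions):

SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.total_elapsed_time,
       r.status,
       t.text AS batch_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id > 50              -- skip most system sessions
ORDER BY r.total_elapsed_time DESC;  -- longest-running requests first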

However, in this case I had the opportunity to capture exactly what I wanted, as I was planning to set everything up in advance. This made it a perfect opportunity to use SQLDIAG to collect all the data I wanted in one simple place. If you don’t know what SQLDIAG is, then read these first:

Performance Problems in SQL Server – What data should I collect?

http://msdn.microsoft.com/en-us/library/ms162833.aspx

The good thing nowadays (since the above were written) is that there are a couple of automated ways to get things going, as opposed to manually editing XML files. Personally I still sometimes do manual edits, depending upon how many customizations of my own I want to put in, but there is also the option of using the BAT files supplied with the SQLNexus project, and I’ve recently noticed that a SQLDIAG configuration tool has been released.

The bottom line here is that I configured SQLDIAG to collect the following:

1. Perfstats scripts

2. Perfmon of common SQL and OS counters

3. DBCC MEMORYSTATUS (I always like to include this as it has such a low overhead and can be incredibly useful if you hit unexpected internal problems like compilation bottlenecks; there’s a small standalone sketch of capturing it after this list)

4. General base information (defaults from SQLDIAG)

5. Profiler – restricted to batch-level events, as this is a heavily used system.
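For reference, if you ever wanted the DBCC MEMORYSTATUS snapshots from item 3 without SQLDIAG, a crude standalone loop along these lines works (just a sketch; the one-minute interval is an assumed value, and within SQLDIAG a custom collector handles the scheduling for you):

-- Dump DBCC MEMORYSTATUS to the SSMS output every minute until you cancel the query
WHILE 1 = 1
BEGIN
    PRINT '--- DBCC MEMORYSTATUS snapshot at ' + CONVERT(varchar(30), GETDATE(), 126) + ' ---';
    DBCC MEMORYSTATUS;
    WAITFOR DELAY '00:01:00';   -- assumed interval, adjust to taste
END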

***IMPORTANT NOTE***

You need to take extreme care with running Profiler nowadays on some of the more high-end SQL machines. In the past two years I’ve seen a couple of enterprise SharePoint systems brought down through the use of excessive SQL Profiler traces. If you’ve not run Profiler on your system before and you have very high throughput, make sure you first do a test run of just a server-side trace at a very high level, capturing very few events, to check that your server and disks can handle the throughput; there’s a minimal sketch of such a trace after this note.

***IMPORTANT NOTE***
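To illustrate the kind of minimal test run I mean, here’s a sketch of a server-side trace that captures only the two completed-batch events and a handful of columns (the output path and rollover size are assumptions; point it somewhere with plenty of disk space):

DECLARE @TraceID int;
DECLARE @maxfilesize bigint;
DECLARE @on bit;
SET @maxfilesize = 512;     -- MB per rollover file (assumed value)
SET @on = 1;

-- Create the trace with file rollover enabled (option 2); the path is an assumption
EXEC sp_trace_create @TraceID OUTPUT, 2, N'D:\Traces\throughput_test', @maxfilesize, NULL;

-- Event 10 = RPC:Completed, event 12 = SQL:BatchCompleted
-- Columns: 1 = TextData, 12 = SPID, 13 = Duration, 14 = StartTime
EXEC sp_trace_setevent @TraceID, 10, 1,  @on;
EXEC sp_trace_setevent @TraceID, 10, 12, @on;
EXEC sp_trace_setevent @TraceID, 10, 13, @on;
EXEC sp_trace_setevent @TraceID, 10, 14, @on;
EXEC sp_trace_setevent @TraceID, 12, 1,  @on;
EXEC sp_trace_setevent @TraceID, 12, 12, @on;
EXEC sp_trace_setevent @TraceID, 12, 13, @on;
EXEC sp_trace_setevent @TraceID, 12, 14, @on;

-- Start the trace and note the ID; stop it later with sp_trace_setstatus @TraceID, 0
-- and remove the definition with sp_trace_setstatus @TraceID, 2
EXEC sp_trace_setstatus @TraceID, 1;
SELECT @TraceID AS TraceID;

Watch the trace file growth and your disk latency counters for a few minutes; if they stay comfortable you can think about adding more events.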

Since I didn’t know when the problem would occur next, I then needed to implement a routine to manage the SQLDIAG start and stop times and file output. The BOL documentation above lists the switches and syntax you need to do this, with examples.
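As an illustration, a scheduled run looks something like this from a command prompt (the configuration file name, output path and times here are placeholders rather than my real values; a relative start such as /B +02:00:00 is also supported):

REM Start collecting at 23:00 tonight and stop at 06:00 tomorrow morning,
REM writing to a dedicated folder and renaming any previous output rather than overwriting it (/N 2)
sqldiag /I SD_Detailed.XML /O D:\SQLDIAG\Output /B 20110221_23:00:00 /E 20110222_06:00:00 /N 2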

So I put all this in place and then waited for the problem to occur again, which it promptly did within a few days.

ANALYSIS:

Anyone who analyses performance problems with any frequency will probably tell you that there’s no right or wrong way to go about this, but in this case, since I’m pretty certain I’m looking for a blocking problem, my first port of call is the perfstats output. I suspect I’ll only need the other data to back up my findings, but I’ve got it there just in case.

Perfstats output is normally my first port of call in most performance cases anyway, as my first question is:

Is there a user batch running for a long time?

The answer to this question often removes whole areas of troubleshooting possibilities, so I like to answer it first. If the answer is yes, you can look at the batches in question and try to work out why; if the answer is no, then you need to be looking at different areas such as internals. In this case it was very clear from perfstats that there was a long-running batch. I’m limited in how much output I can show here for security reasons, but here’s some sample output from the [headblockersummary] section of perfstats.

(This wide output doesn’t display well on our fixed-width blog, so if you want to read it, cut and paste it into a decent text editor like Notepad++.)

-- headblockersummary --
runtime                        head_blocker_session_id blocked_task_count tot_wait_duration_ms blocking_resource_wait_type              avg_wait_duration_ms max_wait_duration_ms max_blocking_chain_depth
------------------------------ ----------------------- ------------------ -------------------- ---------------------------------------- -------------------- -------------------- ------------------------
2011-02-20T14:36:25.090                            661                  1                84252 LOCK BLOCKING                                           84252                84252                        1
2011-02-20T14:36:25.090                             70                  1                 2899 LOCK BLOCKING                                            2899                 2899                        1

-- headblockersummary --
runtime                        head_blocker_session_id blocked_task_count tot_wait_duration_ms blocking_resource_wait_type              avg_wait_duration_ms max_wait_duration_ms max_blocking_chain_depth
------------------------------ ----------------------- ------------------ -------------------- ---------------------------------------- -------------------- -------------------- ------------------------
2011-02-20T14:36:45.507                            347                  3               105468 LOCK BLOCKING                                           35156               105059                        3
2011-02-20T14:36:45.507                             70                  2                36169 LOCK BLOCKING                                           18084                23706                        2
2011-02-20T14:36:45.507                            347                 52                 1630 PAGELATCH_* WAITS                                          31                   98                        3
2011-02-20T14:36:45.507                            391                 20                  318 PAGELATCH_* WAITS                                          15                   60                        1
2011-02-20T14:36:45.507                            393                 12                  129 PAGELATCH_* WAITS                                          10                   17                        1
2011-02-20T14:36:45.507                            377                  1                   37 PAGELATCH_* WAITS                                          37                   37                        1

-- headblockersummary --
runtime                        head_blocker_session_id blocked_task_count tot_wait_duration_ms blocking_resource_wait_type              avg_wait_duration_ms max_wait_duration_ms max_blocking_chain_depth
------------------------------ ----------------------- ------------------ -------------------- ---------------------------------------- -------------------- -------------------- ------------------------
2011-02-20T14:40:36.170                             70                449             52689330 LOCK BLOCKING                                          117348               228943                        2

If you look at the rows for SPID 70 above, we can see that it starts causing some blocking at 14:36:25, but it’s only blocking one session and there is other blocking anyway, so it appears transitory. This code is a trigger firing which needs to do a large update that takes a few minutes (I can’t show the code or explain why; that’s just the way this application works).

Twenty seconds later SPID 70 is still blocking, but only directly blocking two sessions. Now there are about 80 other SPIDs blocked, and they are effectively behind SPID 70 deeper down the blocking chain. However, these then clear, so the blocking still seems transitory; if we look at the web server logs over this period, the web servers are still up and the majority of OLTP throughput is being served within normal limits.

But then, four minutes later, we see the same SPID 70, but this time it’s directly blocking 449 other sessions, and it’s at this point that we start to see alerts from our monitoring of the application servers. Over the next minute this blocking chain grows larger and larger until we hit the deadlocked schedulers, and shortly after that the cluster fails over.

So at first glance we have a simple problem here: SPID 70 has a long-running piece of code which blocks all users on the system. But upon further investigation, this trigger fires regularly without blocking the entire system (we saw, leading up to the failure, that it had been running for several minutes with only transitory blocking), so why did it suddenly block everyone?

The answer here lies in what is being blocked (the blockees) rather than the head blocker. Upon looking further into the details of the blocked SPIDs, it turns out that they are all running the exact same code from the exact same machine, which is a BizTalk host instance in this environment. It turns out that this host instance has no control over error handling or throttling, and when one session gets blocked, it just opens another, and another, and so on. Because it happened to clash with the trigger running under SPID 70, it continues to open more and more connections until eventually we reach complete resource consumption and there are no sessions left for any users whatsoever. Then we get our failover.
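For completeness, the kind of check that reveals this is just a grouping of the currently blocked sessions by host and statement, something like the sketch below (an assumed DMV-based equivalent, not the exact perfstats query):

SELECT s.host_name,
       s.program_name,
       COUNT(*) AS blocked_sessions,
       MIN(LEFT(t.text, 100)) AS sample_batch_text   -- a snippet of what they are running
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0                     -- only sessions that are currently blocked
GROUP BY s.host_name, s.program_name
ORDER BY blocked_sessions DESC;

When hundreds of blocked sessions collapse into a single host running a single piece of SQL, as they did here, the blockee pattern is far more interesting than the head blocker.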

So the solution here was two-fold: firstly, we could attempt to optimize the work of the long-running trigger to reduce the potential for concurrency issues; secondly, we could attempt to control the BizTalk host more effectively to stop it opening so many connections concurrently. Had BizTalk not been involved here, although we would have experienced some minor blocking, the main application OLTP throughput would still have functioned; but BizTalk stole all the server connections, and then we had our crash.

The moral of this story is that in complex production situations you may need to make multiple data collections before you can get to the root of your problem (however deep your skills and knowledge are), and, as in many situations, it’s about knowing which diagnostics and tools to utilise, as opposed to knowing the answers off the top of your head.

Until the next time.

Graham

We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.