Posts

SQL Server TSM Restore appears to hang or takes much longer than expected

I’ve written previously about the dangers of VLF fragmentation, but the problems I’ve actually experienced before were always related to log based operations, i.e. recovery phase after a crash or database mirroring, but last week I saw an alternate issue where doing a full restore from IBM Tivoli Storage Manager (TSM).

At the start I can say the same thing that I always say when writing about this subject

Pre-grow your log files (and your data files) in advance. Avoid repeated online auto-growth if at all possible.

That said, here’s an example of what happens when you don’t, and more importantly how to fix it.

The Symptoms

You’re trying to do a full database restore via TSM (GUI or command line)

The restore takes much longer than you expected

The progress bar in TSM GUI says that the restore is 100% complete, and yet it is still running

If you run a script similar to this one, it also says 100% complete, but the restore still runs

SELECT r.session_id,r.command,CONVERT(NUMERIC(6,2),r.percent_complete)

AS [Percent Complete],CONVERT(VARCHAR(20),DATEADD(ms,r.estimated_completion_time,GetDate()),20) AS [ETA Completion Time],

CONVERT(NUMERIC(10,2),r.total_elapsed_time/1000.0/60.0) AS [Elapsed Min],

CONVERT(NUMERIC(10,2),r.estimated_completion_time/1000.0/60.0) AS [ETA Min],

CONVERT(NUMERIC(10,2),r.estimated_completion_time/1000.0/60.0/60.0) AS [ETA Hours],

CONVERT(VARCHAR(1000),(SELECT SUBSTRING(text,r.statement_start_offset/2,

CASE WHEN r.statement_end_offset = -1 THEN 1000 ELSE (r.statement_end_offset-r.statement_start_offset)/2 END)

FROM sys.dm_exec_sql_text(sql_handle)))

FROM sys.dm_exec_requests r WHERE command IN (‘RESTORE DATABASE’,’BACKUP DATABASE’)

It can be several hours (or even days) in this state.

In the error log all you see are rows indicating that the restore has started

Starting up database ‘xxx’.

The explanation

It’s most likely that your database log file has become logically fragmented into many virtual log files (VLF). Many means different things to different systems, but more than 1000 can definitely be a problem. In the problem I encountered last week it was 17000 which made a 25 minute restore take 3 hours longer than expected.

If you’re unfamiliar with the principals of VLF you should read the following:

Transaction Log Physical Architecture and also Factors That Can Delay Log Truncation.

If you want to check any of your critical database now to see whether you have this fragmentation, you can run the following:

DBCC LOGINFO (‘xxx’)

This is one of those commands that is officially undocumented, but that everyone actually uses! It’s been raised on the connect site to have it moved into a DMV in the future

https://connect.microsoft.com/SQLServer/feedback/details/322149/msft-mso-document-dbcc-loginfo-or-create-new-dmvs-to-view-vlf-info

I’ve run it innumerable times on production databases though (normally when fixing problems like this).

If the rows returned is greater than 1000 you might have a problem. To test whether you have a problem all you need to do is attempt to restore a backup of the database (you can do this in a test server) and see if you experience an unreasonable delay. If you do , then I would recommend you try to fix it.

The solution

You need to truncate and shrink the log back to a point where the fragmentation does not occur. The great people at SQL Skills have a very full article on this here:

http://www.sqlskills.com/blogs/kimberly/post/8-Steps-to-better-Transaction-Log-throughput.aspx

The (very easy) script to fix it is right at the end, but if you’re in this position and you’ve come this far, I’d recommend that you read the above article in full to understand why you got where you are.

Until the next time….

Graham

We’re hiring – https://www.basefarm.com/sv/jobb/Lediga-tjanster-Sverige/ Drop us a line if you’d like to come and work on interesting problems like this with us.