Resizing mgmt_tablespace in Grid Control 12c and reclaiming space

We noticed the MGMT_TABLESPACE in our 12c Grid Control production database was very big. Almost 380gb of data was stored in this tablespace.

At first I did not think anything was wrong, given the number of targets registered in this Grid Control instance and how heavily it is used.

Until someone looked into this and found out it was a bug in 12c…

Note 1502370.1 describes this bug and its solution: install a patch and truncate the em_job_metrics table. After applying the patch and truncating the table, there was only around 10gb of data left in the MGMT_TABLESPACE.

Because I had experienced something like this five years ago in a 10g OEM database, I knew that it is not easy to reclaim the space in this tablespace. And we had a lot to reclaim: only 10gb of data left in a tablespace of 380gb…
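To put a number on how much was reclaimable, a query along these lines against dba_data_files and dba_segments shows allocated versus actually used space per tablespace (a generic sketch, not taken from my original notes):

-- compare allocated space with space actually used by segments
SELECT df.tablespace_name,
       ROUND(df.total_mb / 1024, 1) AS allocated_gb,
       ROUND(NVL(sg.used_mb, 0) / 1024, 1) AS used_gb
FROM   (SELECT tablespace_name, SUM(bytes) / 1024 / 1024 AS total_mb
        FROM   dba_data_files
        GROUP  BY tablespace_name) df
LEFT JOIN
       (SELECT tablespace_name, SUM(bytes) / 1024 / 1024 AS used_mb
        FROM   dba_segments
        GROUP  BY tablespace_name) sg
ON     sg.tablespace_name = df.tablespace_name
WHERE  df.tablespace_name = 'MGMT_TABLESPACE';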

I decided to check My Oracle Support; maybe someone else had already had the same problem. I found bug 17461366, which makes it impossible to reorg the mgmt_tablespace because of AQ objects. The same problem I faced five years ago. Because I really wanted to reclaim the free space (370gb), I decided to follow my own notes, even though those steps were performed on a 10g OEM environment and I was now working with a 12c Grid Control environment.

Just like 5 years ago, I could think of two ways to reclaim the space:

1. export mgmt_tablespace, drop the tablespace and import it again

2. export sysman, drop the repository, run scripts from the OMS_HOME and import sysman again

Export MGMT_TABLESPACE

Unfortunately, dropping the mgmt_tablespace turned out to be mission impossible. The export succeeded without errors, but I was not able to drop the tablespace. After dropping some of the related objects that showed up in the errors, I decided this was not going to work. I restored the database and went for option 2.

Export Sysman user(s)

I decided to follow note 388090.1. This note describes a platform migration for a 10g Grid Control environment, but I could not find any document about a 12c Grid Control environment.

Just to be sure, I exported (expdp) not only sysman but also the other sysman users (sysman_apm, sysman_mds and sysman_opss). I also set job_queue_processes to zero.
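Disabling the job queue for the duration of the export/import looks roughly like this; note the original value first so it can be restored later:

-- remember the current value so it can be restored after the import
SELECT value FROM v$parameter WHERE name = 'job_queue_processes';

-- stop dbms jobs from running during the export/import
ALTER SYSTEM SET job_queue_processes = 0;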

The next step was to drop the current repository using the repmanager utility ($OMS_HOME/sysman/admin/emdrep/bin).

Repmanager did not drop the mgmt_tablespace either; I had to drop that tablespace myself. But after the repository was deleted I was able to drop it. The repository drop did remove all the sysman users and the other tablespaces.

According to the note, before the impdp, I had to run some scripts from the $OMS_HOME/sysman/admin/emdrep/sql/core/latest/admin directory.

These scripts create the sysman users and tablespaces. The following SQL scripts were executed:

- admin_create_tablespaces.sql

- admin_create_repos_user.sql

- admin_pre_import.sql

- admin_sys_procs.sql

- admin_profiles.sql

- admin_grants_repos_user.sql

- admin_grants_view_user.sql

Now, you should be ready to import the sysman schema again (impdp). But the import showed a lot of errors. The main reason: the sysman_ro user was not available and the mgmt_ad4j_ts tablespace had not been created. I decided to drop the sysman user and start over. But this time, instead of running the scripts before the import, I created a whole new repository using Repmanager.

The new repository did have the sysman_ro user and also the missing tablespace.

At this point I started the import again, using the 'table_exists_action=replace' option. The repository creation had already created the sysman objects, so I wanted impdp to replace the already created tables with the tables from the dump. I noticed one table creation error in the log file: the em_job_type_creds_info table was not created by impdp. I had to create this one after the impdp.

The other sysman users were not created by Repmanager either. I had to create and import these users (sysman_mds, sysman_apm and sysman_opss) myself.

After the import I returned to the note again, to check for post import steps.

Again in $OMS_HOME/sysman/admin/emdrep/sql/core/latest/admin:

- admin_recompile_invalid.sql

- admin_create_synonyms.sql

- admin_post_import.sql

Reset job_queue_processes to its original value and submit the EM dbms jobs:

- admin_submit_dbms_jobs.sql

After compiling, there were still problems with AQ objects.

I dropped the AQ objects (exec dbms_aqadm.drop_queue_table) and created them again (dbms_aqadm.create_queue_table). This solved the problems and also cleared the invalid objects.
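In PL/SQL terms that looks roughly like this; the queue table and payload type names below are placeholders, the real definitions come from a working Grid Control repository:

BEGIN
  -- drop the broken queue table (force drops its queues as well)
  dbms_aqadm.drop_queue_table(queue_table => 'SYSMAN.SOME_QUEUE_TABLE', force => TRUE);
  -- recreate it with the definition taken from a healthy repository
  dbms_aqadm.create_queue_table(queue_table        => 'SYSMAN.SOME_QUEUE_TABLE',
                                queue_payload_type => 'SYSMAN.SOME_PAYLOAD_TYPE',
                                multiple_consumers => TRUE);
END;
/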

At this point there were no invalid objects anymore, so the OMS could be started again to see what would happen.

The OMS started OK and I was able to log in to 12c Grid Control.

With 12c Grid Control running, I noticed the OMS was bouncing every 12 minutes by itself.

I checked the emctl.msg file in /gc_inst/em/EMGC_OMS1/sysman/log for errors and found the following error:

HealthMonitor Nov 15, 2013 12:57:48 PM PbsAdminMsgListener error: PbsAdminMsgListener thread timed out.
Critical error err=3 detected in module PbsAdminMsgListener
OMS will be restarted. A full thread dump will be generated in the log file

It seemed the em_cntr_queue was missing. I checked and found that some other queues were also missing (apparently not in the dump?). I recreated the missing queues, comparing against the queues in an OTA Grid environment (a query like the one below, run on both sides, makes the comparison easy), and that solved the restart error. The OMS was not restarting by itself anymore.
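A generic sketch of that comparison query, listing the SYSMAN queues in a repository:

SELECT name, queue_table, queue_type
FROM   dba_queues
WHERE  owner = 'SYSMAN'
ORDER  BY name;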

Reclaimable free space

After these steps (which took me two days…) I reclaimed 370gb. The mgmt_tablespace is now 10gb instead of the 380gb it was before, and the total size of the entire database shrank from more than 400gb to 22gb!!

OpenStack Folsom quick installation

We tend to use VMware for our virtual machines, but it’s always important to evaluate and try out new software, which is why I decided to try out OpenStack (Folsom)! The biggest advantages of OpenStack are, in my opinion, that it’s open source and free, plus backed by many large organizations such as Rackspace Cloud, NASA, AMD, Intel, Canonical, SUSE Linux, Red Hat, Cisco, Dell, HP, IBM and Yahoo! 🙂 You can find more information about OpenStack on the following site: http://www.openstack.org/

Installing OpenStack isn’t the easiest of things though, and doing a reinstallation of a system can be quite time consuming.
I’ve written up some scripts to simplify my installations of OpenStack, and decided to make a blog post about it as I’ve not been able to find any installation scripts for Folsom.

Quick Installation (under 5 minutes) scripts for OpenStack Folsom.
——————————-

A quick installation guide for OpenStack Folsom for a single or dual node installation.

Prerequisites:
- The base minimum is one server with 2 network cards (one public and one internal, or just one public ethernet card plus one virtual for internal use) running Ubuntu 12.04. Two servers are required if you want to run Compute (Nova) on a secondary server.
- Run the scripts as root.
- Have a minimum of two drives (the installation uses LVM)
– It’s recommended that you do this on a reinstalled test machine. It’s against all possible recommendations to run these scripts on a production environment as it could potentially break/delete everything.
– It’s recommended that you have a range of public/internal IPs so you can assign public IPs to your virtual machines.

Installation (as ROOT):
apt-get update
apt-get install git
git clone https://github.com/speedis/QuickInstallFolsom.git
cd QuickInstallFolsom
chmod +x *.sh
./folsom01.sh
Follow the guide.

Uninstall:
To uninstall the whole installation (this will remove ALL trace of MySQL, Apache, RabbitMQ on the server etc so make sure you view and edit this file before running it!), run ./uninstall-folsom.sh

Mainly based on:
http://docs.openstack.org/trunk/openstack-compute/install/apt/content/
https://github.com/mseknibilel/OpenStack-Folsom-Install-guide/

Happy OpenStacking!

Debugging IIS crashes – default WER location

I was debugging some IIS crashes last week and thought I’d follow up with a few basics here as it’s a common enough problem. Another time I might write a series of posts on using the Windows debuggers in detail and how one can go about this from scratch, but for the moment here’s a quick summary of some basic starting points. I wrote some other more detailed examples of .NET debugging in the past on my MSDN blog, although those use slightly different CLR versions and extensions which have since been updated.

Firstly, I walked into this situation blind, as you often do in such matters. The developers of the application in question told me that they had been experiencing crashes across all their web servers since they last did a code deploy. (Insert questions and comments here about the testing regime which allows this to occur.) The Windows error logs showed the following in the Application event log:

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: MSVCR100_CLR0400.dll, version: 10.0.30319.1, time stamp: 0x4ba2211c
Exception code: 0xc00000fd
Fault offset: 0x0000000000057f91
Faulting process id: 0x11f0
Faulting application start time: 0x01cd29d083c0e51e
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: C:\Windows\system32\MSVCR100_CLR0400.dll
Report Id: fdd757b8-95ee-11e1-94a4-005056bc00a6

The key here is the Exception code: 0xc00000fd, which translates as stack overflow (never good!). I pulled the logs and agreed with their initial assessment, but they said that they couldn’t find any dumps that had been produced automatically. As such I immediately attached DebugDiag to one of the web servers to ensure that I could capture a full dump the next time it occurred. However, once this was in place I went back through the logs and dug around the server in more detail to check whether it was really the case that the server had not produced any dumps automatically. Sometimes in Windows 2008 and above WER logging is not particularly transparent in what it’s doing, so I checked manually. After a short while of searching for .dmp or .mdmp files I noted that the default WER location for these servers was

C:\ProgramData\Microsoft\Windows\WER\ReportQueue

Once I browsed there I found a treasure trove of old dumps, error logs and all sorts of joy which helped me diagnose the issue. WER had not written to the event logs that it was taking dumps and collecting information, but all the same I wasn’t surprised to see that it had been doing its stuff, since there had been a lot of application crashes. It just goes to show that it’s always worth a look.

In this case the actual debug was fairly simple, as a stack overflow crash is pretty straightforward to work through. It’s just a matter of these steps if you’re familiar with the Windows debuggers:

1. Load the dump
2. Set the symbols, ensuring you have private symbols for the customer code
3. Load your .NET debugger extension (I used psscor4)
4. Dump the stack of the thread with the stack overflow exception on it
5. Send the code to the developers and get them to fix it 🙂

Here’s hoping you don’t encounter any stack overflows yourselves!

SQL Server setup fails due to partitioned network warnings from cluster service

I was building a new SQL Server 2008 R2 failover cluster recently and encountered a problem that I hadn’t seen before (which is rare, as I’ve seen A LOT of cluster setup problems in my time!). This time it was strange because the error occurred before setup actually ran, while I was still going through the dialogue boxes to configure it.

The scenario was this:

1. Cluster was fully built and validated at a windows level, all resources were up and OK
2. I was about to run SQL Setup when I noticed the network binding order was wrong
3. I changed this and then decided to reboot both nodes as I always do this before a cluster setup
4. The nodes came back online OK and all resources came up as well
5. I ran setup, but when I got to the cluster network configuration dialog box there were no networks to select from, so I couldn’t go forward.

My first thought was that I must have done something dumb when changing the network binding order but checks on the network adapters showed that they were all up. I then went back through a few other things and noticed that the cause of the error was actually that the cluster service was having issues with connecting to one of the networks. There were 2 types of error / warning in the cluster logs and the system event logs:

Error

Cluster network ‘xxxxx’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Warning

Cluster network interface ‘xxxxx – xxxxx’ for cluster node ‘xxxxx’ on network ‘xxxxx’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

I had to engage the help of some network specialists as I couldn’t get to the bottom of this on my own. The networks actually appeared up and we could connect to them and use them independently outside of the cluster, but the cluster was convinced that they were partitioned. To cut a long story short, after checking many things we realised that the problem was down to the fact that one of the networks was actually a teamed network implemented using BASP virtual adapters, and this team was not coming up fast enough after the node rebooted, so it was still unavailable when the cluster service tried to bind it in as a resource.

The fix was simple, in that we set the cluster service to delayed start and then everything was fine. We didn’t need to make any configuration changes beyond this. Once the cluster service was happy that the network was OK, SQL Server setup was able to continue just fine.

Good luck with your cluster builds!

Tracing select statements on specific objects in SQL Server without using Profiler

A developer of an application asked me an interesting question the other day. He wanted to know (for reasons not worth going into here) whether his application issued select statements against a specific table in his database. This database was in production and under heavy load, so although we could run a server-side SQL Profiler trace and then read through the results, this could be a time-consuming process which could generate an extremely large amount of data (and also a quite heavy load on the server). We also wanted to run this monitoring for a number of days, so we needed something more lightweight.

I thought about this for a while and realised that the best way to achieve this (assuming you are running SQL 2008 or later) would be through the new SQL Audit functionality. This uses the Extended Events framework as the basis for its tracing and therefore falls into the lightweight category.

Here’s an example of what I wrote, converted into simple test objects which you can try yourself. This example requires a table called dbo.audit_test to be present in a database named audit_db for you to test against.


USE master ;
GO
-- Create the server audit.
CREATE SERVER AUDIT test_server_audit
TO FILE ( FILEPATH =
'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA' ) ;
GO

-- Enable the server audit.
ALTER SERVER AUDIT test_server_audit
WITH (STATE = ON) ;
GO
-- Move to the target database.
USE audit_db ;
GO

-- Create the database audit specification.
CREATE DATABASE AUDIT SPECIFICATION audit_test_table
FOR SERVER AUDIT test_server_audit
ADD (SELECT, INSERT, UPDATE
ON dbo.audit_test BY dbo, guest, public, db_datareader)
WITH (STATE = ON) ;
GO

/*
do some work here
which will trigger the audit to record something
*/

/* uncomment these statements to turn off the audit at either DB or server level

--turn off the database audit
use audit_db
go
ALTER DATABASE AUDIT SPECIFICATION audit_test_table
WITH (STATE = OFF)

use master
go
--turn off the server audit
ALTER SERVER AUDIT test_server_audit
WITH (STATE = OFF) ;
GO

*/

Here are the key things to note about the above example:

1. This one actually traces 3 types of table access: SELECT, INSERT and UPDATE
2. It traces specific users and groups – you can change these as relevant to your example
3. It writes the output to the default DATA directory of a default 2008 R2 install – change as you see fit
4. You need to watch the potential file space this will take up as it can be very verbose in big systems
5. Watching the file space used in real time will not work, as it holds most of the data in memory and flushes when you stop the trace

Once you have the output you need (and you have turned off the audit – don’t forget!) you simply run something like this to view the data (you’ll need to locate the exact file name created each time you turn the audit on or off).


SELECT COUNT(*), statement
FROM sys.fn_get_audit_file ('C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\test_server_audit_7E707DDD-03F3-4FFA-B24B-BB0DDBF4D5F3_0_129714455341990000.sqlaudit',default,default)
group by statement
GO

As you can see the above does a simple count and aggregate of the results, but there are many columns in the output which you can write TSQL against (although since it’s backed by a file the access might be slow if you have large files or slow disks!).
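For example, a variation on the same query (against the same audit file) breaks the counts down by login and object; server_principal_name, object_name and statement are all standard columns returned by sys.fn_get_audit_file:

-- count audited statements per login and object
SELECT server_principal_name, object_name, statement, COUNT(*) AS hits
FROM sys.fn_get_audit_file ('C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\test_server_audit_7E707DDD-03F3-4FFA-B24B-BB0DDBF4D5F3_0_129714455341990000.sqlaudit',default,default)
GROUP BY server_principal_name, object_name, statement
ORDER BY hits DESC
GO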

I found this to be a most effective technique and it didn’t impact the performance of the server whatsoever.

You can read more about SQL Server Audit at MSDN.

pdfforge toolbar causes critical errors in Microsoft Office

Recently one of our customers (who has administrative access and is responsible for their own applications) chose to install the product pdfforge on their server. This server acts as a terminal server, providing Microsoft Office services for multiple users who access it remotely.

Everything had been working fine on this server until one Sunday morning it started throwing errors like the one below and the whole of Microsoft Office stopped working.

Activation context generation failed for “C:\Program Files (x86)\ReadSoft\INVOICES\Bin\eiveri.exe.Manifest”.Error in manifest or policy file “” on line . A component version required by the application conflicts with another component version already active. Conflicting components are:. Component 1: C:\Windows\WinSxS\manifests\x86_Microsoft.vc90.crt_1fc8b3b9a1e18e3b_9.0.30304.0 _none_d9c474bda3593bfa.manifest. Component 2: C:\Windows\WinSxS\manifests\x86_microsoft.vc90.crt_1fc8b3b9a1e18e3b_9.0.30729. 4148_none_5090ab56bcba71c2.manifest.

We soon found that the start time of the Office errors matched the install of the pdfforge toolbar v4.7, so without much detailed troubleshooting we suspected this change to be the cause of the problem.

Unfortunately despite uninstalling the offending component and trying other actions like repairing MS Office, we were stuck with the same error.

Now it’s my personal opinion that troubleshooting Office shared assemblies is about as much fun as a poke in the eye with a sharp stick, and moreover in my experience you can waste inordinate numbers of hours, if not days, with little to no success. The same often goes for troubleshooting corrupted SSMS installs that are having .NET assembly problems. I’m not saying you can’t do it, I have done it myself, I’d just like to be able to reclaim those days and have devoted them to something worthwhile!

So what I really wanted was a quicker solution, especially after I read numerous reports of this software being considered as spyware / malware or just generally invasive and horrible. I’m not making a judgment on any of these specific points, I’m just saying that if you Google about the toolbar you will find innumerable reports of people complaining that it trashed their systems. Your mileage may vary of course, but I didn’t like what I read at all.

It would appear that pdfforge itself is well respected but that the toolbar is the more problematic part. Looking back through our logs we could actually see that toolbar v4.6 had run without issue for several months. Only the upgrade to v4.7 seemed to cause the issue. What was also worrying was that no-one authorised the v4.7 upgrade; it just installed itself in the background. You can read why in this excellent article:

http://jivebay.com/2010/06/23/pdfforges-pdfcreator-sneaks-in-pdfforge-toolbar/

Anyway, we took the pragmatic approach and ran a system state restore on the server in question to a time before the toolbar upgraded itself. We then advised the customer in question to consider whether they might want to remove this software completely, or at least ensure that it couldn’t download and install upgrades without user interaction.

SQL Server 2008 R2 setup fails due to invalid credentials

A colleague was trying to install a SQL Server 2008 R2 standalone instance today and he kept hitting a weird error when passing the credentials for the service accounts. He was running the setup as a local administrator on the server in question and he was trying to add a domain user as the service account.

The server in question was joined to the appropriate domain already and it had been checked that all the appropriate firewall ports were open. We knew that the user account he was using could query the domain structure as it was able to browse the domain to select the user account from the setup screen.

The errors we got for the two accounts were (the text is truncated on the form):

The credentials you provided for the SQL Server Agent service are invalid. To continue…

The specified credentials for the SQL Server service are not valid. To continue…

We took a methodical troubleshooting approach and tested some other domain accounts which we knew were valid and were running other SQL instances elsewhere. These failed as well, meaning that we must have been encountering either some unexpected behaviour within this setup session, or we were being blocked from talking to the domain controllers in some fashion. We again checked the firewalls and they were confirmed as OK.

Then we went into the setup logs. The summary log just showed that we had cancelled the installation. For example, you see stacks like this, which tell you nothing really:

2011-09-12 15:59:21 Slp: Exception type: Microsoft.SqlServer.Chainer.Infrastructure.CancelException
2011-09-12 15:59:21 Slp:     Message:
2011-09-12 15:59:21 Slp:         User has cancelled.

However, if you look in the detail.txt log file, located one directory lower, you can scroll up from the bottom and find the actual cause of the problem in the setup page validation routines. It looks like this (I’ve removed the timestamps for better readability and also blanked all the identifying information, obviously):

SQLEngine: --InputValidator: Engine : Attempting to get account sid for account DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account from sid S-1-5-21-999999999-999999999-999999999-9999
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN
Slp: Sco: Attempting to get sid for user account DOMAIN
Slp: Sco: GetSidForAccount normalized accountName DOMAIN parameter to DOMAIN
SQLEngine: --InputValidator: Engine : Service Acccount Specified, Validating Password
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to validate credentials for user account DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN
Slp: Sco: Attempting to get sid for user account DOMAIN
Slp: Sco: GetSidForAccount normalized accountName DOMAIN parameter to DOMAIN
Slp: Sco: Attempting to see if user DOMAIN\account exists
Slp: Sco.User.OpenRoot - Attempting to get root DirectoryEntry for domain/computer 'DOMAIN'
Slp: Sco: Attempting to check if user account DOMAIN\account exists
Slp: Sco: Attempting to look up AD entry for user DOMAIN\account
Slp: Sco.User.OpenRoot - root DirectoryEntry object already opened for this computer for this object
Slp: Sco.User.LookupADEntry - Attempting to find user account DOMAIN\account
Slp: Sco: Attempting to check if container 'WinNT://DOMAIN' of user account exists
Slp: UserSecurity.ValidateCredentials — Exception caught and ignored, exception is Access is denied.
Slp: UserSecurity.ValidateCredentials — user validation failed

The last two lines show the problem. As you can see, our account has some permissions on the domain: it successfully gets the SID and performs various other tasks. However, when it comes to the step

Attempting to check if container 'WinNT://DOMAIN' of user account exists

it fails…and moreover it then swallows the exception, and then to my amusement actually records that it’s swallowed the exception! To me this is really strange, I guess you could argue that the exception is security related and therefore is swallowed for security protection, but then it records what the error is in the log file, so this seems rather unintuitive to me. The bottom line here is that the account in question doesn’t have the specific privileges on our domain that SQL setup wants here, and so it fails and reports that the service account is invalid. In fact the service account is not invalid, the account used to lookup the service account is invalid. In my mind what should really happen here is that you should get a standard windows AD credentials challenge as the process has caught and handled an access denied error, meaning that it could present this information to the user. But hey, that’s just my opinion.

At the end of the day we changed the setup to use a different account with higher privileges (by running setup with a different logged on user) and everything worked just fine. The key here is that the error is misleading, it’s the interactive account under which you are running setup which has the problem, not the service account you’re trying to add.

SQL Server 2000 cannot start after windows update reboot

Last night we had an incident with one of our customers’ old SQL Server 2000 instances. The machine in question had had its WSUS Windows update run last night and had been forced to reboot afterwards. After this happened the SQL Server service refused to start and just got stuck in a cycle of permanent restarts. When we looked at the error logs we saw the following repeated symptom.

2011-09-12 11:33:00.59 server    Microsoft SQL Server  2000 – 8.00.2039 (Intel X86)
May  3 2005 23:18:38
Copyright (c) 1988-2003 Microsoft Corporation
Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

2011-09-12 11:33:00.59 server    Copyright (C) 1988-2002 Microsoft Corporation.
2011-09-12 11:33:00.59 server    All rights reserved.
2011-09-12 11:33:00.59 server    Server Process ID is 1284.
2011-09-12 11:33:00.59 server    Logging SQL Server messages in file ‘C:\Program Files (x86)\Microsoft SQL Server\MSSQL\log\ERRORLOG’.
2011-09-12 11:33:00.59 server    SQL Server is starting at priority class ‘normal'(4 CPUs detected).
2011-09-12 11:33:00.60 server    SQL Server configured for thread mode processing.
2011-09-12 11:33:00.62 server    Using dynamic lock allocation. [2500] Lock Blocks, [5000] Lock Owner Blocks.
2011-09-12 11:33:00.68 server    Attempting to initialize Distributed Transaction Coordinator.
2011-09-12 11:33:02.84 spid3     Starting up database ‘master’.
2011-09-12 11:33:02.85 server    Using ‘SSNETLIB.DLL’ version ‘8.0.2039’.
2011-09-12 11:33:02.85 spid5     Starting up database ‘model’.
2011-09-12 11:33:02.85 spid3     Server name is ‘XXXXXXXXXX’.
2011-09-12 11:33:02.85 spid8     Starting up database ‘msdb’.
2011-09-12 11:33:02.85 spid9     Starting up database ‘removedforsecurity’.
<SNIP for brevity>
2011-09-12 11:33:02.85 spid23    Starting up database ‘removedforsecurity’.
2011-09-12 11:33:02.85 spid5     Bypassing recovery for database ‘model’ because it is marked IN LOAD.
2011-09-12 11:33:02.85 server    SQL server listening on <removedforsecurity>: xxxx.
2011-09-12 11:33:02.85 server    SQL server listening on <removedforsecurity>: xxxx.
2011-09-12 11:33:02.87 spid20    Starting up database ‘removedforsecurity’.
2011-09-12 11:33:02.89 spid5     Database ‘model’ cannot be opened. It is in the middle of a restore.

The important line here is the last one. Model cannot be opened as it’s marked IN LOAD; this in turn means that tempdb can’t be created, which in turn means that the service cannot start.

Quite how model came to be in the IN LOAD state we’ll never know. I went back through the logs and there was nothing suspicious, and no-one had actually attempted to restore it. Circumstantially the evidence points to something being corrupted by Windows update, as this is when the problem started, but retrospectively we’ll never be able to say. The MDF and LDF files themselves were intact and seemingly OK, so it was time for a manual attempt to try and get the service back.

The way to recover this (and to troubleshoot it initially) is rather dirty. You have to start the server in single user mode at the console and pass it a couple of trace flags to get into the system catalogues. You should only ever do this as a last resort, but this was a last resort, as it was either do this or system state restore the entire machine. (I couldn’t restore the SQL Server database as the instance wouldn’t actually start.) Starting the service in console / single user mode is common enough for SQL machines that won’t start, as you get stuck in a chicken and egg scenario of needing to start it to see why it won’t start! However, update the system tables at your peril. In later versions of SQL Server (2005 upwards) the system tables are actually obscured behind views in a separate protected database to make this harder to do, although you can still do it.

The following KB article has a good description of how to do this.

http://support.microsoft.com/kb/822852

The symptoms there are the same but the cause was different. That said, the solution is also the same, in that you have to manually update the status of the database in sysdatabases and then restart the server. There’s no guarantee of success here, as it could have been that the MDF file was actually corrupted, but luckily for me it did work and I was saved from a full system restore of the server.
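For reference, the dirty fix itself looks roughly like this once the instance is up in single user mode; the status bit value (32 = loading) is taken from the SQL Server 2000 Books Online description of sysdatabases, so treat this as a sketch and follow the KB article rather than copying it blindly:

-- SQL Server 2000 only: allow direct updates to the system tables
EXEC sp_configure 'allow updates', 1
RECONFIGURE WITH OVERRIDE
GO

-- clear the 'loading' status bit (32) for model
UPDATE master..sysdatabases SET status = status & ~32 WHERE name = 'model'
GO

-- put the safety catch back on before restarting the service
EXEC sp_configure 'allow updates', 0
RECONFIGURE WITH OVERRIDE
GO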

Cannot add new node to Windows Hyper-V cluster - SCSI disk validation error

We make extensive use of Hyper-V within Basefarm and I recently encountered a strange problem when doing maintenance on a 5-node Windows cluster running the Hyper-V role. For reasons I won’t bore you with here (pre-production testing, basically), I had been evicting and adding nodes to my host-level Windows cluster, but when trying to add a node back I encountered errors in the validation tests. This was strange, as the node had previously been in the cluster and nothing had been changed on it whatsoever, so I already knew that it had previously passed the validation test and had been successfully running as a cluster node!

The cluster validation reported an error in the storage test named

Validate SCSI device Vital Product Data (VPD)

[screenshot: cluster_validate_error - the error as it appears in the validation report]

The actual errors returned were like this:

Failed to get SCSI page 83h VPD descriptors for cluster disk 1 from node <nodename> status 2

(I’ve removed the node name here, obviously, but the report does say specifically which one.)

Fortunately, before too long I found that this was due to a bug in Hyper-V validation, for which a hotfix is available here:

http://support.microsoft.com/kb/2531907

Downloading and installing this on all the nodes and potential new nodes resolves the error.

This goes to show the value of pre-production testing as the aim of this cluster is to provide a dynamically expanding virtualisation service for one of our largest customers. If it came to a production situation where I needed to add a node quickly, this is not the type of scenario I would be wanting to troubleshoot live!

How to resolve orphaned file ownership in windows 2008

Here’s something I was looking at this morning, which is not an uncommon problem I think. I was doing some disk cleaning for a customer and some extremely large files that we needed to remove were locked and inaccessible even by a local administrator on the machine. There are multiple reasons why this can occur, but in this case it was because the original file was created by a domain user account which had subsequently been disabled.

This meant that if you tried to delete the file, or take ownership of it using standard windows controls you would be refused access.

In this case I resolved it using the takeown command-line tool that comes with Windows:

http://technet.microsoft.com/en-us/library/cc753024(WS.10).aspx

It was actually a 3 stage process.

1. Run takeown with the /F switch to grant ownership to the current user (if you want you can grant ownership to a different user) but I simply logged on as a local administrator and took ownership using this login.

2. Following this you have ownership, but you still can’t delete or move the file. You need to go into the Security tab and explicitly grant access to a login (which can include yourself) to give the new user full control.

3. Once this is complete you can simply delete the file using the account you have granted permission to. (Before using takeown you are unable to access this security tab.)

It’s probably more sensible to grant the permission to an administrative group in the longer term, if you’re not simply deleting the files. It’s also worth noting that takeown can perform the same operations on groups of files or whole directories, all of which can experience the same problem.