SQL Server 2008 R2 setup fails due to invalid credentials

A colleague was trying to install a SQL Server 2008 R2 standalone instance today and he kept hitting a weird error when passing the credentials for the service accounts. He was running the setup as a local administrator on the server in question and he was trying to add a domain user as the service account.

The server in question was joined to the appropriate domain already and it had been checked that all the appropriate firewall ports were open. We knew that the user account he was using could query the domain structure as it was able to browse the domain to select the user account from the setup screen.

The errors we got for both accounts were (the text is truncated on the form)

The credentials you provided for the SQL Server Agent service are invalid. To continue…

The specified credentials for the SQL Server service are not valid. To continue…

We took a methodical troubleshooting approach and tested some other domain accounts which we knew were valid and were running other SQL instances elsewhere. These failed as well, meaning that we must have been encountering either some unexpected behaviour within this setup session, or we were being blocked from talking to the domain controllers in some fashion. We again checked the firewalls and they were confirmed as OK.

Then we went into the set logs. The summary log just showed that we had cancelled the installation. For example you see stacks like this which tells you nothing really

2011-09-12 15:59:21 Slp: Exception type: Microsoft.SqlServer.Chainer.Infrastructure.CancelException
2011-09-12 15:59:21 Slp:     Message:
2011-09-12 15:59:21 Slp:         User has cancelled.

However if you look in the detail.txt log file, contained one directory lower, you can scroll from the bottom up and find the actual cause of the problem in the page validation routines. It looks like this (I’ve removed the timestamps for better readability and also blanked all the identifying information obviously)

SQLEngine: –InputValidator: Engine : Attempting to get account sid for account DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account from sid S-1-5-21-999999999-999999999-999999999-9999
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN
Slp: Sco: Attempting to get sid for user account DOMAIN
Slp: Sco: GetSidForAccount normalized accountName DOMAIN parameter to DOMAIN
SQLEngine: –InputValidator: Engine : Service Acccount Specified, Validating Password
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to validate credentials for user account DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN\account
Slp: Sco: Attempting to get sid for user account DOMAIN\account
Slp: Sco: GetSidForAccount normalized accountName DOMAIN\account parameter to DOMAIN\account
Slp: Sco: Attempting to get account sid for user account DOMAIN
Slp: Sco: Attempting to get sid for user account DOMAIN
Slp: Sco: GetSidForAccount normalized accountName DOMAIN parameter to DOMAIN
Slp: Sco: Attempting to see if user DOMAIN\account exists
Slp: Sco.User.OpenRoot – Attempting to get root DirectoryEntry for domain/computer ‘DOMAIN’
Slp: Sco: Attempting to check if user account DOMAIN\account exists
Slp: Sco: Attempting to look up AD entry for user DOMAIN\account
Slp: Sco.User.OpenRoot – root DirectoryEntry object already opened for this computer for this object
Slp: Sco.User.LookupADEntry – Attempting to find user account DOMAIN\account
Slp: Sco: Attempting to check if container ‘WinNT://DOMAIN’ of user account exists
Slp: UserSecurity.ValidateCredentials — Exception caught and ignored, exception is Access is denied.
Slp: UserSecurity.ValidateCredentials — user validation failed

I’ve highlighted the problem section. As you can see our account has some permissions on the domain and successfuly gets the SID and various other tasks. However when it comes to the method

Attempting to check if container ‘WinNT://DOMAIN’ of user account exists

it fails…and moreover it then swallows the exception, and then to my amusement actually records that it’s swallowed the exception! To me this is really strange, I guess you could argue that the exception is security related and therefore is swallowed for security protection, but then it records what the error is in the log file, so this seems rather unintuitive to me. The bottom line here is that the account in question doesn’t have the specific privileges on our domain that SQL setup wants here, and so it fails and reports that the service account is invalid. In fact the service account is not invalid, the account used to lookup the service account is invalid. In my mind what should really happen here is that you should get a standard windows AD credentials challenge as the process has caught and handled an access denied error, meaning that it could present this information to the user. But hey, that’s just my opinion.

At the end of the day we changed the setup to use a different account with higher privileges (by running setup with a different logged on user) and everything worked just fine. The key here is that the error is misleading, it’s the interactive account under which you are running setup which has the problem, not the service account you’re trying to add.

SQL Server 2000 cannot start after windows update reboot

Last night we had an incident with one of our customers’  old SQL Server 2000 instances. The machine in question had had it’s WSUS windows update run last night and had been forced to reboot after this had occurred. After this happened the SQL Server service refused to start and just got stuck in a cycle of permanent restarts. When we looked at the error logs we saw the following repeated symptom.

2011-09-12 11:33:00.59 server    Microsoft SQL Server  2000 – 8.00.2039 (Intel X86)
May  3 2005 23:18:38
Copyright (c) 1988-2003 Microsoft Corporation
Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

2011-09-12 11:33:00.59 server    Copyright (C) 1988-2002 Microsoft Corporation.
2011-09-12 11:33:00.59 server    All rights reserved.
2011-09-12 11:33:00.59 server    Server Process ID is 1284.
2011-09-12 11:33:00.59 server    Logging SQL Server messages in file ‘C:\Program Files (x86)\Microsoft SQL Server\MSSQL\log\ERRORLOG’.
2011-09-12 11:33:00.59 server    SQL Server is starting at priority class ‘normal'(4 CPUs detected).
2011-09-12 11:33:00.60 server    SQL Server configured for thread mode processing.
2011-09-12 11:33:00.62 server    Using dynamic lock allocation. [2500] Lock Blocks, [5000] Lock Owner Blocks.
2011-09-12 11:33:00.68 server    Attempting to initialize Distributed Transaction Coordinator.
2011-09-12 11:33:02.84 spid3     Starting up database ‘master’.
2011-09-12 11:33:02.85 server    Using ‘SSNETLIB.DLL’ version ‘8.0.2039’.
2011-09-12 11:33:02.85 spid5     Starting up database ‘model’.
2011-09-12 11:33:02.85 spid3     Server name is ‘XXXXXXXXXX’.
2011-09-12 11:33:02.85 spid8     Starting up database ‘msdb’.
2011-09-12 11:33:02.85 spid9     Starting up database ‘removedforsecurity’.
<SNIP for brevity>
2011-09-12 11:33:02.85 spid23    Starting up database ‘removedforsecurity’.
2011-09-12 11:33:02.85 spid5     Bypassing recovery for database ‘model’ because it is marked IN LOAD.
2011-09-12 11:33:02.85 server    SQL server listening on <removedforsecurity>: xxxx.
2011-09-12 11:33:02.85 server    SQL server listening on <removedforsecurity>: xxxx.
2011-09-12 11:33:02.87 spid20    Starting up database ‘removedforsecurity’.
2011-09-12 11:33:02.89 spid5     Database ‘model’ cannot be opened. It is in the middle of a restore.

The important line here is the last one. Model cannot be opened as it’s marked IN LOAD, this in turn means that tempdb can’t be created which in turn means that the service cannot start.

Quite how model became to be status IN LOAD we’ll never know. I went back through the logs and there was nothing suspicious and no-one has actually attempted to restore it. Circumstantially the evidence points to something being corrupted by windows update, as this is when the problem started, but retrospectively we’ll never be able to say. The MDF and LDF files themselves were intact and seemingly OK, so it was time for a manual attempt to try and get the service back.

The way to recover this (and to troubleshoot it initially) is rather dirty. You have to start the server in single user mode at the console and pass it a couple of trace flags to get you into the system catalogues. You should only ever do this as a last result, but this was a last resort, as it was either do this, or system state restore the entire machine. (I couldn’t restore the SQL Server database as it wouldn’t actually start.) The nature of starting a service in console / single user mode is common enough for SQL machines that won’t start, as you get stuck in a chicken and egg scenario of needing to start it to see why it won’t start! However update the system tables are your peril. In later versions of SQL Server (2005 upwards) the system tables are actually obscured behind views in a separate protected database to make this harder to do, although you can still do it.

The following KB article has a good description of how to do this.

http://support.microsoft.com/kb/822852

The symptoms are the same but the cause was different. That said the solution is also the same in that you have to manually update the status of the database in sysdatabases and then restart the server. There’s no guarantee of success here as it could have been that the MDF file was actually corrupted, but luckily for me it did work and I was saved during a full system restore of the server.

Alltid kul med nytänkande

Med referens till artikeln i IDG http://www.idg.se/2.1085/1.403348/serverrack-ska-radda-jatte-i-kris .

Nöden är väl ändå uppfinningens moder? Trots alla stora och återkommande framsteg som skett genom åren på prestandasidan i form av minne, cpu och format, har väldigt lite hänt på själva plattformssidan. En server anno 2001 är väldigt lik en server idag. Administration och automation av infrastruktur har alltid varit något av en efterkonstruktion vid framtagandet av t.ex. serverprodukter och vi som arbetar med just automation och drift vill gärna uppmuntra fler leverantörer att följa Ciscos exempel.

Att göra en server tillgänglig för produktion innefattar många fler steg än att bara montera servern i racket och sedan slå på strömbrytaren, snarare handlar det om ett 50-tal moment som ska utföras och testas innan man ger tummen upp för produktion. Verktyg som underlättar för oss att kunna systematiskt integrera dom i våra befintliga system och arbetsflöden för att sedan bara kunna klicka på “Kör” stärker både kvaliteten och flexibiliteten för våra kunder.

För att gå tillbaka till framtiden, så vill jag tro att många av oss går och väntar på äkta multitenancy även på hårdvarusidan, där vi kan avbrottsfritt tillföra eller frigöra maskinkraft på låg nivå, men för att då oundvikligen snegla mot stordatorvärlden, så förutsätts då att både hårdvara, OS och applikationer är framtagna och underhålls i harmoni. Och, den moderna termen för detta är väl ändå Platform as A Service (PaaS) och vi får hoppas på att utvecklare hoppar på tåget och går i den riktningen. Än så länge är adaptionen av PaaS väldigt låg för nya produkter och tjänster på nätet och man fastnar i labb-stadiet.

Så, för att återgå i ämnet, är UCS då räddningen för Ciscos framtid? Vem vet, när fler operatörer börjar bygga PaaS plattformar så är Cisco UCS en tilltalande infrastruktur att bygga den kring. Kommer UCS produkten tilltala IT-chefen som köper en server i månaden och har inga automatiserade arbetsflöden eller ambitioner, förmodligen inte. 🙂

/Stefan Månsby

Cannot add new node to windows hyper-v cluster–SCSI disk validation error

We make extensive use of Hyper-V within Basefarm and I recently encountered a strange problem when doing maintenance on a 5 node windows cluster running the hyper-v role. For reasons I won’t bore you with here (pre-production testing basically), I had been evicting and adding nodes to my host level windows cluster, but when trying to add a node back I encountered errors in the validation tests. This was strange as the node had actually previously been in the cluster and nothing had been changed on it whatsoever, so I already knew that it had previously passed the validation test and been successfully running as a cluster node!

The cluster validation reported an error of this format in the storage tests named

Validate SCSI device Vital Product Data (VPD)

cluster_validate_error

It looks like the above picture when you view the validation report.

The actual errors returned were like this:

Failed to get SCSI page 83h VPD descriptors for cluster disk 1 from node <nodename> status 2

(I’ve removed the node name here obviously but it does say specifically which one in the report):

Fortunately before too long I found that this was due to a bug in hyper-v validation for which a hotfix is available here

http://support.microsoft.com/kb/2531907

Downloading and installing this on all the nodes and potential new nodes resolves the error.

This goes to show the value of pre-production testing as the aim of this cluster is to provide a dynamically expanding virtualisation service for one of our largest customers. If it came to a production situation where I needed to add a node quickly, this is not the type of scenario I would be wanting to troubleshoot live!