I was building a new SQL Server 2008 R2 failover cluster recently and encountered a problem that I hadn’t seen before (which is rare, as I’ve seen A LOT of cluster setup problems in my time!). It was strange because the error came up before setup actually ran, while I was still going through the dialog boxes to configure it.
The scenario was this:
1. The cluster was fully built and validated at the Windows level, and all resources were up and OK
2. I was about to run SQL Server setup when I noticed the network binding order was wrong
3. I corrected it and then rebooted both nodes, as I always do before a cluster setup
4. The nodes came back online OK and all resources came up as well
5. I ran setup, but when I got to the cluster network configuration dialog box there were no networks to select from, so I couldn’t go forward.
My first thought was that I must have done something dumb when changing the network binding order, but checks on the network adapters showed that they were all up. I then went back through a few other things and noticed that the real cause was that the cluster service was having trouble connecting to one of the networks. There were two types of error and warning in the cluster logs and the system event logs:
Cluster network ‘xxxxx’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Cluster network interface ‘xxxxx – xxxxx’ for cluster node ‘xxxxx’ on network ‘xxxxx’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
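If you have exported the event text to a file and want to check quickly whether either of these messages is present, something like the following Python sketch may help. It matches only the message wording quoted above; the function name and the idea of scanning an exported text file are my own assumptions, and the real log format around each message will vary.

```python
import re

# Patterns built from the two message texts quoted above. Both straight and
# curly quotes are accepted, since exports differ in how they render them.
PARTITIONED = re.compile(
    r"Cluster network ['‘](?P<network>[^'’]+)['’] is partitioned"
)
UNREACHABLE = re.compile(
    r"Cluster network interface ['‘](?P<interface>[^'’]+)['’] "
    r"for cluster node ['‘](?P<node>[^'’]+)['’] "
    r"on network ['‘](?P<network>[^'’]+)['’] is unreachable"
)


def find_network_problems(text):
    """Return (kind, network, node) tuples for each matching line.

    Hypothetical helper: 'text' is assumed to hold one message per line.
    """
    problems = []
    for line in text.splitlines():
        match = PARTITIONED.search(line)
        if match:
            problems.append(("partitioned", match.group("network"), None))
            continue
        match = UNREACHABLE.search(line)
        if match:
            problems.append(
                ("unreachable", match.group("network"), match.group("node"))
            )
    return problems
```

Any hit for a network that looks healthy from outside the cluster is a hint that the cluster service and the adapter disagree about its state, which is what we were seeing here.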
I had to engage the help of some network specialists, as I couldn’t get to the bottom of this on my own. The networks appeared to be up, and we could connect to them and use them independently outside of the cluster, but the cluster was convinced that they were partitioned. To cut a long story short, after checking many things we realised that one of the networks was a teamed network implemented using BASP (Broadcom Advanced Server Program) virtual adapters. The team was not coming up fast enough after a node rebooted, so the cluster service started and tried to bind the network in as a resource before the team was ready.
The fix was simple: we set the cluster service to delayed start, and then everything was fine. We didn’t need to make any other configuration changes. Once the cluster service was happy that the network was OK, SQL Server setup was able to continue just fine.
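If you would rather script the delayed start change than set it by hand, here is a minimal Python sketch. It assumes the cluster service has its usual service name, ClusSvc, and that the change is made with sc.exe from an elevated prompt. That is one way to do it, not necessarily how we did it on the day, and the function names are just for illustration.

```python
import subprocess

# Assumption: the Windows cluster service uses its standard name, ClusSvc.
CLUSTER_SERVICE = "ClusSvc"


def delayed_start_command(service=CLUSTER_SERVICE):
    """Build the sc.exe command that sets a service to delayed automatic start.

    sc.exe expects "start=" and its value as separate tokens.
    """
    return ["sc.exe", "config", service, "start=", "delayed-auto"]


def set_delayed_start(service=CLUSTER_SERVICE):
    """Apply the change on the local node (needs administrator rights)."""
    subprocess.run(delayed_start_command(service), check=True)
```

Calling set_delayed_start() would need to be done on every node in the cluster, followed by a reboot, to confirm that the networks show up as healthy.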
Good luck with your cluster builds!