I’ve set up a number of Windows Server 2008 (and later) Failover Clusters in the past, and this is the first time I’ve seen this. I’m running the Failover Cluster Validation Wizard on two servers that will be used as nodes for a Windows Failover Cluster. When I run the wizard from ServerB, everything works fine. But when I try to run it from ServerA, I get blocked at the step where I add the servers to validate, with the following error message.

An error occurred get the cluster node state for “ServerB.” Access is denied.
This is the error message I get when I run the Failover Cluster Validation Wizard from ServerA and add ServerB.

I couldn’t find anything in the Windows Event Log pointing to that specific permission issue. I even ran Process Monitor to check for possible permission issues on the file system as well as the registry. When I tried to generate the cluster log file, I got the error message below.

System error 2 has occurred (0x00000002).
The system cannot find the file specified.

Note that I don’t even have a cluster yet; I’m just running the validation wizard before creating one. A number of results popped up on Google, one even recommending using a Windows Server 2003 Active Directory, which I think is a very drastic way to resolve this issue. I thought rebuilding the OS from scratch on both servers would fix it, since both had been deployed from a virtual machine image used for easier deployment. I’m a strong believer in making sure you have a clean Windows Failover Cluster configuration before going live. However, the OS rebuild didn’t fix it.

Frustrated that I couldn’t figure out what was wrong, I opened a case with Microsoft. After hours of investigation, the Microsoft engineer finally found the culprit. The Failover Cluster Validation Wizard would not even let me add one of the servers because the two servers had different system dates. ServerA had a system date of 21-Jun-2011 while ServerB had a system date of 22-Jun-2011, off by exactly 24 hours. Both servers had the same system time and time zone configuration.

Log Name: System
Source: GroupPolicy
Event ID: 1126
User: SYSTEM
OpCode: (1)
Logged: 6/21/2011 5:19:49 PM
Task Category: NONE
Computer: SERVERB
Windows was unable to determine whether new Group Policy settings defined by a network administrator should be enforced for this
user or computer because this computer’s clock is not synchronized with the clock of one of the domain controllers for the domain.
Because of this issue, this computer system may not be in compliance with the network administrator’s requirements, and users of
this system may not be able to use some functionality on the network. Windows will periodically attempt to retry this operation, and
it is possible that either this system or the domain controller will correct the time settings without intervention by an administrator,
so the problem will be corrected.

If this issue persists for more than an hour, checking the local system’s clock settings to ensure they are accurate and are synchronized
with the clocks on the network’s domain controllers is one way to resolve this problem. A network administrator may be required to
resolve the issue if correcting the local time settings does not address the problem.

After the system date was corrected on ServerB, the Failover Cluster Validation Wizard went through smoothly. Out of curiosity, I decided to reproduce this in my test environment. I have two Windows Server 2008 R2 servers, and I configured one with a system date 24 hours ahead of the other. Running the Failover Cluster Validation Wizard gave me a different error message this time.

This was actually the reason why I checked for the Remote Registry service on both nodes the first time I ran the Failover Cluster Validation Wizard. This MSDN article highlights all of the requirements you need to meet before you install a SQL Server Failover Cluster (yes, this is for a SQL Server setup). But what’s fascinating is that servers that are members of an Active Directory domain take their time from the domain hierarchy, with the domain controller holding the PDC Emulator FSMO role as the authoritative time source by default. If the system time is accidentally changed by an administrator, a reboot will automatically correct it. This is exactly what happened when I rebooted the server with the incorrect system time: the reboot automatically corrected it.

I didn’t get the chance to dig deeper into why the system time on the servers I was working on got changed. Besides, I do not have access to their domain controllers to even look. But this is something to watch out for when deploying a Windows Failover Cluster. Who would ever think that an incorrect system time would be a showstopper in building a Windows Failover Cluster? I just wish there were a more intuitive error message that told me what the real problem was, instead of leaving me banging my head and losing more hair trying to figure out which ACLs and permissions were not granted because of the “Access is denied” error message.
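
A pre-flight clock check along these lines would have saved me the support case. What follows is only a minimal Python sketch, not anything the validation wizard itself does. The function names and the `readings` dictionary are hypothetical, and the timestamps are illustrative values built from the dates in this post. The 5-minute tolerance is my assumption: it matches the default Kerberos clock-skew tolerance in Active Directory, but the Microsoft engineer did not confirm that Kerberos was the mechanism behind the “Access is denied” error.

```python
from datetime import datetime, timedelta

# Assumption: 5 minutes mirrors the default Kerberos clock-skew tolerance
# in an Active Directory domain. Adjust to match your own domain policy.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(node_times):
    """Return the largest gap between any two node clocks.

    node_times maps a node name to a datetime; all values are assumed
    to be expressed in the same time zone (UTC here).
    """
    times = list(node_times.values())
    return max(times) - min(times)


def nodes_in_sync(node_times, tolerance=MAX_SKEW):
    """True if every node clock is within `tolerance` of every other."""
    return clock_skew(node_times) <= tolerance


# Illustrative readings: same time of day, dates one day apart,
# as on ServerA and ServerB in this post.
readings = {
    "ServerA": datetime(2011, 6, 21, 17, 19, 49),
    "ServerB": datetime(2011, 6, 22, 17, 19, 49),
}

skew = clock_skew(readings)
if nodes_in_sync(readings):
    print(f"Clocks agree within {MAX_SKEW} (skew: {skew})")
else:
    print(f"Clock skew of {skew} exceeds {MAX_SKEW}; fix time sync first")
```

How you collect the readings is up to you. On Windows, `w32tm /stripchart /computer:<name>` reports the offset between the local clock and a remote machine, which is another quick way to spot the same problem before running the wizard.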
