Bad Day in Kerberos Land
At a remote site, users began losing their connections to file shares and some internal web sites. A few users, however, were seemingly unaffected. The remote site had a single domain controller running as a virtual machine on a VMware host.
Examining the event logs on the remote site's domain controller brought forth a plague of misery. There were numerous errors with an event ID of 142 (“Time service has stopped advertising as a time source because the local clock is not synchronized”). As the desktops on the remote site were under SCCM control, a quick check was made using the SCCM status monitors. This confirmed that some computers on the site had not received their malware signature updates: their clocks had been running up to 677 seconds ahead of the SCCM server time, causing Kerberos authentication to fail. So the time on the remote domain controller was resynchronised and all was apparently well. Unfortunately, the file access problem returned a few hours later.
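To see why 677 seconds was enough to break authentication: Kerberos rejects requests when the client and server clocks differ by more than a configured tolerance, and the default Active Directory policy value is five minutes. The following is only a minimal sketch of that comparison in Python. The function name and the sample timestamps are made up for illustration, and the tolerance is the default, which may have been changed in any given domain.

```python
from datetime import datetime, timedelta

# Assumption: the domain uses the default Kerberos policy value of
# 5 minutes for "Maximum tolerance for computer clock synchronization".
MAX_CLOCK_SKEW = timedelta(minutes=5)


def within_kerberos_skew(client_time, server_time, tolerance=MAX_CLOCK_SKEW):
    """Return True if the two clocks are close enough for Kerberos.

    Hypothetical helper for illustration; it only compares timestamps
    and does not talk to a domain controller.
    """
    return abs(client_time - server_time) <= tolerance


# Arbitrary reference time standing in for the server clock.
server = datetime(2020, 1, 1, 12, 0, 0)

# A desktop running 677 seconds fast, as seen in the SCCM status monitors.
fast_client = server + timedelta(seconds=677)

print(within_kerberos_skew(fast_client, server))  # False: 677 s > 300 s
```

A client that is only a minute or two adrift passes the same check, which would fit with a few users being seemingly unaffected.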
Back to the event logs. A whole pile of errors with event IDs of 6008 (“Unexpected shutdown”) and 41 (“System rebooted without clean shutdown”) were found from the previous day. After a bit of investigation, it was determined that the electricians had been “working” on the building power supply and there had been “one or two” power outages. The virtual host server on the remote site was protected by a standalone UPS, but its battery had clearly run out of charge after the fourth consecutive power outage. Now I had the likely root cause, and it was more than possible that the Active Directory database on the domain controller had been trashed by these outages. Checking the event logs for more proof found a significant number of events with an ID of 4 (“KRB_AP_ERR_MODIFIED”) and a number of 16962 events (“Remote calls to the SAM database are being restricted”). Interestingly, the target of most of the ID 4 events was a domain controller on the main site that was, of course, working perfectly.
One final test. Given that Active Directory was sub-optimal on the remote domain controller, it was reasonable to assume that the secure channel to the domain was broken and that the domain controller would need to be demoted and repromoted. However, luck was on my side: the PowerShell cmdlet Test-ComputerSecureChannel verified that the secure channel from the wounded domain controller to every other domain controller was fine.
The diagnosis was complete and it was decided to demote the remote domain controller, wait for replication convergence, and then repromote it. A simple plan, but not without its problems.
Any attempt to demote the remote domain controller, either gracefully or by “Force Removal”, failed with a “target account name is incorrect” message, which is a Kerberos error. There is a certain irony in the fault that you are trying to correct stopping you from correcting it.
Fortunately, the resolution was straightforward. The Kerberos Key Distribution Center service on the remote domain controller was stopped and the demotion could then be completed without issue. After an hour, replication had removed the remote domain controller from Active Directory and it could safely be repromoted.
Back to business as usual. Job done!