We recently upgraded our vCenter servers to vSphere 5.1.
With vSphere 5.1 there is a considerable change in how vCenter Server is installed. vCenter Server is now composed of the Single Sign On server service, the Inventory Server service and the vCenter Server service. Authorization to the vCenter instance is now done against the Single Sign On server (SSO), and by default the local administrators group of the vCenter server is no longer administrator on the vCenter Server instance. It is recommended to use an AD group as administrator of your vCenter instance.
About 2 weeks ago we installed the new vCenter servers. Following the install directions we installed the SSO server, followed by the Inventory server for the first vCenter, and then installed the first vCenter server, all on one machine. After the install there was a linked mode error, which could easily be removed. We added the AD group for administrators to the top level of the vCenter instance, since we forgot to do that during the install. The next day we installed the second vCenter server, pointing to the SSO on the first vCenter for authentication.
Because we wanted to be able to use the vSphere client and not just the vSphere webclient, we added the servers to a linked mode group. We could now see both servers from the client, but, oddly enough, only when we connected to the first vCenter server. Other than that, everything seemed fine. There were some hiccups with security groups that were renamed in AD after the fact, but nothing that couldn’t be resolved.
Until last Friday, when, after I applied the Microsoft patches and rebooted the servers, the vCenter Server service on the first vCenter server wouldn’t start anymore. The second vCenter server was fine, indicating that it might not be related to the Microsoft patches. And, yes, I did check whether Microsoft Security Advisory update KB2661254 had perhaps been installed by mistake. This was not the case, so the SSL certificates were OK (VMware KB 20370).
When I checked the Windows event log I found the following message:
|The description for Event ID 1000 from source VMware VirtualCenter Server cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.|
According to VMware KB 2015824 this message suggests an error in the connection to the vCenter database.
I checked all the points mentioned in KB 2015824 and then restarted the vCenter service:
- the DSN
- Log on As in Windows for the VMware VirtualCenter Server service
- the database username in the registry
- Reset the encrypted password
Everything was fine, but the vCenter service still wouldn’t start. Actually it would start, but then stop immediately, with a very depressing message from Microsoft about how some services do this when they are not being used.
After googling a bit I found a KB from VMware that describes this problem. According to VMware KB 1008114, when the vCenter Server service starts and immediately stops, and messages comparable to the ones below are found in the vpxd log, the rights for the administrators group may have been changed from “administrator” to “Virtual Machine user”.
|[2009-01-05 19:36:37.130 'App' 12216 error] Failed to add default permission: permission already exists
[2009-01-05 19:36:37.130 'App' 12216 error] Cannot start authorize - system has no access rules
[2009-01-05 19:36:37.146 'App' 12216 error] [Auth] Failed to initialize: <Authorize Exception>
[2009-01-05 19:36:37.146 'App' 12216 error] Failed to initialize security
[2009-01-05 19:36:37.146 'App' 12216 info] Shutting down VMware VirtualCenter...
[2009-01-05 19:36:38.146 'App' 12216 info] [VpxdServer] Exit done..|
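Scanning a large vpxd log by eye for these messages is tedious, so here is a minimal Python sketch of how that search could be automated. It assumes the log lines follow the layout of the excerpt above; the function name `find_auth_failures` and the list of signatures are my own choices for illustration, not something taken from the KB.

```python
import re

# Substrings of the error messages quoted from the vpxd log above.
SIGNATURES = (
    "Failed to add default permission",
    "Cannot start authorize",
    "Failed to initialize security",
)

# Assumed line layout: [2009-01-05 19:36:37.130 'App' 12216 error] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[\d\-: .]+) '(?P<src>[^']*)' (?P<tid>\d+) (?P<level>\w+)\] (?P<msg>.*)$"
)


def find_auth_failures(lines):
    """Return (timestamp, message) pairs for error lines matching a known signature."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or match.group("level") != "error":
            continue  # skip lines in another layout and non-error levels
        message = match.group("msg")
        if any(signature in message for signature in SIGNATURES):
            hits.append((match.group("ts"), message))
    return hits
```

Feeding it the lines of a log file (for example `find_auth_failures(open(path))`, where `path` is wherever your vpxd log lives) gives a short list of the lines worth reading.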
According to KB 1008114 this can be resolved by editing the VPX_ACCESS table in the vCenter database.
- Log into SQL 2005/2008 with SQL Management Studio
- Start a new query in the SQL Management Studio: Select * from vpx_access
- Execute the query against the vCenter database
- Check the VPX_ACCESS table and check if the ROLE_ID for Administrators is -1. When it isn’t -1, change the value to -1
NB: This can be done on a SQL 2008 server by right-clicking the VPX_ACCESS table and selecting the Edit Top 200 Rows option
VPX_ACCESS table content explained:
ID = unique identifier
PRINCIPAL = the user or group
ROLE_ID = the role appointed within vCenter (-1 is administrator)
ENTITY_ID = the ID of the object within vCenter server where the role is assigned (VPX_ENTITY table)
FLAG = marks a user or a group (3 is a group, 1 is a user)
After editing the role of the administrators, restart the VMware vCenter Server Service.
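The check-and-fix logic from the KB can be sketched in a few lines of Python. To keep the example runnable anywhere it uses an in-memory SQLite database as a stand-in for the real SQL Server vCenter database, so this is an illustration of the logic only, not something to run against production. The column names come from the table description above; the demo rows, the role IDs 5 and 4, the `EXAMPLE\someuser` principal and both function names are made up.

```python
import sqlite3

ADMIN_ROLE_ID = -1  # per the table description: -1 is administrator


def make_demo_db():
    """Build an in-memory stand-in for VPX_ACCESS with invented demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE VPX_ACCESS ("
        "ID INTEGER PRIMARY KEY, PRINCIPAL TEXT, ROLE_ID INTEGER, "
        "ENTITY_ID INTEGER, FLAG INTEGER)"
    )
    conn.executemany(
        "INSERT INTO VPX_ACCESS VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Administrators", 5, 1, 3),    # group with a wrong role (5 is invented)
            (2, "EXAMPLE\\someuser", 4, 1, 1), # user row, left untouched
        ],
    )
    return conn


def check_and_fix_admin_role(conn, principal="Administrators"):
    """Return 'missing', 'ok' or 'fixed' for the given principal."""
    rows = conn.execute(
        "SELECT ID, ROLE_ID FROM VPX_ACCESS WHERE PRINCIPAL = ?", (principal,)
    ).fetchall()
    if not rows:
        return "missing"  # nothing to edit, e.g. an empty table
    wrong_ids = [row_id for row_id, role_id in rows if role_id != ADMIN_ROLE_ID]
    if not wrong_ids:
        return "ok"
    conn.executemany(
        "UPDATE VPX_ACCESS SET ROLE_ID = ? WHERE ID = ?",
        [(ADMIN_ROLE_ID, row_id) for row_id in wrong_ids],
    )
    conn.commit()
    return "fixed"
```

The "missing" result is the interesting one here: it is the case the KB does not cover, where there is no row to edit at all.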
Following the instructions in the KB I tried to do that. But, to my surprise, the table was empty. No users, groups or roles to edit.
Now in vCenter 5.1 a Single Sign On server is used to authenticate. The SSO will remove the local\administrators access. So I didn’t expect administrators to be in the VPX_ACCESS table, but neither did I expect it to be empty. When I checked the VPX_ACCESS table of the other vCenter server, I noticed that table wasn’t empty, so there should have been several entries on the troubled server as well. Using the correct VPX_ACCESS table from the other vCenter server as a reference, I added the users and groups that should be present to the VPX_ACCESS table of the troubled server and restarted the vCenter Server service, only to find that the restart had emptied the table again. Now how is this possible? What’s happening here?
After some further research I found it was also important to check the ports used for reading the Active Directory Lightweight Directory Services instance. This is explained in VMware KB 1023864.
I checked both ports in the registry:
Value: Port LDAP
Data: 1 – 65535 (default: 389)
Value: Port SSL
Data: 1 – 65535 (default: 636)
When the Port SSL value isn’t a REG_DWORD but a REG_SZ, you have to remove it and recreate it as a REG_DWORD with a decimal value of 636.
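The rule for these two values is simple enough to write down as a small Python sketch. It deliberately takes the value type and data as arguments instead of reading the registry itself, because the registry key path is not given here and I don't want to guess it. `REG_SZ = 1` and `REG_DWORD = 4` are the standard Windows registry type codes; the function name and the returned strings are my own.

```python
REG_SZ = 1     # Windows registry type code for a string value
REG_DWORD = 4  # Windows registry type code for a 32-bit number


def assess_port_value(value_type, data):
    """Classify a 'Port LDAP' / 'Port SSL' registry value by type and range."""
    if value_type == REG_SZ:
        # The case described above: a string where a number is expected.
        return "recreate as REG_DWORD"
    if value_type != REG_DWORD:
        return "unexpected type"
    if not isinstance(data, int) or not 1 <= data <= 65535:
        return "out of range"
    return "ok"
```

On a Windows machine the two arguments could come from Python's `winreg.QueryValueEx`, which returns the data and the type code as a pair, once you have opened the right key.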
After some extra reading I also found that this is true unless it is a vCenter server in linked mode, in which case port 636 is used for the local instance of the linked mode server. The SSL port can be removed or moved to any port from 1025 to 65535 (required port vCenter 5.1).
In our case the vCenter was installed without changing the default administrators group to the AD group created for this purpose, nor was the default port adjusted to 1025 (or higher). Upon checking the port it turned out to be a REG_SZ entry. So I removed it and recreated it with a decimal value of 1025. This, however, did not resolve the issue and the VPX_ACCESS table remained empty.
I decided my only option at this point was to uninstall and reinstall the vCenter server, and checked whether this would be a problem for the remaining linked mode vCenter server. This was supported (linked mode), so I proceeded to uninstall and reinstall the vCenter server, using the correct AD group and changing the default SSL port setting to 1025 during the install.
If you need to uninstall and reinstall vCenter Server on more than one member of a Linked Mode group, do so with a single vCenter Server at a time. Uninstalling and reinstalling multiple linked vCenter Servers at the same time is not supported, and can cause errors that prevent vCenter Server from connecting to vCenter Inventory Service. If it is necessary to uninstall and reinstall multiple linked vCenter Servers at the same time, isolate them from the Linked Mode group first, and rejoin them to the Linked Mode group after the reinstallation is complete.
After the reinstall, to my relief, the vCenter 5.1 server service started. I checked all the security information on the vCenter instance and, as expected, at top level there was only the AD group added during the install. On the lower level folders with permissions added, the groups were, strangely enough, still there, which I did not expect, since the VPX_ACCESS table was empty. I corrected all rights and then checked the linked mode status. Our second server still thought we were in linked mode, but the newly installed server didn’t. When I checked the vSphere webclient I noticed that somehow our vCenters (and all other objects) had doubled. I gathered this was probably an artifact of the linked mode status, which hadn’t been corrected yet, and figured I would first correct the linked mode group; if the problem still existed after that, there would be plenty of time left to worry about it.
After re-adding the newly installed vCenter server to the linked mode group with the linked mode wizard, all was well again and the webclient showed the correct number of objects.
This exercise took the entire weekend, trying to unravel all aspects of this odd server crash. I find it very disheartening that the defaults used in the install wizard of the vCenter (administrators and SSL port 636) could lead to such a major server crash. But I am glad that, in the end, it was easily fixed.
On a side note I would like to add that my colleague ran into the exact same error a week ago, when he replaced the SSL certificates on our test vCenter server (no linked mode, standalone server). This server also has the SSL port set to REG_SZ instead of REG_DWORD, and the same omission was made during the initial install with the replacement of the administrators group. He fixed this by reinstalling the entire server after he restored the database. I am curious to see if it could also have been fixed with just a reinstall of the vCenter and changing the defaults during the install (AD group and SSL port to 1025). I may have to try that soon.