Tanzu: How to recover from orphaned TKC nodes

Kubernetes is quite good at maintaining the desired state of application components like pods, thanks to deployments, replicasets, daemonsets, and so on. But when it comes to cluster-level components like nodes, this can become a challenging task even with Cluster API in the picture.

Today, I wanted to see how it is possible to recover from a situation where something really goes wrong with TKC nodes at the vSphere level. Before that, let me elaborate on what I mean by “something really wrong”.

If you carefully inspect the virtual machines deployed by WCP (Workload Control Plane) at the vSphere layer, you’ll see that the actions you are familiar with from traditional VMs (like stop, start, remove and even open console) are grayed out. This is by design, and it gives the WCP Service full control of the VMs, so vi-admins cannot accidentally stop or remove them.

Of course, a failed ESXi server won’t cause anything spectacular either, as it will trigger a typical HA scenario and the node will be up and running on a different ESXi host in a minute or two. The scenario I have in mind is: what if something goes wrong at the storage layer and the VM becomes completely inaccessible? We all know that this can happen.

Triggering a failed node

In order to destroy a TKC node on vSphere, my plan is to trigger the failure from the ESXi server it runs on. Finding out which ESXi server a VM is running on is the moooost straightforward thing we can imagine, but here I’ll show the Kubernetes way.

  • Log in to the Supervisor Cluster with the default administrator or an authorized SSO account.
  • Query the virtualmachine resources with kubectl and get the ESXi host names.
kubectl -n your-namespace get tkc 
kubectl -n your-namespace get virtualmachines -o custom-columns='NAME:.metadata.name,HOST:.status.host'
  • SSH to the ESXi server with root.
  • Get the ID of the virtual machine, power it off, and then destroy it.
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/power.off <id-of-the-VM>
vim-cmd vmsvc/destroy <id-of-the-VM>
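
If it helps, one way to match the node name reported by kubectl against the ESXi inventory and grab its ID is a simple filter; a small sketch, with the node name as a placeholder:

vim-cmd vmsvc/getallvms | grep <name-of-the-TKC-node>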

If we just power off the VM without destroying it, we’ll see it back up within a minute, which means the Supervisor Cluster is taking care of it. But here we’re pushing the limits, and at the end I’d expect to see the VM left behind as an orphan in the vCenter inventory.

How to recover

This is where we can question why this VM is not recovered automatically if Kubernetes is so capable of maintaining the desired state. My explanation is that we have triggered this at the vSphere (or, in Kubernetes terms, the infrastructure or cloud provider) layer and Cluster API is not aware of the problem. We can confirm this by listing the virtualmachine and machine instances registered in our Supervisor namespace, as below.
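
A sketch of those queries, assuming the placeholder namespace from earlier (the field paths come from the vm-operator and Cluster API schemas, so they may vary slightly between releases):

kubectl -n your-namespace get virtualmachines -o custom-columns='NAME:.metadata.name,POWER:.status.powerState'
kubectl -n your-namespace get machines -o custom-columns='NAME:.metadata.name,PHASE:.status.phase'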

Even though the VM no longer exists and is shown as orphaned in vCenter, the related resources within the Supervisor Cluster still appear as poweredOn and Running. Normally, there are other resource types called “machinedeployment” and “machineset” which are supposed to create new machines whenever reality does not match the desired state, but in this case everything looks normal at that level and no action is taken.
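
Again with the placeholder namespace, a quick way to confirm that everything still looks normal at that level:

kubectl -n your-namespace get machinedeployments,machinesets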

I think this is something that will be fixed in future releases; until then, a simple manual step suffices.

  • We need cluster-admin privileges on the Supervisor Cluster to perform this. Please check out my previous blog post for more details.
  • Delete the machine resource which corresponds to the failed node (see the sketch after this list).
  • This will trigger vCenter to delete the VM from its database (no more orphaned VM).
  • The machineset will also notice that one machine is missing and, in order to keep the desired state, it will create another machine resource (and eventually a virtualmachine resource).
  • After a few minutes, the node and the machine resource will be shown as Ready.
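
A minimal sketch of the manual step itself, assuming the placeholder namespace from earlier and a machine name taken from the listing (the actual name will be whatever your TKC generated):

kubectl -n your-namespace get machines
kubectl -n your-namespace delete machine <name-of-the-failed-machine>
kubectl -n your-namespace get machines -w

The last command simply watches until the replacement machine shows up and reaches the Running phase.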

This is applicable to TKC worker nodes as well as TKC control plane nodes, but not to Supervisor Control Plane VMs.

Note: There is another way to remediate this situation with the help of health checks at the Kubernetes level. This will be the subject of a future post.