Detach CSI Volumes After Non-Graceful Node Shutdown

If a node shuts down unexpectedly, the kubelet might not report the shutdown event to the control plane. In this case, Pods that use CSI-backed persistent volumes can remain stuck in the Terminating state on the failed node, and the volumes might not detach automatically. In Alauda Container Platform, you can manually add the out-of-service taint to the node to trigger Pod deletion and volume detachment.

When to use this procedure

Use this procedure when all of the following conditions are true:

A worker node has stopped because of a power loss, operating system failure, or hardware fault.
The node did not complete a graceful shutdown.
Workloads that use CSI-backed persistent volumes cannot restart on another node because the volumes are still attached to the failed node.

Prerequisites

You know the name of the affected node.
The affected node is completely powered off or otherwise confirmed to be unavailable.

WARNING

Do not apply the out-of-service taint to a node that is still running or still recovering. If the node is not fully shut down, forcing volume detachment can cause file system corruption or application-level data loss.

Procedure

Check the status of the affected node.
kubectl get node <node_name>
If the node is still partially reachable, shut it down completely at the infrastructure layer before you continue.
Add the out-of-service taint to the node.
kubectl taint node <node_name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
After the taint is applied, Kubernetes force-deletes the Pods on the failed node that do not tolerate the taint. The attach-detach controller then detaches their CSI volumes immediately, so that replacement Pods can attach the same volumes on another node.

WARNING

The NoExecute effect can cause Pods to be deleted and recreated on other nodes. If the workload is managed by a StatefulSet, the replacement Pod is created only after the original Pod is deleted, and it starts only after the original volume is detached successfully.
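
The check-then-taint sequence in this procedure can be sketched as a small shell helper. This is a minimal sketch, not a supported tool: the function names node_ready_status and taint_out_of_service are hypothetical, and a Ready status other than True only shows that the control plane has lost contact with the node. It does not prove that the node is powered off, so you still need to confirm that at the infrastructure layer.

```shell
#!/usr/bin/env bash
# Sketch only: the function names are illustrative, not part of any product CLI.

# Print the status of the node's Ready condition: True, False, or Unknown.
node_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Apply the out-of-service taint only when the node does not report Ready.
# Assumption: the caller has already confirmed that the node is powered off.
taint_out_of_service() {
  local node="$1" status
  status="$(node_ready_status "$node")" || return 1
  if [ "$status" = "True" ]; then
    echo "Node $node still reports Ready; not tainting." >&2
    return 1
  fi
  kubectl taint node "$node" \
    node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
}
```

After sourcing the file, you would call taint_out_of_service <node_name>. The helper refuses to act on a node that still reports Ready, which guards against the most obvious mistake described in the warning above.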

Verify the taint

Run the following command:

kubectl describe node <node_name>

Verify that the node includes the following taint:

node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

You can also check whether the affected Pods are being rescheduled and whether the CSI volumes are detached from the failed node.
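
These checks can be scripted, for example as follows. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes that your account can list Pods in all namespaces and read the cluster-scoped VolumeAttachment objects.

```shell
#!/usr/bin/env bash
# Sketch only: the function names are illustrative.

# Succeed if the node carries the out-of-service taint.
has_out_of_service_taint() {
  kubectl get node "$1" -o jsonpath='{.spec.taints[*].key}' \
    | tr ' ' '\n' | grep -qxF 'node.kubernetes.io/out-of-service'
}

# List Pods that are still bound to the node, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}

# Print the names of VolumeAttachment objects that still reference the node.
attachments_on_node() {
  kubectl get volumeattachments \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' \
    | awk -F'\t' -v node="$1" '$2 == node { print $1 }'
}
```

When the detach has completed, attachments_on_node <node_name> should print nothing, and pods_on_node <node_name> should no longer list the Pods that use the affected volumes.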

Remove the taint after node recovery

After the node is repaired and is ready to return to service, remove the taint:

kubectl taint node <node_name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-

If the node was replaced instead of recovered, you can remove the old Node object from the cluster after confirming that workloads are healthy on the replacement node.
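
For the replacement case, a cautious cleanup could look like the sketch below. The function names are hypothetical, and the extra check that no VolumeAttachment object still references the old node is an added safeguard assumed here, not a requirement stated by this procedure.

```shell
#!/usr/bin/env bash
# Sketch only: the function names are illustrative.

# Count VolumeAttachment objects that still reference the given node.
attachments_for_node() {
  kubectl get volumeattachments \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
    | grep -cxF "$1" || true
}

# Delete the old Node object only when nothing is still attached to it.
remove_replaced_node() {
  local node="$1" count
  count="$(attachments_for_node "$node")"
  if [ "$count" -ne 0 ]; then
    echo "Node $node still has $count VolumeAttachment(s); not deleting." >&2
    return 1
  fi
  kubectl delete node "$node"
}
```

Confirm that workloads are healthy on the replacement node before you call remove_replaced_node <node_name>; the script does not check this for you.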
