Detach CSI Volumes After Non-Graceful Node Shutdown
If a node shuts down unexpectedly, the kubelet might not report the shutdown event to the control plane. In this case, Pods that use CSI-backed persistent volumes can remain stuck on the failed node, and the volumes might not detach automatically. In Alauda Container Platform, you can manually add an out-of-service taint to trigger Pod eviction and volume detachment.
When to use this procedure
Use this procedure when all of the following conditions are true:
- A worker node has stopped because of a power loss, operating system failure, or hardware fault.
- The node did not complete a graceful shutdown.
- Workloads that use CSI-backed persistent volumes cannot restart on another node because the volumes are still attached to the failed node.
Prerequisites
- You know the name of the affected node.
- The affected node is completely powered off or otherwise confirmed to be unavailable.
Do not apply the out-of-service taint to a node that is still running or still recovering. If the node is not fully shut down, forcing volume detachment can cause file system corruption or application-level data loss.
Procedure
- Check the status of the affected node.
- If the node is still partially reachable, shut it down completely at the infrastructure layer before you continue.
- Add the out-of-service taint to the node.
  After the taint is applied, Kubernetes starts evicting Pods from the failed node. The attach-detach controller can then detach CSI volumes so that replacement Pods can attach the same volumes on another node.
The NoExecute effect can cause Pods to be deleted and recreated on other nodes. If the workload is managed by a StatefulSet, the replacement Pod starts only after the original volume is detached successfully.
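The steps above can be sketched with `kubectl` as follows. This is a minimal sketch rather than the platform's own commands: it assumes the standard Kubernetes taint key `node.kubernetes.io/out-of-service`, uses `nodeshutdown` as a conventional taint value, and the helper function names and the node name `worker-1` are hypothetical.

```shell
# Assumption: the standard Kubernetes out-of-service taint. The value
# "nodeshutdown" is a convention, and the helper names are hypothetical.
OUT_OF_SERVICE_TAINT="node.kubernetes.io/out-of-service=nodeshutdown:NoExecute"

# Show the node status; a failed node typically reports NotReady.
check_node_status() {
  kubectl get node "$1" -o wide
}

# Add the out-of-service taint. Run this only after the node is
# confirmed to be fully powered off.
add_out_of_service_taint() {
  kubectl taint nodes "$1" "$OUT_OF_SERVICE_TAINT"
}
```

For example, `check_node_status worker-1` followed by `add_out_of_service_taint worker-1` would cover the first and last steps of the procedure for a node named `worker-1`.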
Verify the taint
Inspect the taints on the node and verify that they include the out-of-service taint with the NoExecute effect.
You can also check whether the affected Pods are being rescheduled and whether the CSI volumes are detached from the failed node.
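These checks could look like the following sketch. It assumes the standard taint key `node.kubernetes.io/out-of-service`; the helper function names are hypothetical, and the output columns are one possible choice rather than a required format.

```shell
# Print the taints on a node as "key=value:effect", one per line.
list_node_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# Succeed only if the out-of-service taint is present on the node.
# Assumption: the standard key node.kubernetes.io/out-of-service.
has_out_of_service_taint() {
  list_node_taints "$1" | grep -q '^node.kubernetes.io/out-of-service='
}

# List Pods that are still bound to the node, across all namespaces.
list_pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# List VolumeAttachment objects with their node and attach state.
# Entries that still name the failed node have not been detached yet.
list_volume_attachments() {
  kubectl get volumeattachments \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached
}
```

For a hypothetical node `worker-1`, `has_out_of_service_taint worker-1 && echo "taint present"` confirms the taint, and `list_pods_on_node worker-1` should return fewer Pods over time as they are rescheduled.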
Remove the taint after node recovery
After the node is repaired and is ready to return to service, remove the taint:
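The removal could look like the following sketch, assuming the taint was added with the standard key `node.kubernetes.io/out-of-service` and the `NoExecute` effect. The function name is hypothetical.

```shell
# Remove the out-of-service taint from a recovered node.
# The trailing "-" tells kubectl to delete the taint with this key and effect.
remove_out_of_service_taint() {
  kubectl taint nodes "$1" "node.kubernetes.io/out-of-service:NoExecute-"
}
```

For example, `remove_out_of_service_taint worker-1` clears the taint from a node named `worker-1` so that Pods can be scheduled on it again.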
If the node was replaced instead of recovered, you can remove the old Node object from the cluster after confirming that workloads are healthy on the replacement node.
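Removing the stale Node object might be done as in the sketch below. The function name is hypothetical, and the sketch assumes you have already confirmed that the workloads are healthy on the replacement node.

```shell
# Delete the Node object of a node that was replaced rather than recovered.
# Assumption: workloads have already been verified on the replacement node.
delete_replaced_node() {
  kubectl delete node "$1"
}
```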