Replacing or Removing Storage Nodes
This document describes how to remove a storage node from a Rook-Ceph cluster managed by the Container Platform. Depending on whether the remaining OSDs have sufficient capacity to absorb the data from the node being removed, you may need to add a replacement node first.
Prerequisites
- All cluster components (except the failing node, if applicable) are functioning properly.
- Before starting, note how many disks the old node has and which device class each disk belongs to.
Constraints and Limitations
- In a three-node Ceph cluster, losing one node already reduces redundancy. Complete the procedure as quickly as possible to minimize the risk window.
- During data rebalancing, cluster I/O performance may be temporarily degraded.
- Do not proceed if the cluster is in HEALTH_ERR for reasons other than the node being removed. Proceeding in that state may further compromise data resilience.
Procedure
Check Cluster State and Capacity
- Identify all OSD IDs running on the node to be removed, along with their disk usage.
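As a sketch, assuming the cluster runs in the rook-ceph namespace and the rook-ceph-tools deployment is available (both names may differ in your environment):

```shell
# List OSDs grouped by host; note the OSD IDs shown under the old node
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree

# Show per-OSD size, usage (USE), and free space (AVAIL)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
```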
- Verify overall cluster health.
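For example, again assuming the rook-ceph namespace and tools deployment name:

```shell
# Overall cluster state; look for HEALTH_OK before continuing
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
```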
- Check the capacity of all OSDs.
Sum the USE values of all OSDs on the node to be removed, then confirm that the sum of AVAIL across the remaining OSDs (on other nodes) exceeds that total. This ensures the remaining OSDs have enough free space to absorb the data after the node is removed.
If the remaining capacity is insufficient, proceed to the next step to add a replacement node first. Otherwise, skip to Adjust Component Deployment Configuration.
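A hypothetical worked example of the capacity check, with OSD IDs and sizes that are purely illustrative:

```shell
# Print the per-OSD capacity table
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df

# Suppose the old node hosts osd.3, osd.7, and osd.11, and the output shows:
#   osd.3  USE 1.2 TiB,  osd.7  USE 0.9 TiB,  osd.11 USE 1.1 TiB  -> total 3.2 TiB
#   AVAIL summed across all remaining OSDs                        -> 5.0 TiB
# Since 5.0 TiB > 3.2 TiB, the remaining OSDs can absorb the data
# and no replacement node is required.
```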
Add a Replacement Node (If Needed)
If the remaining OSDs do not have enough free capacity, add a replacement node before removing the old one.
- Enter the Container Platform.
- Add the replacement machine as a new cluster node using the platform's node management functionality.
- After the node has joined the cluster, add it as a storage node. Navigate to Storage Management > Distributed Storage > Device Classes.
- Click Add Device, select the new node, and choose the appropriate disks. If the old node had multiple disks across different device classes, repeat this step for each disk/device-class combination until all disks are added.
- Wait for the new OSDs to become active and for data rebalancing to complete. Monitor progress, and wait until the cluster returns to HEALTH_OK with no misplaced or recovering PGs before proceeding.
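Progress can be watched with commands such as the following (namespace and tools deployment names are assumptions):

```shell
# Health summary plus recovery/rebalance progress
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Compact placement-group summary; done when all PGs are active+clean
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg stat
```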
Adjust Component Deployment Configuration
Rook-managed Ceph daemons (MON, MGR, MDS) may be scheduled on the old node. Exclude the old node from component scheduling so the operator reschedules them onto other nodes.
- In the Container Platform, navigate to Storage Management > Distributed Storage > Storage Components > Component Deployment Configuration.
- Enable node binding and select only the nodes that should remain in the cluster (excluding the node to be removed).
- Wait for all MON, MGR, and MDS pods to be running on the remaining nodes before proceeding.
Mark All OSDs Out and Wait for Data Migration
- Enable the rook-ceph-tools pod if it is not already running.
- Enter the tools pod.
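Assuming the toolbox runs as the rook-ceph-tools deployment in the rook-ceph namespace, an interactive shell can be opened with:

```shell
# Open a shell inside the Rook toolbox pod
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
```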
- Mark each OSD on the old node as out. This instructs Ceph to migrate all data off those OSDs onto the remaining OSDs. Repeat for each OSD ID on the node.
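For instance, if the old node hosts OSDs 3 and 7 (placeholder IDs), run inside the tools pod:

```shell
# Mark each of the old node's OSDs out; Ceph begins migrating their data
ceph osd out 3
ceph osd out 7
```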
- Monitor rebalancing progress until the cluster returns to HEALTH_OK with no misplaced or recovering PGs. Do not proceed until data migration is fully complete; removing OSDs before migration finishes will result in data loss.
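Inside the tools pod, migration status can be checked as follows; ceph osd safe-to-destroy (available in recent Ceph releases) reports whether an OSD still holds data the cluster needs. The OSD ID is a placeholder:

```shell
# Wait for HEALTH_OK and all PGs active+clean
ceph status

# Optionally confirm an outed OSD no longer holds required data
ceph osd safe-to-destroy osd.3
```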
Remove the Old Node's OSDs
- Edit the CephCluster resource to remove the old node entry. Locate the old node under spec.storage.nodes, delete the entire node entry, then save and exit.
- Delete the OSD deployment for each OSD on the old node. Repeat for each OSD ID on the node.
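A sketch of both steps, assuming the CephCluster resource is named rook-ceph in the rook-ceph namespace and the old node hosted OSDs 3 and 7 (all placeholders):

```shell
# Open the CephCluster spec and delete the old node's entry
# under spec.storage.nodes
kubectl -n rook-ceph edit cephcluster rook-ceph

# Delete the Deployment for each OSD that ran on the old node
kubectl -n rook-ceph delete deployment rook-ceph-osd-3 rook-ceph-osd-7
```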
- Enter the tools pod and permanently remove each OSD from the cluster.
Inside the tools pod, run the removal command for each OSD on the old node.
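One way to do this is with ceph osd purge, which removes the OSD from the CRUSH map, deletes its auth entry, and removes it from the OSD map in a single step (OSD IDs are placeholders):

```shell
# Permanently remove each OSD; repeat for every OSD ID on the old node
ceph osd purge 3 --yes-i-really-mean-it
ceph osd purge 7 --yes-i-really-mean-it
```

On older Ceph releases without purge, the equivalent sequence is ceph osd crush remove, ceph auth del, and ceph osd rm for each OSD.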
Verify Cluster Health
- Confirm that all removed OSDs no longer appear in the cluster.
- Verify that the cluster has returned to a healthy state.
The output should show HEALTH_OK with all PGs in the active+clean state.
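Both verifications can be run from outside the cluster, again assuming the rook-ceph namespace and tools deployment:

```shell
# The purged OSD IDs should no longer be listed
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree

# Expect HEALTH_OK with all PGs active+clean
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
```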