DFBUGS-978: csi: disable fencing in Rook #794

Merged
@@ -196,8 +196,6 @@ The erasure coded pool must be set as the `dataPool` parameter in

If a node goes down while running a pod with an RBD RWO volume mounted, the volume cannot automatically be mounted on another node. The node must be guaranteed to be offline before the volume can be mounted on another node.

!!! Note
These instructions are for clusters with Kubernetes version 1.26 or greater. For K8s 1.25 or older, see the [manual steps in the CSI troubleshooting guide](../../Troubleshooting/ceph-csi-common-issues.md#node-loss) to recover from the node loss.

### Configure CSI-Addons

@@ -217,6 +215,11 @@ kubectl patch cm rook-ceph-operator-config -n<namespace> -p $'data:\n "CSI_ENABLE_CSIADDONS": "true"'
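As a quick sanity check that the setting took effect, the config map value can be read back. This is a minimal sketch using the same config map name and namespace placeholder as the command above; it is not part of this diff:

```console
# Print the CSI_ENABLE_CSIADDONS value from the operator config map; expect "true"
kubectl get cm rook-ceph-operator-config -n <namespace> -o jsonpath='{.data.CSI_ENABLE_CSIADDONS}'
```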

### Handling Node Loss

!!! warning
Automated node loss handling is currently disabled. Please refer to the [manual steps](../../Troubleshooting/ceph-csi-common-issues.md#node-loss) to recover from the node loss.
We are actively working on a new design for this feature.
For more details see the [tracking issue](https://github.com/rook/rook/issues/14832).

When a node is confirmed to be down, add the following taints to the node:

```console
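# The taint commands were collapsed in this diff view. The lines below are a sketch of
# the standard Kubernetes out-of-service taint used for non-graceful node shutdown;
# the taint key and effects are assumptions, not taken from this diff.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule
```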
3 changes: 0 additions & 3 deletions Documentation/Troubleshooting/ceph-csi-common-issues.md
@@ -413,9 +413,6 @@ Where `-m` is one of the mon endpoints and the `--key` is the key used by the CS

When a node is lost, you will see application pods on the node stuck in the `Terminating` state while another pod is rescheduled and is in the `ContainerCreating` state.
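A hedged way to observe this from the command line (plain kubectl, not part of this diff):

```console
# Show pods with their nodes so the Terminating / ContainerCreating pair stands out
kubectl get pods -A -o wide | grep -E 'Terminating|ContainerCreating'
```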

!!! important
For clusters with Kubernetes version 1.26 or greater, see the [improved automation](../Storage-Configuration/Block-Storage-RBD/block-storage.md#recover-rbd-rwo-volume-in-case-of-node-loss) to recover from the node loss. If using K8s 1.25 or older, continue with these instructions.

### Force deleting the pod

To force delete the pod stuck in the `Terminating` state:
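The command itself was collapsed in this view; a sketch of the usual kubectl force delete, with placeholder names:

```console
# Skips the graceful termination wait; only do this once the node is confirmed down
kubectl -n <namespace> delete pod <pod-name> --grace-period=0 --force
```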
6 changes: 3 additions & 3 deletions pkg/operator/ceph/cluster/watcher.go
@@ -90,9 +90,9 @@ func (c *clientCluster) onK8sNode(ctx context.Context, object runtime.Object) bool {
cluster := c.getCephCluster()

// Continue reconcile in case of failure too since we don't want to block other node reconcile
if err := c.handleNodeFailure(ctx, cluster, node); err != nil {
logger.Errorf("failed to handle node failure. %v", err)
}
// if err := c.handleNodeFailure(ctx, cluster, node, opNamespace); err != nil {
// logger.Errorf("failed to handle node failure. %v", err)
// }

// skip reconcile if node is already checked in a previous reconcile
if nodesCheckedForReconcile.Has(node.Name) {