feat(storage tiers): add design proposal #268

Open · wants to merge 1 commit into develop
241 changes: 241 additions & 0 deletions doc/storage-tier-design.md
---
title: Storage Tiering
authors:
- "@croomes"
owners:

creation-date: 2022-07-04
last-updated: 2022-07-04
---

# Storage Tiering

## Table of Contents

- [Storage Tiering](#storage-tiering)
  - [Table of Contents](#table-of-contents)
  - [Summary](#summary)
  - [Problem](#problem)
  - [Current Solution](#current-solution)
  - [Proposal](#proposal)
    - [DiskPool Classification](#diskpool-classification)
    - [StorageClass Topology Constraints](#storageclass-topology-constraints)
    - [CSI Node Topology](#csi-node-topology)
    - [Volume Scheduling](#volume-scheduling)
  - [Test Plan](#test-plan)
  - [GA Criteria](#ga-criteria)

## Summary

This is a design proposal to support multiple classes of backend
storage within the same Mayastor cluster, aka Storage Tiering.

## Problem

Multiple storage types can be made available for Mayastor to consume: raw block
devices, logical volumes, and cloud or network attached devices. Each has its
own cost, performance, durability and consistency characteristics.

Cluster administrators want to provide their users a choice between different
storage tiers that they assemble from these storage types.

## Current Solution

Mayastor currently has no concept of tiering and treats all `DiskPools` equally.
When choosing where to provision a volume or replica, the `DiskPool` with the
greatest available capacity is chosen.

Topology has been implemented within the control plane for both nodes
and pools.

CSI Node Topology has been implemented, but it is currently only used to ensure
volumes are placed on nodes that run the Mayastor data plane.

For pools, no functionality has been exposed to users, but the scheduler
supports label-based placement constraints.

## Proposal

At a high level:

1. (done) Introduce a label-based classification mechanism on the `DiskPool`
   resource.
2. Pass classification labels from the `DiskPool` CR to the internal object.
3. Add topology constraints to the StorageClass.
4. Publish `DiskPool` classification topologies on Kubernetes `Node` and `CSINode`
   objects via a new controller.
5. (done) Add a classification-aware scheduling capability to the Mayastor control
   plane.

### DiskPool Classification

Classification data can be added to `DiskPools` using labels. The key prefix
should be pre-defined (e.g. `openebs.io/classification`). Key names are defined
by the cluster administrator, and the value is set to `true`. Example:
`openebs.io/classification/premium=true`.

This allows a node to expose multiple `DiskPools` while supporting Kubernetes
standard label selectors.

No changes are needed to the `DiskPool` CRD. This is the same mechanism used to
influence Pod scheduling in Kubernetes.

Example:

```yaml
apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  labels:
    openebs.io/classification/premium: "true"
  name: diskpool-ksnk5
spec:
  disks:
  - /dev/sdc
  node: node-a
status:
  available: 34321989632
  capacity: 34321989632
  state: Online
  used: 0
```
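
Because the classification is just a label, the same node can expose further
pools in other tiers. For example (the pool name and device below are purely
illustrative):

```yaml
apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  labels:
    openebs.io/classification/standard: "true"
  name: diskpool-standard-0   # illustrative name
spec:
  disks:
  - /dev/sdd                  # illustrative device
  node: node-a
```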

Labels from the `DiskPool` CR need to be passed to the internal `DiskPool`
object when calling the `put_node_pool` API endpoint:

```rust
/// Create or import the pool; on failure, try again. When we reach the max
/// error count we fail the whole operation.
#[tracing::instrument(fields(name = ?self.name(), status = ?self.status), skip(self))]
pub(crate) async fn create_or_import(self) -> Result<ReconcilerAction, Error> {
    ...

    labels.insert(
        String::from(utils::CREATED_BY_KEY),
        String::from(utils::DSP_OPERATOR),
    );

    // Copy any classification labels from the DiskPool CR metadata onto the
    // internal DiskPool object.
    if let Some(cr_labels) = self.metadata.labels.as_ref() {
        for (k, v) in cr_labels.iter() {
            if k.starts_with(utils::CLASSIFICATION_KEY_PREFIX) {
                labels.insert(k.clone(), v.clone());
            }
        }
    }
```
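
The `CLASSIFICATION_KEY_PREFIX` constant above is assumed to sit alongside the
other operator constants (`CREATED_BY_KEY`, `DSP_OPERATOR`); a minimal sketch,
with the exact module and value being assumptions:

```rust
/// Assumed definition; the real constant would live in the operator's utils
/// module next to CREATED_BY_KEY and DSP_OPERATOR.
pub const CLASSIFICATION_KEY_PREFIX: &str = "openebs.io/classification";
```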

### StorageClass Topology Constraints

A StorageClass can be created for each tier. Topology constraints set in the
StorageClass ensure that only nodes with `DiskPools` matching the constraint
are selected for provisioning.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: premium
parameters:
  repl: '1'
  protocol: 'nvmf'
  ioTimeout: '60'
provisioner: io.openebs.csi-mayastor
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: openebs.io/classification/premium
    values:
    - "true"
```
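
Applications then select a tier simply by referencing the corresponding
StorageClass from a PersistentVolumeClaim; a minimal example (the claim name
and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data              # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi            # illustrative size
  storageClassName: premium
```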

### CSI Node Topology

CSI drivers normally register the labels that they use for topology. They do
this by returning the label key and any currently supported values in the CSI
`NodeGetInfo` response. For example, they might return:

```protobuf
message NodeGetInfoResponse {
  node_id = "csi-node://nodeA";
  max_volumes_per_node = 16;
  accessible_topology = {"kubernetes.io/hostname": "nodeA", "openebs.io/classification/premium": "true"}
}
```

The `NodeGetInfo` request is initiated by kubelet when the
`csi-driver-registrar` container starts and registers itself as a kubelet
plugin. Kubelet then calls the `NodeGetInfo` endpoint on the `csi-node`
container; it is only called once.

Topology KV pairs returned in the `NodeGetInfoResponse` are added as labels on
the `Node` object:

```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"disk.csi.azure.com":"aks-storagepool-25014071-vmss000000","io.openebs.csi-mayastor":"csi-node://aks-storagepool-25014071-vmss000000"}'
  labels:
    openebs.io/engine: mayastor
    openebs.io/classification/premium: "true"
  name: aks-storagepool-25014071-vmss000000
spec:
...
```

The topology keys are registered against the CSI driver on the `CSINode` object:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: aks-storagepool-25014071-vmss000000
spec:
  drivers:
  - name: io.openebs.csi-mayastor
    nodeID: csi-node://aks-storagepool-25014071-vmss000000
    topologyKeys:
    - kubernetes.io/hostname
    - openebs.io/classification/premium
```

There are two issues with using the CSI driver registration process to manage
tiering topology:

1. Since `NodeGetInfo` is only called once when `csi-driver-registrar` starts,
   any changes to node topology (e.g. adding a `DiskPool`) will not be reflected
   in the node labels. A mechanism to trigger plugin re-registration is needed.
   This could be achieved by adding a livenessProbe on `csi-driver-registrar`
   that queries an endpoint which fails if an update is required, causing the
   `csi-driver-registrar` container to restart (see the sketch after this list).
2. The `csi-node` process does not currently have knowledge of the `DiskPools`
   exposed on the node, so it would need to request them from the control
   plane's REST endpoint in order to return them in the `NodeGetInfo` response.
   This would require adding communication from the node to the API, which is
   not ideal.
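
If the re-registration route were taken, the probe could look roughly like the
sketch below; the endpoint path and port are hypothetical and would need to be
implemented:

```yaml
# Hypothetical probe on the csi-driver-registrar container: the endpoint would
# start failing once the node's DiskPool classifications no longer match the
# registered topology, forcing a restart and re-registration.
livenessProbe:
  httpGet:
    path: /needs-reregistration   # hypothetical endpoint
    port: 9809                    # hypothetical port
  periodSeconds: 60
```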

Instead, a separate controller is proposed. It will run alongside the existing
diskpool controller.
Review comments on this section:

Contributor:

At the moment we use csi-node topology to control where the diskpools are placed, which as you've pointed out is not quite right (once we remove the locality constraint the csi-node will be able to run on non-"openebs/engine: mayastor" nodes).

What is the actual intent from CSI for the accessible topology?

```protobuf
  // Specifies the list of topologies the provisioned volume MUST be
  // accessible from
```

Since our volumes are accessed via nvme-tcp, maybe this is not intended to control the placement of data replicas?
And in that case, should we separate the application topology and data topology completely?
If so, we could, for example, pass topology information through the storage class's generic parameters and this way we wouldn't need to sync between diskpools and CSINode?

@mittachaitu (Jul 11, 2022):

This is a good point to consider. We can have two different topologies: one used for replica placement, and the other being the Kubernetes allowedTopologies (where the data can be accessed from).

- If we have two different topologies for app and replica placement, it serves both use cases:
  - In the disaggregated storage use case (where a set of nodes in the cluster is dedicated to storage and the others to applications), `storageclass.parameters.dataPlacementTopology` can drive replica placement and allowedTopologies can be used for volume accessibility (where the app can be scheduled).
  - In a normal cluster where all nodes can host both storage and apps, replicas can be placed according to `storageclass.parameters.dataPlacementTopology` and the application can consume the volume from any node.

Contributor Author:

Good points:

- once the locality requirement is removed, the volume will be available on any node
- separate topologies for apps and replica placement

So in effect, we shouldn't re-use CSI placement topology - as long as the replica topology is passed on the CreateVolume call, it can still be stored on the Volume's internal topology and placement works as intended.

This should just mean converting the StorageClass's allowedTopologies to KV pairs in the SC parameters. That should be ok since we don't need any complicated logic.

I'll try this out and update the doc.

Contributor Author (@croomes, Jul 13, 2022):

I tried it out, and I agree this is better:

DiskPool:

```yaml
apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  labels:
    openebs.io/classification: premium
  name: diskpool-jr4nm
spec:
  disks:
  - /dev/sdc
  node: aks-storagepool-25014071-vmss000000
```

StorageClass:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: premium
parameters:
  repl: '1'
  protocol: 'nvmf'
  ioTimeout: '60'
  classification: premium
provisioner: io.openebs.csi-mayastor
volumeBindingMode: WaitForFirstConsumer
```

A small code change in the csi controller's `create_volume` passes the parameter into the volume's labelled topology:

```json
{
    "uuid": "7bb6dbc2-c170-46cd-b627-c0a04d613a24",
    "size": 1073741824,
    "labels": null,
    "num_replicas": 1,
    "status": {
        "Created": "Online"
    },
    "target": {
        "node": "aks-storagepool-25014071-vmss000000",
        "nexus": "7c14d3ff-459e-4aea-967a-c3b4dd47ea12",
        "protocol": "nvmf"
    },
    "policy": {
        "self_heal": true
    },
    "topology": {
        "node": {
            "Explicit": {
                "allowed_nodes": [
                    "aks-nodepool1-14021096-vmss000000",
                    "aks-nodepool1-14021096-vmss000001",
                    "aks-nodepool1-14021096-vmss000002",
                    "aks-storagepool-25014071-vmss000000",
                    "aks-storagepool-25014071-vmss000001",
                    "aks-storagepool-25014071-vmss000002"
                ],
                "preferred_nodes": [
                    "aks-storagepool-25014071-vmss000000",
                    "aks-storagepool-25014071-vmss000001",
                    "aks-storagepool-25014071-vmss000002",
                    "aks-nodepool1-14021096-vmss000000",
                    "aks-nodepool1-14021096-vmss000001",
                    "aks-nodepool1-14021096-vmss000002"
                ]
            }
        },
        "pool": {
            "Labelled": {
                "exclusion": {},
                "inclusion": {
                    "openebs.io/classification": "premium",
                    "openebs.io/created-by": "operator-diskpool"
                }
            }
        }
    },
    "last_nexus_id": "7c14d3ff-459e-4aea-967a-c3b4dd47ea12",
    "operation": null
}
```

Is this more what you were thinking?

Contributor:

That's great.
Do you see classification as a reserved key word? I think it'd be good to keep it generic so you can use whatever labels you want?

Contributor Author:

I was thinking of it as a reserved keyword, but it doesn't need to be. The issue is knowing which SC parameter to copy from the CreateVolume request to the volume.

How about having a default of `openebs.io/classification`, but with an optional SC param `openebs.io/classification_key` that can override it? I think it's fine to have this configurable on the SC. It could also be a flag on the CSI controller, but that seems much less flexible.

And since naming is hard... I don't know whether classification is the right term, but I think it would be better to use the label prefix (openebs.io/) everywhere to avoid confusion. We definitely want to use the prefix for the labels.

Member:

> This is a good point to consider. We can have two different topologies: one used for replica placement, and the other being the Kubernetes allowedTopologies (where the data can be accessed from).
>
> - If we have two different topologies for app and replica placement, it serves both use cases:
>   - In the disaggregated storage use case (where a set of nodes in the cluster is dedicated to storage and the others to applications), `storageclass.parameters.dataPlacementTopology` can drive replica placement and allowedTopologies can be used for volume accessibility (where the app can be scheduled).
>   - In a normal cluster where all nodes can host both storage and apps, replicas can be placed according to `storageclass.parameters.dataPlacementTopology` and the application can consume the volume from any node.

Since we have agreed on this, we will not use allowed topologies (i.e. the data-accessibility topology) for data placement. The data-placement topology can be a labelled topology built from the StorageClass parameters. This also gives us the benefit of using pools on nodes newly added to the cluster, since we allow scaling a volume up after creation; that would otherwise be a problem if we used the explicit topology filled in from the allowed topology at creation time.

#275 -> PR to separate the accessibility topology from data placement.

The new controller will watch for `DiskPool` changes. If the `DiskPools` on a
node have classification labels that do not match the `CSINode` topology keys
and/or `Node` labels, the `CSINode` and `Node` objects will be updated.

There would be no need to alter the plugin registration process or add tiering
topology to the `NodeGetInfo` response as the controller would ensure the end
result is the same.

A feature flag could be used to enable/disable the controller.
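
For illustration, a minimal sketch of the label-sync step such a controller
might perform, assuming the `kube` and `k8s-openapi` crates; the function name,
the `pool_labels` input and the watch loop that gathers labels from the node's
`DiskPools` are assumptions, not existing code:

```rust
use std::collections::BTreeMap;

use k8s_openapi::api::core::v1::Node;
use kube::{
    api::{Api, Patch, PatchParams},
    Client,
};
use serde_json::json;

/// Label prefix assumed by this proposal (see DiskPool Classification).
const CLASSIFICATION_KEY_PREFIX: &str = "openebs.io/classification";

/// Ensure the Node carries the classification labels exposed by the DiskPools
/// scheduled on it. `pool_labels` is the union of labels across the node's
/// DiskPools, gathered by a watch on DiskPool resources (not shown).
async fn sync_node_classification_labels(
    client: Client,
    node_name: &str,
    pool_labels: &BTreeMap<String, String>,
) -> Result<(), kube::Error> {
    // Keep only the classification labels.
    let labels: BTreeMap<&String, &String> = pool_labels
        .iter()
        .filter(|(k, _)| k.starts_with(CLASSIFICATION_KEY_PREFIX))
        .collect();

    // Merge-patch the desired labels onto the Node. Removing stale labels
    // would require setting them to null in the patch (not shown).
    let patch = json!({ "metadata": { "labels": labels } });
    let nodes: Api<Node> = Api::all(client);
    nodes
        .patch(node_name, &PatchParams::default(), &Patch::Merge(&patch))
        .await?;
    Ok(())
}
```

A merge patch keeps the sketch simple; the real controller would also need to
remove stale classification labels and mirror the keys onto the `CSINode`
object's `topologyKeys`.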

### Volume Scheduling

No changes to volume scheduling are expected. Topology requirements are passed
via CSI `CreateVolume` requests and existing placement code is topology-aware.

## Test Plan

Topology-aware placement is already covered by existing tests but may need extending.

## GA Criteria

TBC