feat(storage tiers): add design proposal #268
croomes wants to merge 1 commit into openebs:develop from croomes:storage-tiers-design
---
title: Storage Tiering
authors:
  - "@croomes"
owners:

creation-date: 2022-07-04
last-updated: 2022-07-04
---

# Storage Tiering

## Table of Contents

- [Storage Tiering](#storage-tiering)
  - [Table of Contents](#table-of-contents)
  - [Summary](#summary)
  - [Problem](#problem)
  - [Current Solution](#current-solution)
  - [Proposal](#proposal)
    - [DiskPool Classification](#diskpool-classification)
    - [StorageClass Topology Constraints](#storageclass-topology-constraints)
    - [CSI Node Topology](#csi-node-topology)
    - [Volume Scheduling](#volume-scheduling)
  - [Test Plan](#test-plan)
  - [GA Criteria](#ga-criteria)

## Summary

This is a design proposal to support multiple classes of backend
storage within the same Mayastor cluster, aka Storage Tiering.

## Problem

Multiple storage types can be made available for Mayastor to consume: raw block
devices, logical volumes, cloud or network attached devices. Each has its own
cost, performance, durability and consistency characteristics.

Cluster administrators want to offer their users a choice between different
storage tiers that they assemble from these storage types.

## Current Solution

Mayastor currently has no concept of tiering and treats all `DiskPools` equally.
When choosing where to provision a volume or replica, the `DiskPool` with the
greatest available capacity is chosen.

Topology has been implemented within the control plane for both nodes
and pools.

CSI Node Topology has been implemented, but it is currently only used to ensure
volumes are placed on nodes that run the Mayastor data plane.

For pools, no functionality has been exposed to users, but the scheduler
supports label-based placement constraints.

## Proposal

At a high level:

1. (done) Introduce a label-based classification mechanism on the `DiskPool`
   resource.
2. Pass classification labels from the `DiskPool` custom resource to the
   internal object.
3. Add topology constraints to the StorageClass.
4. Publish `DiskPool` classification topologies on Kubernetes `Node` and `CSINode`
   objects via a new controller.
5. (done) Add a classification-aware scheduling capability to the Mayastor control
   plane.

### DiskPool Classification

Classification data can be added to `DiskPools` using labels. The key prefix
should be pre-defined (e.g. `openebs.io/classification`). Key names are defined
by the cluster administrator, and the value set to `true`. Example:
`openebs.io/classification/premium=true`.

This allows a node to expose multiple `DiskPools` while supporting Kubernetes
standard label selectors.
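
For example, a controller or CLI could select only the pools in a given tier
with a standard label selector. A minimal sketch using the kube-rs client is
shown below; the dynamic-object lookup is illustrative only and is not part of
the existing codebase.

```rust
use kube::{
    api::{Api, ListParams},
    core::{ApiResource, DynamicObject, GroupVersionKind},
    Client,
};

// Illustrative only: list the DiskPools in the "premium" tier using a
// standard Kubernetes label selector. A DynamicObject is used so the sketch
// does not depend on the generated DiskPool CRD type.
async fn premium_pools(client: Client) -> Result<Vec<DynamicObject>, kube::Error> {
    let gvk = GroupVersionKind::gvk("openebs.io", "v1alpha1", "DiskPool");
    let resource = ApiResource::from_gvk(&gvk);
    let pools: Api<DynamicObject> = Api::all_with(client, &resource);
    let selector = ListParams::default().labels("openebs.io/classification/premium=true");
    Ok(pools.list(&selector).await?.items)
}
```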

No changes are needed to the `DiskPool` CRD. This is the same mechanism used to
influence Pod scheduling in Kubernetes.

Example:

```yaml
apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  labels:
    openebs.io/classification/premium: "true"
  name: diskpool-ksnk5
spec:
  disks:
  - /dev/sdc
  node: node-a
status:
  available: 34321989632
  capacity: 34321989632
  state: Online
  used: 0
```

Labels from the `DiskPool` custom resource need to be passed to the internal
`DiskPool` object when calling the `put_node_pool` API endpoint:

```rust
/// Create or import the pool, on failure try again. When we reach max error
/// count we fail the whole thing.
#[tracing::instrument(fields(name = ?self.name(), status = ?self.status), skip(self))]
pub(crate) async fn create_or_import(self) -> Result<ReconcilerAction, Error> {
    // ...

    labels.insert(
        String::from(utils::CREATED_BY_KEY),
        String::from(utils::DSP_OPERATOR),
    );

    // Add any classification labels from the CR metadata to the DiskPool.
    // `metadata.labels` is optional, so iterate over it without unwrapping.
    for (k, v) in self.metadata.labels.iter().flatten() {
        if k.starts_with(utils::CLASSIFICATION_KEY_PREFIX) {
            labels.insert(k.clone(), v.clone());
        }
    }
```
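
`utils::CLASSIFICATION_KEY_PREFIX` does not exist yet. A minimal sketch of the
proposed constant, plus a helper built on it, is shown below; the constant's
value mirrors the prefix proposed above and the helper name is illustrative.

```rust
use std::collections::BTreeMap;

/// Proposed constant; the name and value are assumptions of this design.
pub const CLASSIFICATION_KEY_PREFIX: &str = "openebs.io/classification";

/// Illustrative helper: return only the labels that carry classification
/// data, leaving any other operator-managed labels untouched.
pub fn classification_labels(labels: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    labels
        .iter()
        .filter(|(k, _)| k.starts_with(CLASSIFICATION_KEY_PREFIX))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```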

### StorageClass Topology Constraints

A StorageClass can be created for each tier. Topology constraints set in the
StorageClass will ensure that only nodes with `DiskPools` matching the
constraint will be selected for provisioning.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: premium
parameters:
  repl: '1'
  protocol: 'nvmf'
  ioTimeout: '60'
provisioner: io.openebs.csi-mayastor
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: openebs.io/classification/premium
    values:
    - "true"
```

### CSI Node Topology

CSI drivers normally register the labels that they use for topology. They do
this by returning the label key and any currently supported values in the CSI
`NodeGetInfo` response. For example, they might return:

```protobuf
message NodeGetInfoResponse {
  node_id = "csi-node://nodeA";
  max_volumes_per_node = 16;
  accessible_topology = {"kubernetes.io/hostname": "nodeA", "openebs.io/classification/premium": "true"}
}
```

The `NodeGetInfo` request is initiated by kubelet when the
`csi-driver-registrar` container starts and registers itself as a kubelet
plugin. It is only called once. It calls the `NodeGetInfo` endpoint on the
`csi-node` container.

Topology KV pairs returned in the `NodeGetInfoResponse` are added as labels on
the `Node` object:

```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"disk.csi.azure.com":"aks-storagepool-25014071-vmss000000","io.openebs.csi-mayastor":"csi-node://aks-storagepool-25014071-vmss000000"}'
  labels:
    openebs.io/engine: mayastor
    openebs.io/classification/premium: "true"
  name: aks-storagepool-25014071-vmss000000
spec:
  ...
```

The topology keys are registered against the CSI driver on the `CSINode` object:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: aks-storagepool-25014071-vmss000000
spec:
  drivers:
  - name: io.openebs.csi-mayastor
    nodeID: csi-node://aks-storagepool-25014071-vmss000000
    topologyKeys:
    - kubernetes.io/hostname
    - openebs.io/classification/premium
```

There are two issues with using the CSI driver registration process to manage
tiering topology:

1. Since `NodeGetInfo` is only called once when `csi-driver-registrar` starts,
   any changes to node topology (e.g. adding a `DiskPool`) will not be reflected
   on the node labels. A mechanism to trigger plugin re-registration is needed.
   This could be achieved by adding a livenessProbe on `csi-driver-registrar`
   that queries an endpoint that fails if an update is required. This would
   cause the `csi-driver-registrar` to restart.
2. The `csi-node` process does not currently have knowledge of the `DiskPools`
   exposed on the node, so it would need to request them from the control plane's
   REST endpoint so they can be returned in the `NodeGetInfo` response. This would
   require adding communication from the node to the API, which is not ideal.

Instead, a separate controller is proposed. It will run alongside the diskpool
controller and watch for `DiskPool` changes. If the `DiskPools` on a node have
classification labels and they do not match the `CSINode` topology keys and/or
`Node` labels, then the `CSINode`/`Node` objects will be updated.
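
A minimal sketch of the `Node` label update, assuming the controller is built
with kube-rs like the operator excerpt above; the function name and the
merge-patch strategy are illustrative only:

```rust
use k8s_openapi::api::core::v1::Node;
use kube::api::{Api, Patch, PatchParams};
use serde_json::json;
use std::collections::BTreeMap;

/// Illustrative only: merge the classification labels gathered from a node's
/// DiskPools onto the matching `Node` object. Pruning stale labels and
/// updating `CSINode.spec.drivers[].topologyKeys` would follow the same shape.
async fn sync_node_classification(
    client: kube::Client,
    node_name: &str,
    labels: &BTreeMap<String, String>,
) -> Result<(), kube::Error> {
    let nodes: Api<Node> = Api::all(client);
    let patch = json!({ "metadata": { "labels": labels } });
    nodes
        .patch(node_name, &PatchParams::default(), &Patch::Merge(&patch))
        .await?;
    Ok(())
}
```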

There would be no need to alter the plugin registration process or add tiering
topology to the `NodeGetInfo` response, as the controller would ensure the end
result is the same.

A feature flag could be used to enable/disable the controller.

### Volume Scheduling

No changes to volume scheduling are expected. Topology requirements are passed
via CSI `CreateVolume` requests and existing placement code is topology-aware.
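
For reference, the requirements that the CSI provisioner derives from the
`premium` StorageClass's `allowedTopologies` arrive on `CreateVolume` roughly
as sketched below. The structs are local stand-ins mirroring the CSI spec
messages, not the generated bindings used by the control plane.

```rust
use std::collections::HashMap;

// Stand-ins mirroring the CSI spec's Topology and TopologyRequirement messages.
struct Topology {
    segments: HashMap<String, String>,
}

struct TopologyRequirement {
    requisite: Vec<Topology>,
    preferred: Vec<Topology>,
}

// Roughly the accessibility requirements produced for the "premium" tier.
fn premium_requirements() -> TopologyRequirement {
    let mut segments = HashMap::new();
    segments.insert(
        "openebs.io/classification/premium".to_string(),
        "true".to_string(),
    );
    TopologyRequirement {
        requisite: vec![Topology { segments: segments.clone() }],
        preferred: vec![Topology { segments }],
    }
}
```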

## Test Plan

Topology-aware placement is already covered but may need extending.

## GA Criteria

TBC

At the moment we use csi-node topology to control where the diskpools are placed, which as you've pointed out is not quite right (once we remove the locality constraint the csi-node will be able to run on non-"openebs/engine: mayastor" nodes).
What is the actual intent from CSI for the accessible topology?
Since our volumes are accessed via nvme-tcp, maybe this is not intended to control the placement of data replicas?
And in that case, should we separate the application topology and data topology completely?
If so, we could, for example, pass topology information through the storage class's generic parameters, and this way we wouldn't need to sync between `diskpools` and `CSINode`?

This is a good point to consider. We can have two different topologies: one used for the placement of replicas, and the other the K8s `allowedTopologies` (where data can be accessible from). `storageclass.parameters.dataPlacementTopology` can provision replicas based on topology, and `AllowedTopology` can be used for accessibility of the volume (where the app can schedule). With only `storageclass.parameters.dataPlacementTopology` set, the application can consume the volume from any node.

Good points: So in effect, we shouldn't re-use CSI placement topology - as long as the replica topology is passed on the `CreateVolume` call, it can still be stored on the Volume's internal topology and placement works as intended.
This should just mean converting the StorageClass's `allowedTopologies` to KV pairs in the SC parameters. That should be ok since we don't need any complicated logic.
I'll try this out and update the doc.

I tried it out, and I agree this is better:
DiskPool:
StorageClass:
A small code change in the csi controller's `create_volume` to pass the parameter into the volume's labelledTopology:
Is this more what you were thinking?
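
The referenced snippets are not reproduced in this capture. Purely to illustrate the idea, a sketch of copying classification parameters from the StorageClass into the volume's labelled topology could look like the following; `LabelledTopology`, `pool_labels` and `topology_from_sc_params` are hypothetical stand-ins, not the actual control-plane types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the labelled topology carried on the volume
/// creation request to the control plane.
#[derive(Default)]
struct LabelledTopology {
    pool_labels: HashMap<String, String>,
}

/// Copy any classification parameters received in the CSI CreateVolume
/// request (i.e. the StorageClass parameters) onto the volume topology.
fn topology_from_sc_params(sc_params: &HashMap<String, String>) -> LabelledTopology {
    let mut topology = LabelledTopology::default();
    for (k, v) in sc_params {
        if k.starts_with("openebs.io/classification") {
            topology.pool_labels.insert(k.clone(), v.clone());
        }
    }
    topology
}
```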

That's great.
Do you see classification as a reserved keyword? I think it'd be good to keep it generic so you can use whatever labels you want?

I was thinking of it being a reserved keyword, but it doesn't need to be. The issue is knowing which SC parameter to copy from the `CreateVolume` request to the volume.
How about having a default of `openebs.io/classification`, but with an optional SC param `openebs.io/classification_key` that can override it? I think it's fine to have this configurable on the SC. It could also be a flag on the CSI controller, but that seems much less flexible.
And since naming is hard... I don't know whether classification is the right term, but I think it would be better to use the label prefix (openebs.io/) everywhere to avoid confusion. We definitely want to use the prefix for the labels.
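
A minimal sketch of that lookup, using the parameter names suggested above (the function name is illustrative):

```rust
use std::collections::HashMap;

/// Resolve which label prefix to copy from the StorageClass parameters,
/// allowing the SC to override the default via "openebs.io/classification_key".
fn classification_prefix(sc_params: &HashMap<String, String>) -> String {
    sc_params
        .get("openebs.io/classification_key")
        .cloned()
        .unwrap_or_else(|| "openebs.io/classification".to_string())
}
```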

Since we have agreed upon this, we would not be using `allowed topologies`, i.e. the data-accessibility topology, for data placement. The data-placement topology can be the labelled topology incorporated from the storage class `parameters`. This also gives us the benefit of being able to use pools on nodes newly added to the cluster, since we allow scaling up a volume after creation, which would otherwise be a problem if we used the explicit topology filled from the allowed topology at creation time.
#275 -> PR to separate the accessibility from data-placement.