R00ki : Taking Rook Ceph from localhost to Production 🧪


Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. TODO
  4. Known Issues
  5. References
  6. Contributing
  7. License

About The Project

Kubernetes Storage. Rook. The Boss Fight. Still a bit messy. But it works. Most of the time.

There must be a reason Red Hat OpenShift Data Foundation is expensive ...

Now seriously: storage is one of the most critical pieces of any platform. Many workloads are stateful, and not every Kubernetes infrastructure solves the problem nicely. That is where I found myself a few times in the past: we were given virtual machines with basic disks attached - VMware VMDKs in my case. Customers demanded, you name it, everything: RWX/RWO Volumes, S3, Snapshots, Backup/Recovery - superfast and always available. The code reflects these roots.

Disclaimer: We started by borrowing proven things from the Rook project and adapted them as we went along.

Demo creating a Minikube cluster and running a few tests 🪄🎩🐰
 make apply-r00ki-aio test-csi-io test-csi-snapshot test-velero


Goals

  • Awesome local-first Rook Ceph Dev Experience
  • First-class Observability
  • Fail early and loud (Notifications)
  • Simplicity (yes, really)
  • Composability
  • Target minikube, vanilla Kubernetes, and OpenShift
  • Add the Rook Ops bits not covered by the Operator
  • Declarative trumps Imperative

Non Goals

Decisions

  • ArgoCD is great, but helmfile appears even better for our use case (see the sketch after this list)
  • We aim for first-class citizens: for Rook, it's the Helm charts; for some operators, it's OLM Subscriptions.
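
A minimal helmfile sketch of the shape this decision implies - the release names and the operator/cluster chart split are illustrative, not the project's actual helmfile:

 # helmfile.yaml - illustrative sketch, not the shipped file
 repositories:
   - name: rook-release
     url: https://charts.rook.io/release
 releases:
   - name: rook-ceph                 # the Rook operator chart
     namespace: rook-ceph
     chart: rook-release/rook-ceph
   - name: rook-ceph-cluster         # CephCluster, pools, storage classes
     namespace: rook-ceph
     chart: rook-release/rook-ceph-cluster
     needs:
       - rook-ceph/rook-ceph         # deploy the operator first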

Features

We cover:

  • Single (All-in-One Cluster) Deployments targeting minikube and Production Kubernetes (including OpenShift)
  • Two Cluster Deployments (Service and Consumer) targeting minikube and Production Kubernetes (including OpenShift)
  • Kube-Prometheus bits all wired up - including alerts (see the sketch after this list)
  • Shiny Dashboards (including Grafana)
  • Seamless integration with ArgoCD, specifically deas/argcocd-conductor
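
To give an idea of what "wired up" means, here is a minimal PrometheusRule sketch of the kind of alert involved - the rule name, threshold, and duration are illustrative, not the shipped rules:

 # Illustrative alert on the Ceph mgr health metric (0=OK, 1=WARN, 2=ERR)
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
   name: ceph-health            # hypothetical name
   namespace: rook-ceph
 spec:
   groups:
     - name: ceph.health
       rules:
         - alert: CephHealthError
           expr: ceph_health_status == 2
           for: 5m
           labels:
             severity: critical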

(back to top)

Getting Started

Some opinions first:

  • Ceph is complex
  • Automating Trust Relationships is hard

Prerequisites

  • make
  • minikube
  • kubectl
  • helmfile

Usage

Run

make

shows help for basic tasks and gives you an idea of where to start.

We want the lifecycle of things (Create/Destroy) to be as fast as possible, so we ship support for leveraging registry mirrors via pull-through caching (see the sketch below).
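
A sketch of what that looks like in practice - the container name, port, and host IP are hypothetical; adjust them to your environment:

 # Run a local pull-through cache of Docker Hub ...
 docker run -d --name mirror -p 5000:5000 \
   -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io registry:2
 # ... and point minikube at it (host IP as seen from the minikube VM)
 minikube start --registry-mirror="http://192.168.122.1:5000"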

(back to top)

TODO

  • Use dyff to separate out value files?
  • Separate out Observability, add Logging and Alerting
  • Support for Mon v2
  • Support for TLS/encryption
  • Replace imperative bits by declarative ones
  • Introduce Pentesting - maybe even Chaos Scenarios
  • Improve Observability / Include Alerts
  • Smoketests in CI
  • Clean up bits around TODO tags sprinkled across the code
  • Use LVM instead of raw disks/partitions?
  • Performance: How/When do multiple disks per node make sense?
  • Exercise Upgrade/Recreate and Disaster Recovery + build tests
  • Introduce unhappy-path tests - likely leveraging Litmus
  • Proper cascaded removal of CephCluster?
  • Finding and cleaning up orphans (volumes or buckets)
  • Go deeper with nix/devenv - maybe even replace mise

(back to top)

Known Issues

  • "To sum up: the Docker daemon does not currently support multiple registry mirrors ..." -> minikube start --registry-mirror="http://yourmirror"
  • DNS (dnsmasq) on the minikube KVM network is slow; S3 requests time out. Patching CoreDNS gets around the issue (see the sketch below).
  • mons on port 3300 break CephFS mounts (workaround: use port 6789 via ROOK_EXTERNAL_CEPH_MON_DATA; see the sketch below). Example failure:

     2024-12-16T16:56:02.784+0000 7fd593d1c000 -1 failed for service _ceph-mon._tcp
     mount error: no mds (Metadata Server) is up. The cluster might be laggy, or you may not be authorized
     Warning  FailedMount  2m25s  kubelet  (combined from similar events): MountVolume.MountDevice failed for volume "pvc-026c86e8-9ee4-4261-a7e4-083011b80494" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 192.168.122.231:3300:/volumes/csi/csi-vol-7072e90c-5d6b-477b-bbab-655b76d0425f/e8d828a3-a1ad-4a22-9b36-7d5bc9fe9026 /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com/f172f41f387d01c38f46e71a4097304d70c35494e81e1c8a070549de56234790/globalmount -o name=csi-cephfs-node,secretfile=/tmp/csi/keys/keyfile-2436134297,mds_namespace=myfs,_netdev] stderr: unable to get monitor info from DNS SRV with service name: ceph-mon
  • Looking up Monitors through DNS
  • OperatorHub Subscription is outdated - stuck at 1.1.1
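
The CoreDNS workaround as a sketch - the upstream resolver is hypothetical; the point is to bypass the slow libvirt dnsmasq:

 # Change the forward target in the Corefile, e.g.
 # "forward . /etc/resolv.conf" -> "forward . 1.1.1.1", then restart:
 kubectl -n kube-system edit configmap coredns
 kubectl -n kube-system rollout restart deployment coredns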
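And the mon port workaround as a sketch - the mon name and IP are illustrative (the IP is taken from the log above), in the name=ip:port form that Rook's external-cluster import expects:

 # Import external cluster data with the v1 mon port (6789) instead of 3300
 export ROOK_EXTERNAL_CEPH_MON_DATA="a=192.168.122.231:6789"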

References

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)
