Setting up Rook Ceph #350

Bpazy opened this issue Jan 3, 2025 · 3 comments

Bpazy commented Jan 3, 2025

One open question for the k8s cluster: how should storage for stateful applications be handled?

To solve this, I plan to set up a Ceph cluster: its CephFS provides a standard POSIX filesystem, and Rook-Ceph uses Kubernetes itself to simplify deploying Ceph, so I'll go with Rook-Ceph directly.


Bpazy commented Jan 8, 2025

1. Install the Rook Ceph operator first

$ helm repo add rook-release https://charts.rook.io/release
$ helm upgrade --install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph -f charts/rook-values.yaml

The contents of charts/rook-values.yaml are as follows (all values are the chart defaults):

# Default values for rook-ceph-operator
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image:
  # -- Image
  repository: docker.io/rook/ceph
  # -- Image tag
  # @default -- `master`
  tag: v1.16.1
  # -- Image pull policy
  pullPolicy: IfNotPresent

crds:
  # -- Whether the helm chart should create and update the CRDs. If false, the CRDs must be
  # managed independently with deploy/examples/crds.yaml.
  # **WARNING** Only set during first deployment. If later disabled the cluster may be DESTROYED.
  # If the CRDs are deleted in this case, see
  # [the disaster recovery guide](https://rook.io/docs/rook/latest/Troubleshooting/disaster-recovery/#restoring-crds-after-deletion)
  # to restore them.
  enabled: true

# -- Pod resource requests & limits
resources:
  limits:
    memory: 512Mi
  requests:
    cpu: 200m
    memory: 128Mi

# -- Kubernetes [`nodeSelector`](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector) to add to the Deployment.
nodeSelector: 
# Constraint rook-ceph-operator Deployment to nodes with label `disktype: ssd`.
# For more info, see https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector
#  disktype: ssd

# -- List of Kubernetes [`tolerations`](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) to add to the Deployment.
tolerations:

# -- Delay to use for the `node.kubernetes.io/unreachable` pod failure toleration to override
# the Kubernetes default of 5 minutes
unreachableNodeTolerationSeconds: 5

# -- Whether the operator should watch cluster CRD in its own namespace or not
currentNamespaceOnly: false

# -- Pod annotations
annotations: {}

# -- Global log level for the operator.
# Options: `ERROR`, `WARNING`, `INFO`, `DEBUG`
logLevel: INFO

# -- If true, create & use RBAC resources
rbacEnable: true

rbacAggregate:
  # -- If true, create a ClusterRole aggregated to [user facing roles](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#user-facing-roles) for objectbucketclaims
  enableOBCs: false

# -- If true, create & use PSP resources
pspEnable: false

# -- Set the priority class for the rook operator deployment if desired
priorityClassName:

# -- Set the container security context for the operator
containerSecurityContext:
  runAsNonRoot: true
  runAsUser: 2016
  runAsGroup: 2016
  capabilities:
    drop: ["ALL"]
# -- If true, loop devices are allowed to be used for osds in test clusters
allowLoopDevices: false

# Settings for whether to disable the drivers or other daemons if they are not
# needed
csi:
  # -- Enable Ceph CSI RBD driver
  enableRbdDriver: true
  # -- Enable Ceph CSI CephFS driver
  enableCephfsDriver: true
  # -- Disable the CSI driver.
  disableCsiDriver: "false"

  # -- Enable host networking for CSI CephFS and RBD nodeplugins. This may be necessary
  # in some network configurations where the SDN does not provide access to an external cluster or
  # there is significant drop in read/write performance
  enableCSIHostNetwork: true
  # -- Enable Snapshotter in CephFS provisioner pod
  enableCephfsSnapshotter: true
  # -- Enable Snapshotter in NFS provisioner pod
  enableNFSSnapshotter: true
  # -- Enable Snapshotter in RBD provisioner pod
  enableRBDSnapshotter: true
  # -- Enable Host mount for `/etc/selinux` directory for Ceph CSI nodeplugins
  enablePluginSelinuxHostMount: false
  # -- Enable Ceph CSI PVC encryption support
  enableCSIEncryption: false

  # -- Enable volume group snapshot feature. This feature is
  # enabled by default as long as the necessary CRDs are available in the cluster.
  enableVolumeGroupSnapshot: true
  # -- PriorityClassName to be set on csi driver plugin pods
  pluginPriorityClassName: system-node-critical

  # -- PriorityClassName to be set on csi driver provisioner pods
  provisionerPriorityClassName: system-cluster-critical

  # -- Policy for modifying a volume's ownership or permissions when the RBD PVC is being mounted.
  # supported values are documented at https://kubernetes-csi.github.io/docs/support-fsgroup.html
  rbdFSGroupPolicy: "File"

  # -- Policy for modifying a volume's ownership or permissions when the CephFS PVC is being mounted.
  # supported values are documented at https://kubernetes-csi.github.io/docs/support-fsgroup.html
  cephFSFSGroupPolicy: "File"

  # -- Policy for modifying a volume's ownership or permissions when the NFS PVC is being mounted.
  # supported values are documented at https://kubernetes-csi.github.io/docs/support-fsgroup.html
  nfsFSGroupPolicy: "File"

  # -- OMAP generator generates the omap mapping between the PV name and the RBD image
  # which helps CSI to identify the rbd images for CSI operations.
  # `CSI_ENABLE_OMAP_GENERATOR` needs to be enabled when we are using rbd mirroring feature.
  # By default OMAP generator is disabled and when enabled, it will be deployed as a
  # sidecar with CSI provisioner pod, to enable set it to true.
  enableOMAPGenerator: false

  # -- Set CephFS Kernel mount options to use https://docs.ceph.com/en/latest/man/8/mount.ceph/#options.
  # Set to "ms_mode=secure" when connections.encrypted is enabled in CephCluster CR
  cephFSKernelMountOptions:

  # -- Enable adding volume metadata on the CephFS subvolumes and RBD images.
  # Not all users might be interested in getting volume/snapshot details as metadata on CephFS subvolume and RBD images.
  # Hence enable metadata is false by default
  enableMetadata: false

  # -- Set replicas for csi provisioner deployment
  provisionerReplicas: 2

  # -- Cluster name identifier to set as metadata on the CephFS subvolume and RBD images. This will be useful
  # in cases like for example, when two container orchestrator clusters (Kubernetes/OCP) are using a single ceph cluster
  clusterName:

  # -- Set logging level for cephCSI containers maintained by the cephCSI.
  # Supported values from 0 to 5. 0 for general useful logs, 5 for trace level verbosity.
  logLevel: 0

  # -- Set logging level for Kubernetes-csi sidecar containers.
  # Supported values from 0 to 5. 0 for general useful logs (the default), 5 for trace level verbosity.
  # @default -- `0`
  sidecarLogLevel:

  # -- CSI driver name prefix for cephfs, rbd and nfs.
  # @default -- `namespace name where rook-ceph operator is deployed`
  csiDriverNamePrefix:

  # -- CSI RBD plugin daemonset update strategy, supported values are OnDelete and RollingUpdate
  # @default -- `RollingUpdate`
  rbdPluginUpdateStrategy:

  # -- A maxUnavailable parameter of CSI RBD plugin daemonset update strategy.
  # @default -- `1`
  rbdPluginUpdateStrategyMaxUnavailable:

  # -- CSI CephFS plugin daemonset update strategy, supported values are OnDelete and RollingUpdate
  # @default -- `RollingUpdate`
  cephFSPluginUpdateStrategy:

  # -- A maxUnavailable parameter of CSI cephFS plugin daemonset update strategy.
  # @default -- `1`
  cephFSPluginUpdateStrategyMaxUnavailable:

  # -- CSI NFS plugin daemonset update strategy, supported values are OnDelete and RollingUpdate
  # @default -- `RollingUpdate`
  nfsPluginUpdateStrategy:

  # -- Set GRPC timeout for csi containers (in seconds). It should be >= 120. If this value is not set or is invalid, it defaults to 150
  grpcTimeoutInSeconds: 150

  # -- Burst to use while communicating with the kubernetes apiserver.
  kubeApiBurst:

  # -- QPS to use while communicating with the kubernetes apiserver.
  kubeApiQPS:

  # -- The volume of the CephCSI RBD plugin DaemonSet
  csiRBDPluginVolume:
  #  - name: lib-modules
  #    hostPath:
  #      path: /run/booted-system/kernel-modules/lib/modules/
  #  - name: host-nix
  #    hostPath:
  #      path: /nix

  # -- The volume mounts of the CephCSI RBD plugin DaemonSet
  csiRBDPluginVolumeMount:
  #  - name: host-nix
  #    mountPath: /nix
  #    readOnly: true

  # -- The volume of the CephCSI CephFS plugin DaemonSet
  csiCephFSPluginVolume:
  #  - name: lib-modules
  #    hostPath:
  #      path: /run/booted-system/kernel-modules/lib/modules/
  #  - name: host-nix
  #    hostPath:
  #      path: /nix

  # -- The volume mounts of the CephCSI CephFS plugin DaemonSet
  csiCephFSPluginVolumeMount:
  #  - name: host-nix
  #    mountPath: /nix
  #    readOnly: true

  # -- CEPH CSI RBD provisioner resource requirement list
  # csi-omap-generator resources will be applied only if `enableOMAPGenerator` is set to `true`
  # @default -- see values.yaml
  csiRBDProvisionerResource: |
    - name : csi-provisioner
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-resizer
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-attacher
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-snapshotter
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-rbdplugin
      resource:
        requests:
          memory: 512Mi
        limits:
          memory: 1Gi
    - name : csi-omap-generator
      resource:
        requests:
          memory: 512Mi
          cpu: 250m
        limits:
          memory: 1Gi
    - name : liveness-prometheus
      resource:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi

  # -- CEPH CSI RBD plugin resource requirement list
  # @default -- see values.yaml
  csiRBDPluginResource: |
    - name : driver-registrar
      resource:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi
    - name : csi-rbdplugin
      resource:
        requests:
          memory: 512Mi
          cpu: 250m
        limits:
          memory: 1Gi
    - name : liveness-prometheus
      resource:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi

  # -- CEPH CSI CephFS provisioner resource requirement list
  # @default -- see values.yaml
  csiCephFSProvisionerResource: |
    - name : csi-provisioner
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-resizer
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-attacher
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-snapshotter
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-cephfsplugin
      resource:
        requests:
          memory: 512Mi
          cpu: 250m
        limits:
          memory: 1Gi
    - name : liveness-prometheus
      resource:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi

  # -- CEPH CSI CephFS plugin resource requirement list
  # @default -- see values.yaml
  csiCephFSPluginResource: |
    - name : driver-registrar
      resource:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi
    - name : csi-cephfsplugin
      resource:
        requests:
          memory: 512Mi
          cpu: 250m
        limits:
          memory: 1Gi
    - name : liveness-prometheus
      resource:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi

  # -- CEPH CSI NFS provisioner resource requirement list
  # @default -- see values.yaml
  csiNFSProvisionerResource: |
    - name : csi-provisioner
      resource:
        requests:
          memory: 128Mi
          cpu: 100m
        limits:
          memory: 256Mi
    - name : csi-nfsplugin
      resource:
        requests:
          memory: 512Mi
          cpu: 250m
        limits:
          memory: 1Gi
    - name : csi-attacher
      resource:
        requests:
          memory: 512Mi
          cpu: 250m
        limits:
          memory: 1Gi

  # -- CEPH CSI NFS plugin resource requirement list
  # @default -- see values.yaml
  csiNFSPluginResource: |
    - name : driver-registrar
      resource:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi
    - name : csi-nfsplugin
      resource:
        requests:
          memory: 512Mi
          cpu: 250m
        limits:
          memory: 1Gi

  # Set provisionerTolerations and provisionerNodeAffinity for provisioner pod.
  # The CSI provisioner would be best to start on the same nodes as other ceph daemons.

  # -- Array of tolerations in YAML format which will be added to CSI provisioner deployment
  provisionerTolerations:
  #    - key: key
  #      operator: Exists
  #      effect: NoSchedule

  # -- The node labels for affinity of the CSI provisioner deployment [^1]
  provisionerNodeAffinity: #key1=value1,value2; key2=value3
    # requiredDuringSchedulingIgnoredDuringExecution:
    #   nodeSelectorTerms:
    #     - matchExpressions:
    #         - key: role
    #           operator: In
    #           values:
    #             - ceph
  # Set pluginTolerations and pluginNodeAffinity for plugin daemonset pods.
  # The CSI plugins need to be started on all the nodes where the clients need to mount the storage.

  # -- Array of tolerations in YAML format which will be added to CephCSI plugin DaemonSet
  pluginTolerations:
  #    - key: key
  #      operator: Exists
  #      effect: NoSchedule

  # -- The node labels for affinity of the CephCSI RBD plugin DaemonSet [^1]
  pluginNodeAffinity: # key1=value1,value2; key2=value3
    # requiredDuringSchedulingIgnoredDuringExecution:
    #   nodeSelectorTerms:
    #     - matchExpressions:
    #         - key: role
    #           operator: In
    #           values:
    #             - ceph
  # -- Enable Ceph CSI Liveness sidecar deployment
  enableLiveness: false

  # -- CSI CephFS driver metrics port
  # @default -- `9081`
  cephfsLivenessMetricsPort:

  # -- CSI Addons server port
  # @default -- `9070`
  csiAddonsPort:

  # -- Enable Ceph Kernel clients on kernel < 4.17. If your kernel does not support quotas for CephFS
  # you may want to disable this setting. However, this will cause an issue during upgrades
  # with the FUSE client. See the [upgrade guide](https://rook.io/docs/rook/v1.2/ceph-upgrade.html)
  forceCephFSKernelClient: true

  # -- Ceph CSI RBD driver metrics port
  # @default -- `8080`
  rbdLivenessMetricsPort:

  serviceMonitor:
    # -- Enable ServiceMonitor for Ceph CSI drivers
    enabled: false
    # -- Service monitor scrape interval
    interval: 10s
    # -- ServiceMonitor additional labels
    labels: {}
    # -- Use a different namespace for the ServiceMonitor
    namespace:

  # -- Kubelet root directory path (if the Kubelet uses a different path for the `--root-dir` flag)
  # @default -- `/var/lib/kubelet`
  kubeletDirPath:

  # -- Duration in seconds that non-leader candidates will wait to force acquire leadership.
  # @default -- `137s`
  csiLeaderElectionLeaseDuration:

  # -- Deadline in seconds that the acting leader will retry refreshing leadership before giving up.
  # @default -- `107s`
  csiLeaderElectionRenewDeadline:

  # -- Retry period in seconds the LeaderElector clients should wait between tries of actions.
  # @default -- `26s`
  csiLeaderElectionRetryPeriod:

  cephcsi:
    # -- Ceph CSI image repository
    repository: quay.io/cephcsi/cephcsi
    # -- Ceph CSI image tag
    tag: v3.13.0

  registrar:
    # -- Kubernetes CSI registrar image repository
    repository: registry.k8s.io/sig-storage/csi-node-driver-registrar
    # -- Registrar image tag
    tag: v2.11.1

  provisioner:
    # -- Kubernetes CSI provisioner image repository
    repository: registry.k8s.io/sig-storage/csi-provisioner
    # -- Provisioner image tag
    tag: v5.0.1

  snapshotter:
    # -- Kubernetes CSI snapshotter image repository
    repository: registry.k8s.io/sig-storage/csi-snapshotter
    # -- Snapshotter image tag
    tag: v8.2.0

  attacher:
    # -- Kubernetes CSI Attacher image repository
    repository: registry.k8s.io/sig-storage/csi-attacher
    # -- Attacher image tag
    tag: v4.6.1

  resizer:
    # -- Kubernetes CSI resizer image repository
    repository: registry.k8s.io/sig-storage/csi-resizer
    # -- Resizer image tag
    tag: v1.11.1

  # -- Image pull policy
  imagePullPolicy: IfNotPresent

  # -- Labels to add to the CSI CephFS Deployments and DaemonSets Pods
  cephfsPodLabels: #"key1=value1,key2=value2"

  # -- Labels to add to the CSI NFS Deployments and DaemonSets Pods
  nfsPodLabels: #"key1=value1,key2=value2"

  # -- Labels to add to the CSI RBD Deployments and DaemonSets Pods
  rbdPodLabels: #"key1=value1,key2=value2"

  csiAddons:
    # -- Enable CSIAddons
    enabled: false
    # -- CSIAddons sidecar image repository
    repository: quay.io/csiaddons/k8s-sidecar
    # -- CSIAddons sidecar image tag
    tag: v0.11.0

  nfs:
    # -- Enable the nfs csi driver
    enabled: false

  topology:
    # -- Enable topology based provisioning
    enabled: false
    # NOTE: the value here serves as an example and needs to be
    # updated with node labels that define domains of interest
    # -- domainLabels define which node labels to use as domains
    # for CSI nodeplugins to advertise their domains
    domainLabels:
    # - kubernetes.io/hostname
    # - topology.kubernetes.io/zone
    # - topology.rook.io/rack

  # -- Whether to skip any attach operation altogether for CephFS PVCs. See more details
  # [here](https://kubernetes-csi.github.io/docs/skip-attach.html#skip-attach-with-csi-driver-object).
  # If cephFSAttachRequired is set to false it skips the volume attachments and makes the creation
  # of pods using the CephFS PVC fast. **WARNING** It's highly discouraged to use this for
  # CephFS RWO volumes. Refer to this [issue](https://github.com/kubernetes/kubernetes/issues/103305) for more details.
  cephFSAttachRequired: true
  # -- Whether to skip any attach operation altogether for RBD PVCs. See more details
  # [here](https://kubernetes-csi.github.io/docs/skip-attach.html#skip-attach-with-csi-driver-object).
  # If set to false it skips the volume attachments and makes the creation of pods using the RBD PVC fast.
  # **WARNING** It's highly discouraged to use this for RWO volumes as it can cause data corruption.
  # csi-addons operations like Reclaimspace and PVC Keyrotation will also not be supported if set
  # to false since we'll have no VolumeAttachments to determine which node the PVC is mounted on.
  # Refer to this [issue](https://github.com/kubernetes/kubernetes/issues/103305) for more details.
  rbdAttachRequired: true
  # -- Whether to skip any attach operation altogether for NFS PVCs. See more details
  # [here](https://kubernetes-csi.github.io/docs/skip-attach.html#skip-attach-with-csi-driver-object).
  # If cephFSAttachRequired is set to false it skips the volume attachments and makes the creation
  # of pods using the NFS PVC fast. **WARNING** It's highly discouraged to use this for
  # NFS RWO volumes. Refer to this [issue](https://github.com/kubernetes/kubernetes/issues/103305) for more details.
  nfsAttachRequired: true

# -- Enable discovery daemon
enableDiscoveryDaemon: false
# -- Set the discovery daemon device discovery interval (default to 60m)
discoveryDaemonInterval: 60m

# -- The timeout for ceph commands in seconds
cephCommandsTimeoutSeconds: "15"

# -- If true, run rook operator on the host network
useOperatorHostNetwork:

# -- If true, scale down the rook operator.
# This is useful for administrative actions where the rook operator must be scaled down, while using gitops style tooling
# to deploy your helm charts.
scaleDownOperator: false

## Rook Discover configuration
## toleration: NoSchedule, PreferNoSchedule or NoExecute
## tolerationKey: Set this to the specific key of the taint to tolerate
## tolerations: Array of tolerations in YAML format which will be added to agent deployment
## nodeAffinity: Set to labels of the node to match

discover:
  # -- Toleration for the discover pods.
  # Options: `NoSchedule`, `PreferNoSchedule` or `NoExecute`
  toleration:
  # -- The specific key of the taint to tolerate
  tolerationKey:
  # -- Array of tolerations in YAML format which will be added to discover deployment
  tolerations:
  #   - key: key
  #     operator: Exists
  #     effect: NoSchedule
  # -- The node labels for affinity of `discover-agent` [^1]
  nodeAffinity:
  #   key1=value1,value2; key2=value3
  #
  #   or
  #
  #   requiredDuringSchedulingIgnoredDuringExecution:
  #     nodeSelectorTerms:
  #       - matchExpressions:
  #           - key: storage-node
  #             operator: Exists
  # -- Labels to add to the discover pods
  podLabels: # "key1=value1,key2=value2"
  # -- Add resources to discover daemon pods
  resources:
  #   - limits:
  #       memory: 512Mi
  #   - requests:
  #       cpu: 100m
  #       memory: 128Mi

# -- Runs Ceph Pods as privileged to be able to write to `hostPaths` in OpenShift with SELinux restrictions.
hostpathRequiresPrivileged: false

# -- Whether to create all Rook pods to run on the host network, for example in environments where a CNI is not enabled
enforceHostNetwork: false

# -- Disable automatic orchestration when new devices are discovered.
disableDeviceHotplug: false

# -- The revision history limit for all pods created by Rook. If blank, the K8s default is 10.
revisionHistoryLimit:

# -- Blacklist certain disks according to the regex provided.
discoverDaemonUdev:

# -- imagePullSecrets option allow to pull docker images from private docker registry. Option will be passed to all service accounts.
imagePullSecrets:
# - name: my-registry-secret

# -- Whether the OBC provisioner should watch on the operator namespace or not, if not the namespace of the cluster will be used
enableOBCWatchOperatorNamespace: true

# -- Specify the prefix for the OBC provisioner in place of the cluster namespace
# @default -- `ceph cluster namespace`
obcProvisionerNamePrefix:

monitoring:
  # -- Enable monitoring. Requires Prometheus to be pre-installed.
  # Enabling will also create RBAC rules to allow Operator to create ServiceMonitors
  enabled: false

Verify the installation:

ziyuan@pve-gmk-ubuntu:~/k8s$ kubectl get pod -n rook-ceph
NAME                                                     READY   STATUS      RESTARTS        AGE
rook-ceph-operator-68d9f8b984-w7tmj                      1/1     Running     0               18m
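
Since crds.enabled is true in the values above, the chart also installs the Rook CRDs. A quick extra sanity check (a minimal sketch; the exact resource list depends on the chart version):

# List the CRDs registered by the chart, then tail the operator log for errors
$ kubectl api-resources --api-group=ceph.rook.io
$ kubectl -n rook-ceph logs deploy/rook-ceph-operator | tail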

2. Configure the Rook Ceph cluster

With the operator in place, the cluster itself can be configured through the k8s CRDs.

$ kubectl apply -f cluster.yaml

The contents of cluster.yaml are as follows. I only changed the storage.nodes devices configuration, manually specifying each node and its disk (see the note after the manifest for how I picked the device names):

#################################################################################################################
# Define the settings for the rook-ceph cluster with common settings for a production cluster.
# All nodes with available raw devices will be used for the Ceph cluster. At least three nodes are required
# in this example. See the documentation for more details on storage settings available.

# For example, to create the cluster:
#   kubectl create -f crds.yaml -f common.yaml -f operator.yaml
#   kubectl create -f cluster.yaml
#################################################################################################################

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  cephVersion:
    # The container image used to launch the Ceph daemon pods (mon, mgr, osd, mds, rgw).
    # v18 is Reef, v19 is Squid
    # RECOMMENDATION: In production, use a specific version tag instead of the general v19 flag, which pulls the latest release and could result in different
    # versions running within the cluster. See tags available at https://hub.docker.com/r/ceph/ceph/tags/.
    # If you want to be more precise, you can always use a timestamp tag such as quay.io/ceph/ceph:v19.2.0-20240927
    # This tag might not contain a new Ceph version, just security fixes from the underlying operating system, which will reduce vulnerabilities
    image: quay.io/ceph/ceph:v19.2.0
    # Whether to allow unsupported versions of Ceph. Currently Reef and Squid are supported.
    # Future versions such as Tentacle (v20) would require this to be set to `true`.
    # Do not set to true in production.
    allowUnsupported: false
  # The path on the host where configuration files will be persisted. Must be specified. If there are multiple clusters, the directory must be unique for each cluster.
  # Important: if you reinstall the cluster, make sure you delete this directory from each host or else the mons will fail to start on the new cluster.
  # In Minikube, the '/data' directory is configured to persist across reboots. Use "/data/rook" in Minikube environment.
  dataDirHostPath: /var/lib/rook
  # Whether or not upgrade should continue even if a check fails
  # This means Ceph's status could be degraded and we don't recommend upgrading but you might decide otherwise
  # Use at your OWN risk
  # To understand Rook's upgrade process of Ceph, read https://rook.io/docs/rook/latest/ceph-upgrade.html#ceph-version-upgrades
  skipUpgradeChecks: false
  # Whether or not continue if PGs are not clean during an upgrade
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  # WaitTimeoutForHealthyOSDInMinutes defines the time (in minutes) the operator would wait before an OSD can be stopped for upgrade or restart.
  # If the timeout exceeds and OSD is not ok to stop, then the operator would skip upgrade for the current OSD and proceed with the next one
  # if `continueUpgradeAfterChecksEvenIfNotHealthy` is `false`. If `continueUpgradeAfterChecksEvenIfNotHealthy` is `true`, then operator would
  # continue with the upgrade of an OSD even if its not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
  # The default wait timeout is 10 minutes.
  waitTimeoutForHealthyOSDInMinutes: 10
  # Whether or not requires PGs are clean before an OSD upgrade. If set to `true` OSD upgrade process won't start until PGs are healthy.
  # This configuration will be ignored if `skipUpgradeChecks` is `true`.
  # Default is false.
  upgradeOSDRequiresHealthyPGs: false
  mon:
    # Set the number of mons to be started. Generally recommended to be 3.
    # For highest availability, an odd number of mons should be specified.
    count: 3
    # The mons should be on unique nodes. For production, at least 3 nodes are recommended for this reason.
    # Mons should only be allowed on the same node for test environments where data loss is acceptable.
    allowMultiplePerNode: false
  mgr:
    # When higher availability of the mgr is needed, increase the count to 2.
    # In that case, one mgr will be active and one in standby. When Ceph updates which
    # mgr is active, Rook will update the mgr services to match the active mgr.
    count: 2
    allowMultiplePerNode: false
    modules:
      # List of modules to optionally enable or disable.
      # Note the "dashboard" and "monitoring" modules are already configured by other settings in the cluster CR.
      - name: rook
        enabled: true
  # enable the ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
    # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
    # urlPrefix: /ceph-dashboard
    # serve the dashboard at the given port.
    # port: 8443
    # serve the dashboard using SSL
    ssl: false
    # The url of the Prometheus instance
    # prometheusEndpoint: <protocol>://<prometheus-host>:<port>
    # Whether SSL should be verified if the Prometheus server is using https
    # prometheusEndpointSSLVerify: false
  # enable prometheus alerting for cluster
  monitoring:
    # requires Prometheus to be pre-installed
    enabled: false
    # Whether to disable the metrics reported by Ceph. If false, the prometheus mgr module and Ceph exporter are enabled.
    # If true, the prometheus mgr module and Ceph exporter are both disabled. Default is false.
    metricsDisabled: false
    # Ceph exporter metrics config.
    exporter:
      # Specifies which performance counters are exported.
      # Corresponds to --prio-limit Ceph exporter flag
      # 0 - all counters are exported
      perfCountersPrioLimit: 5
      # Time to wait before sending requests again to exporter server (seconds)
      # Corresponds to --stats-period Ceph exporter flag
      statsPeriodSeconds: 5
  network:
    connections:
      # Whether to encrypt the data in transit across the wire to prevent eavesdropping the data on the network.
      # The default is false. When encryption is enabled, all communication between clients and Ceph daemons, or between Ceph daemons will be encrypted.
      # When encryption is not enabled, clients still establish a strong initial authentication and data integrity is still validated with a crc check.
      # IMPORTANT: Encryption requires the 5.11 kernel for the latest nbd and cephfs drivers. Alternatively for testing only,
      # you can set the "mounter: rbd-nbd" in the rbd storage class, or "mounter: fuse" in the cephfs storage class.
      # The nbd and fuse drivers are *not* recommended in production since restarting the csi driver pod will disconnect the volumes.
      encryption:
        enabled: false
      # Whether to compress the data in transit across the wire. The default is false.
      # See the kernel requirements above for encryption.
      compression:
        enabled: false
      # Whether to require communication over msgr2. If true, the msgr v1 port (6789) will be disabled
      # and clients will be required to connect to the Ceph cluster with the v2 port (3300).
      # Requires a kernel that supports msgr v2 (kernel 5.11 or CentOS 8.4 or newer).
      requireMsgr2: false
    # enable host networking
    #provider: host
    # enable the Multus network provider
    #provider: multus
    #selectors:
    #  The selector keys are required to be `public` and `cluster`.
    #  Based on the configuration, the operator will do the following:
    #    1. if only the `public` selector key is specified both public_network and cluster_network Ceph settings will listen on that interface
    #    2. if both `public` and `cluster` selector keys are specified the first one will point to 'public_network' flag and the second one to 'cluster_network'
    #
    #  In order to work, each selector value must match a NetworkAttachmentDefinition object in Multus
    #
    #  public: public-conf --> NetworkAttachmentDefinition object name in Multus
    #  cluster: cluster-conf --> NetworkAttachmentDefinition object name in Multus
    # Provide internet protocol version. IPv6, IPv4 or empty string are valid options. Empty string would mean IPv4
    #ipFamily: "IPv6"
    # Ceph daemons to listen on both IPv4 and Ipv6 networks
    #dualStack: false
    # Enable multiClusterService to export the mon and OSD services to peer cluster.
    # This is useful to support RBD mirroring between two clusters having overlapping CIDRs.
    # Ensure that peer clusters are connected using an MCS API compatible application, like Globalnet Submariner.
    #multiClusterService:
    #  enabled: false

  # enable the crash collector for ceph daemon crash collection
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    #daysToRetain: 30
  # enable log collector, daemons will log on files and rotate
  logCollector:
    enabled: true
    periodicity: daily # one of: hourly, daily, weekly, monthly
    maxLogSize: 500M # SUFFIX may be 'M' or 'G'. Must be at least 1M.
  # automate [data cleanup process](https://github.com/rook/rook/blob/master/Documentation/Storage-Configuration/ceph-teardown.md#delete-the-data-on-hosts) in cluster destruction.
  cleanupPolicy:
    # Since cluster cleanup is destructive to data, confirmation is required.
    # To destroy all Rook data on hosts during uninstall, confirmation must be set to "yes-really-destroy-data".
    # This value should only be set when the cluster is about to be deleted. After the confirmation is set,
    # Rook will immediately stop configuring the cluster and only wait for the delete command.
    # If the empty string is set, Rook will not destroy any data on hosts during uninstall.
    confirmation: ""
    # sanitizeDisks represents settings for sanitizing OSD disks on cluster deletion
    sanitizeDisks:
      # method indicates if the entire disk should be sanitized or simply ceph's metadata
      # in both case, re-install is possible
      # possible choices are 'complete' or 'quick' (default)
      method: quick
      # dataSource indicate where to get random bytes from to write on the disk
      # possible choices are 'zero' (default) or 'random'
      # using random sources will consume entropy from the system and will take much more time then the zero source
      dataSource: zero
      # iteration overwrite N times instead of the default (1)
      # takes an integer value
      iteration: 1
    # allowUninstallWithVolumes defines how the uninstall should be performed
    # If set to true, cephCluster deletion does not wait for the PVs to be deleted.
    allowUninstallWithVolumes: false
  # To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
  # The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
  # tolerate taints with a key of 'storage-node'.
  placement:
    # all:
    #   nodeAffinity:
    #     requiredDuringSchedulingIgnoredDuringExecution:
    #       nodeSelectorTerms:
    #         - matchExpressions:
    #             - key: role
    #               operator: In
    #               values:
    #                 - ceph
    #   podAffinity:
    #   podAntiAffinity:
    #   topologySpreadConstraints:
    #   tolerations:
  # The above placement information can also be specified for mon, osd, and mgr components
  #   mon:
  # Monitor deployments may contain an anti-affinity rule for avoiding monitor
  # collocation on the same node. This is a required rule when host network is used
  # or when AllowMultiplePerNode is false. Otherwise this anti-affinity rule is a
  # preferred rule with weight: 50.
  #   osd:
  #    prepareosd:
  #    mgr:
  #    cleanup:
  annotations:
  #   all:
  #   mon:
  #   mgr:
  #   osd:
  #   exporter:
  #   crashcollector:
  #   cleanup:
  #   prepareosd:
  # cmdreporter is for jobs to detect ceph and csi versions, and check network status
  #   cmdreporter:
  # clusterMetadata annotations will be applied to only `rook-ceph-mon-endpoints` configmap and the `rook-ceph-mon` and `rook-ceph-admin-keyring` secrets.
  # And clusterMetadata annotations will not be merged with `all` annotations.
  #    clusterMetadata:
  #       kubed.appscode.com/sync: "true"
  # If no mgr annotations are set, prometheus scrape annotations will be set by default.
  #   mgr:
  labels:
  #   all:
  #   mon:
  #   osd:
  #   cleanup:
  #   mgr:
  #   prepareosd:
  # These labels are applied to ceph-exporter servicemonitor only
  #   exporter:
  # monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
  # These labels can be passed as LabelSelector to Prometheus
  #   monitoring:
  #   crashcollector:
  resources:
  #The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
  #   mgr:
  #     limits:
  #       memory: "1024Mi"
  #     requests:
  #       cpu: "500m"
  #       memory: "1024Mi"
  # The above example requests/limits can also be added to the other components
  #   mon:
  #   osd:
  # For OSD it also is a possible to specify requests/limits based on device class
  #   osd-hdd:
  #   osd-ssd:
  #   osd-nvme:
  #   prepareosd:
  #   mgr-sidecar:
  #   crashcollector:
  #   logcollector:
  #   cleanup:
  #   exporter:
  #   cmd-reporter:
  # The option to automatically remove OSDs that are out and are safe to destroy.
  removeOSDsIfOutAndSafeToRemove: false
  priorityClassNames:
    #all: rook-ceph-default-priority-class
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
    #crashcollector: rook-ceph-crashcollector-priority-class
  storage: # cluster level storage configuration and selection
    useAllNodes: false
    useAllDevices: false
    #deviceFilter:
    config:
      # crushRoot: "custom-root" # specify a non-default root label for the CRUSH map
      # metadataDevice: "md0" # specify a non-rotational storage so ceph-volume will use it as block db device of bluestore.
      # databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
      # osdsPerDevice: "1" # this value can be overridden at the node or device level
      # encryptedDevice: "true" # the default value for this option is "false"
      # deviceClass: "myclass" # specify a device class for OSDs in the cluster
    allowDeviceClassUpdate: false # whether to allow changing the device class of an OSD after it is created
    allowOsdCrushWeightUpdate: false # whether to allow resizing the OSD crush weight after osd pvc is increased
    # Individual nodes and their config can be specified as well, but 'useAllNodes' above must be set to false. Then, only the named
    # nodes below will be used as storage resources.  Each node's 'name' field should match their 'kubernetes.io/hostname' label.
    nodes:
      - name: pve-ubuntu
        devices:
          - name: vda
      - name: pve-gmk-ubuntu
        devices:
          - name: vda
      - name: nhan-ubuntu
        devices:
          - name: vdb
    # nodes:
    #   - name: "172.17.4.201"
    #     devices: # specific devices to use for storage can be specified for each node
    #       - name: "sdb"
    #       - name: "nvme01" # multiple osds can be created on high performance devices
    #         config:
    #           osdsPerDevice: "5"
    #       - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX" # devices can be specified using full udev paths
    #     config: # configuration can be specified at the node level which overrides the cluster level config
    #   - name: "172.17.4.301"
    #     deviceFilter: "^sd."
    # Whether to always schedule OSD pods on nodes declared explicitly in the "nodes" section, even if they are
    # temporarily not schedulable. If set to true, consider adding placement tolerations for unschedulable nodes.
    scheduleAlways: false
    # when onlyApplyOSDPlacement is false, will merge both placement.All() and placement.osd
    onlyApplyOSDPlacement: false
    # Time for which an OSD pod will sleep before restarting, if it stopped due to flapping
    # flappingRestartIntervalHours: 24
    # The ratio at which Ceph should block IO if the OSDs are too full. The default is 0.95.
    # fullRatio: 0.95
    # The ratio at which Ceph should stop backfilling data if the OSDs are too full. The default is 0.90.
    # backfillFullRatio: 0.90
    # The ratio at which Ceph should raise a health warning if the OSDs are almost full. The default is 0.85.
    # nearFullRatio: 0.85
  # The section for configuring management of daemon disruptions during upgrade or fencing.
  disruptionManagement:
    # If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
    # via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will
    # block eviction of OSDs by default and unblock them safely when drains are detected.
    managePodBudgets: true
    # A duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the
    # default DOWN/OUT interval) when it is draining. This is only relevant when  `managePodBudgets` is `true`. The default value is `30` minutes.
    osdMaintenanceTimeout: 30
    # A duration in minutes that the operator will wait for the placement groups to become healthy (active+clean) after a drain was completed and OSDs came back up.
    # Operator will continue with the next drain if the timeout exceeds. It only works if `managePodBudgets` is `true`.
    # No values or 0 means that the operator will wait until the placement groups are healthy before unblocking the next drain.
    pgHealthCheckTimeout: 0

  # csi defines CSI Driver settings applied per cluster.
  csi:
    readAffinity:
      # Enable read affinity to enable clients to optimize reads from an OSD in the same topology.
      # Enabling the read affinity may cause the OSDs to consume some extra memory.
      # For more details see this doc:
      # https://rook.io/docs/rook/latest/Storage-Configuration/Ceph-CSI/ceph-csi-drivers/#enable-read-affinity-for-rbd-volumes
      enabled: false

    # cephfs driver specific settings.
    cephfs:
      # Set CephFS Kernel mount options to use https://docs.ceph.com/en/latest/man/8/mount.ceph/#options.
      # kernelMountOptions: ""
      # Set CephFS Fuse mount options to use https://docs.ceph.com/en/latest/man/8/ceph-fuse/#options.
      # fuseMountOptions: ""

  # healthChecks
  # Valid values for daemons are 'mon', 'osd', 'status'
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    # Change pod liveness probe timing or threshold values. Works for all mon,mgr,osd daemons.
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    # Change pod startup probe timing or threshold values. Works for all mon,mgr,osd daemons.
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
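
As mentioned above, this is how I picked the device names listed under storage.nodes: Rook only creates OSDs on raw devices that have no partitions and no filesystem, so I checked each node with lsblk first (a minimal sketch; the node and device names are specific to my setup):

# An eligible disk shows an empty FSTYPE and no child partitions
$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
# After applying cluster.yaml, the osd-prepare job logs explain why a device was accepted or skipped
$ kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --tail=50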

Verify the installation:

ziyuan@pve-gmk-ubuntu:~/k8s$ kubectl get pod -n rook-ceph
NAME                                                     READY   STATUS      RESTARTS        AGE
csi-cephfsplugin-29kd9                                   3/3     Running     2 (3d16h ago)   3d18h
csi-cephfsplugin-k572j                                   3/3     Running     0               3d18h
csi-cephfsplugin-provisioner-74b95c5758-6k7w9            6/6     Running     0               3d18h
csi-cephfsplugin-provisioner-74b95c5758-ktxhw            6/6     Running     0               3d18h
csi-cephfsplugin-s77bw                                   3/3     Running     0               3d18h
csi-rbdplugin-9jshj                                      3/3     Running     0               3d18h
csi-rbdplugin-kl4sv                                      3/3     Running     1 (3d18h ago)   3d18h
csi-rbdplugin-provisioner-5fd9fbf6f8-225sl               6/6     Running     0               3d18h
csi-rbdplugin-provisioner-5fd9fbf6f8-w9r5s               6/6     Running     0               3d18h
csi-rbdplugin-skjc8                                      3/3     Running     0               3d18h
rook-ceph-crashcollector-nhan-ubuntu-ddbddd8d9-2frpp     1/1     Running     0               3d17h
rook-ceph-crashcollector-pve-gmk-ubuntu-89bf6dcf-cwcmg   1/1     Running     0               3d17h
rook-ceph-crashcollector-pve-ubuntu-68cff9fd47-dnbpf     1/1     Running     0               3d17h
rook-ceph-exporter-nhan-ubuntu-745dbcf58c-hqcq4          1/1     Running     0               3d17h
rook-ceph-exporter-pve-gmk-ubuntu-6f777d855c-fd54v       1/1     Running     0               3d17h
rook-ceph-exporter-pve-ubuntu-76f44c66fc-2hlw7           1/1     Running     0               3d17h
rook-ceph-mgr-a-c69f44568-bnvlh                          3/3     Running     0               3d17h
rook-ceph-mgr-b-d66bcd6d7-6xqqg                          3/3     Running     0               3d17h
rook-ceph-mon-a-54c677947c-4bhnf                         2/2     Running     1 (31h ago)     3d18h
rook-ceph-mon-b-fdf46d6f8-nb9jb                          2/2     Running     0               3d17h
rook-ceph-mon-c-6498cbd495-lfvp4                         2/2     Running     0               3d17h
rook-ceph-operator-68d9f8b984-w7tmj                      1/1     Running     0               3d18h
rook-ceph-osd-0-6dccbf576-2vq6q                          2/2     Running     0               3d17h
rook-ceph-osd-1-79cd54f5bd-ppkpr                         2/2     Running     0               3d17h
rook-ceph-osd-2-6d75f79768-rkmrb                         2/2     Running     0               3d17h
rook-ceph-osd-prepare-nhan-ubuntu-hgrqf                  0/1     Completed   0               3d17h
rook-ceph-osd-prepare-pve-gmk-ubuntu-v4gwg               0/1     Completed   0               3d17h
rook-ceph-osd-prepare-pve-ubuntu-x5vz4                   0/1     Completed   0               3d17h
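
Besides the pod list, the CephCluster resource itself reports the deployment phase and health (a small sketch; the printed columns vary by Rook version):

$ kubectl -n rook-ceph get cephcluster rook-ceph
# Expect PHASE Ready and HEALTH HEALTH_OK (or HEALTH_WARN with a reason)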

Finally, install the toolbox and expose the dashboard:

$ kubectl apply -f dashboard-ingress.yaml
$ kubectl apply -f toolbox.yaml

The contents of dashboard-ingress.yaml are as follows:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rook-ceph-mgr-dashboard
  namespace: rook-ceph
spec:
  ingressClassName: nginx
  rules:
  - host: rook-ceph-dashboard.ziyuan360.host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rook-ceph-mgr-dashboard
            port:
              number: 7000
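
The Ingress targets port 7000 because ssl: false is set in cluster.yaml, which is the dashboard's plain-HTTP port. The admin password is generated by the operator and, per the Rook docs, can be read from a secret (a sketch):

$ kubectl -n rook-ceph get secret rook-ceph-dashboard-password \
    -o jsonpath="{['data']['password']}" | base64 --decode && echo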

The contents of toolbox.yaml are as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-tools
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rook-ceph-tools
  template:
    metadata:
      labels:
        app: rook-ceph-tools
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccountName: rook-ceph-default
      containers:
        - name: rook-ceph-tools
          image: quay.io/ceph/ceph:v19
          command:
            - /bin/bash
            - -c
            - |
              # Replicate the script from toolbox.sh inline so the ceph image
              # can be run directly, instead of requiring the rook toolbox
              CEPH_CONFIG="/etc/ceph/ceph.conf"
              MON_CONFIG="/etc/rook/mon-endpoints"
              KEYRING_FILE="/etc/ceph/keyring"

              # create a ceph config file in its default location so ceph/rados tools can be used
              # without specifying any arguments
              write_endpoints() {
                endpoints=$(cat ${MON_CONFIG})

                # filter out the mon names
                # external cluster can have numbers or hyphens in mon names, handling them in regex
                # shellcheck disable=SC2001
                mon_endpoints=$(echo "${endpoints}"| sed 's/[a-z0-9_-]\+=//g')

                DATE=$(date)
                echo "$DATE writing mon endpoints to ${CEPH_CONFIG}: ${endpoints}"
                  cat <<EOF > ${CEPH_CONFIG}
              [global]
              mon_host = ${mon_endpoints}

              [client.admin]
              keyring = ${KEYRING_FILE}
              EOF
              }

              # watch the endpoints config file and update if the mon endpoints ever change
              watch_endpoints() {
                # get the timestamp for the target of the soft link
                real_path=$(realpath ${MON_CONFIG})
                initial_time=$(stat -c %Z "${real_path}")
                while true; do
                  real_path=$(realpath ${MON_CONFIG})
                  latest_time=$(stat -c %Z "${real_path}")

                  if [[ "${latest_time}" != "${initial_time}" ]]; then
                    write_endpoints
                    initial_time=${latest_time}
                  fi

                  sleep 10
                done
              }

              # read the secret from an env var (for backward compatibility), or from the secret file
              ceph_secret=${ROOK_CEPH_SECRET}
              if [[ "$ceph_secret" == "" ]]; then
                ceph_secret=$(cat /var/lib/rook-ceph-mon/secret.keyring)
              fi

              # create the keyring file
              cat <<EOF > ${KEYRING_FILE}
              [${ROOK_CEPH_USERNAME}]
              key = ${ceph_secret}
              EOF

              # write the initial config file
              write_endpoints

              # continuously update the mon endpoints if they fail over
              watch_endpoints
          imagePullPolicy: IfNotPresent
          tty: true
          securityContext:
            runAsNonRoot: true
            runAsUser: 2016
            runAsGroup: 2016
            capabilities:
              drop: ["ALL"]
          env:
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: rook-ceph-mon
                  key: ceph-username
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-config
            - name: mon-endpoint-volume
              mountPath: /etc/rook
            - name: ceph-admin-secret
              mountPath: /var/lib/rook-ceph-mon
              readOnly: true
      volumes:
        - name: ceph-admin-secret
          secret:
            secretName: rook-ceph-mon
            optional: false
            items:
              - key: ceph-secret
                path: secret.keyring
        - name: mon-endpoint-volume
          configMap:
            name: rook-ceph-mon-endpoints
            items:
              - key: data
                path: mon-endpoints
        - name: ceph-config
          emptyDir: {}
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 5

Use the toolbox to verify that the cluster is healthy:

ziyuan@pve-gmk-ubuntu:~/k8s$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
bash-5.1$ ceph status
  cluster:
    id:     87554414-e7da-449e-bc2e-549a11a92c1c
    health: HEALTH_WARN
            mon a is low on available space
 
  services:
    mon: 3 daemons, quorum b,a,c (age 4m)
    mgr: a(active, since 3d), standbys: b
    osd: 3 osds: 3 up (since 3d), 3 in (since 3d)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   81 MiB used, 60 GiB / 60 GiB avail
    pgs:     1 active+clean

In my case the WARN came from a node running low on disk space; expanding the disk fixes it.
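
For reference, ceph health detail (run through the toolbox) prints more detail about the warning, i.e. which mon is affected and how much space is left, which is worth checking before resizing anything:

$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail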


Bpazy commented Jan 24, 2025

Create a CephFS for K8s Pods

First, create the filesystem:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - name: replicated
      replicated:
        size: 3
  # When true, the underlying data is preserved if the filesystem is deleted
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
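
Once applied, the operator creates the MDS pods plus the metadata and data pools for the filesystem. To confirm (a sketch; pool names follow the <fsName>-<dataPool name> convention, i.e. myfs-replicated here):

$ kubectl -n rook-ceph get cephfilesystem myfs
$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs ls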

Then create the StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs-monitoring
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  # clusterID is the namespace where the rook cluster is running
  # If you change this namespace, also change the namespace below where the secret namespaces are defined
  clusterID: rook-ceph

  # CephFS filesystem name into which the volume shall be created
  fsName: myfs

  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: myfs-replicated

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

# Retain the data when the PVC is deleted
reclaimPolicy: Retain
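
Note that the pool parameter myfs-replicated is simply the data pool created by the CephFilesystem above (<filesystem name>-<dataPool name>). A quick check that the StorageClass registered (sketch):

$ kubectl get storageclass rook-cephfs-monitoring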

Next, create the PVC; in my case it will be used by Uptime Kuma:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-uptime
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: rook-cephfs-monitoring
  resources:
    requests:
      storage: 2Gi
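
The PVC should reach Bound shortly after creation; if it stays Pending, the CephFS provisioner is the place to look (a sketch, using the deployment and container names from the pod listing above):

$ kubectl -n monitoring get pvc cephfs-uptime
$ kubectl -n rook-ceph logs deploy/csi-cephfsplugin-provisioner -c csi-provisioner --tail=20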

Finally, configure the Pod to mount it:

# other configuration omitted ...
    spec:
      containers:
      - name: uptime-kuma
        image: louislam/uptime-kuma:1.23.13
        ports:
        - containerPort: 3001
        volumeMounts:
        - name: uptime-kuma-pvc-local
          mountPath: /app/data
      volumes:
      - name: uptime-kuma-pvc-local
        persistentVolumeClaim:
          claimName: cephfs-uptime
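
Once the pod is running, the CephFS mount can be confirmed from inside it (a sketch, assuming the Deployment is named uptime-kuma):

# /app/data should show a ceph filesystem roughly the size of the PVC
$ kubectl -n monitoring exec deploy/uptime-kuma -- df -h /app/data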


Bpazy commented Jan 24, 2025

Benchmarking Ceph

Create a utility Pod:

apiVersion: v1
kind: Pod
metadata:
  name: migration-pod
  namespace: monitoring
spec:
  volumes:
    - name: cephfs-uptime
      persistentVolumeClaim:
        claimName: cephfs-uptime
  containers:
    - name: ubuntu-container
      image: ubuntu:latest
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 5; done;"]
      volumeMounts:
        - name: cephfs-uptime
          mountPath: /data/uptime2

Start the Pod and exec into it:

ziyuan@pve-gmk-ubuntu:~/k8s$ kubectl -n monitoring exec -it migration-pod -- /bin/bash

Run the benchmark (remember to add oflag=direct to bypass the page cache):

root@migration-pod:/data/uptime2# dd if=/dev/zero of=here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 104.89 s, 10.2 MB/s

root@migration-pod:/data/uptime2# dd if=/dev/zero of=512 bs=512 count=1000 oflag=direct
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 66.6174 s, 7.7 kB/s

The performance is abysmal; the next step is to investigate why.
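
For some perspective on the numbers: with oflag=direct every dd write is a single synchronous round trip to the cluster, so 7.7 kB/s at 512-byte blocks is roughly 15 writes per second, i.e. about 65 ms per write, which points at per-request latency rather than raw bandwidth. A fio run with some queue depth would separate the two (a sketch; fio has to be installed in the pod first, e.g. apt-get update && apt-get install -y fio):

# 4k random writes, direct I/O, queue depth 16, 60 seconds
root@migration-pod:/data/uptime2# fio --name=cephfs-randwrite --directory=/data/uptime2 \
    --rw=randwrite --bs=4k --size=256m --direct=1 --ioengine=libaio \
    --iodepth=16 --numjobs=1 --runtime=60 --time_based --group_reporting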
