
APP.4.4.A19 #45

Open
sluetze opened this issue Nov 7, 2023 · 7 comments
Labels: new-rules (Issue which requires us to write new rules)

Comments

sluetze commented Nov 7, 2023

No description provided.

ermeratos commented

A Kubernetes operation SHOULD be set up in such a way that if a site fails, the clusters (and
thus the applications in the pods) either continue to run without interruption or can be
restarted in a short time at another site.

Should a restart be required, all the necessary configuration files, images, user data, network
connections, and other resources required for operation (including the necessary hardware)
SHOULD already be available at the alternative site.

For the uninterrupted operation of clusters, the control plane of Kubernetes, the infrastructure
applications of the clusters, and the pods of the applications SHOULD be distributed across
several fire zones based on the location data of the corresponding nodes so that the failure of a
fire zone will not lead to the failure of an application.

ermeratos added the not-checkable (Requirement can not be checked with Compliance Operator) and org-only (This Requirement of BSI is ONLY an organizational Requirement) labels on Dec 15, 2023
ermeratos moved this from Todo to Evaluation in sig-bsi-grundschutz tracking on Dec 15, 2023
benruland commented Dec 18, 2023

For the uninterrupted operation of clusters, the control plane of Kubernetes, the infrastructure
applications of the clusters, and the pods of the applications SHOULD be distributed across
several fire zones based on the location data of the corresponding nodes so that the failure of a
fire zone will not lead to the failure of an application.

We could check whether the nodes (potentially separately for master and worker nodes) have the topology.kubernetes.io/zone label set. This would indicate a distribution of nodes across "fire zones".
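
A minimal sketch of that check, assuming direct cluster access via the Python kubernetes client; the actual rule would be evaluated against the API resources collected by the Compliance Operator, and the role detection via node-role.kubernetes.io labels is an assumption here:

# Sketch only: confirm that nodes carry topology.kubernetes.io/zone and
# that each role spans more than one zone. Assumes kubeconfig access and
# the usual node-role.kubernetes.io/* role labels.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

zones_by_role = defaultdict(set)
for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    role = "master" if "node-role.kubernetes.io/master" in labels else "worker"
    zone = labels.get("topology.kubernetes.io/zone")
    if zone is None:
        print(f"FAIL: node {node.metadata.name} has no zone label")
    else:
        zones_by_role[role].add(zone)

for role, zones in sorted(zones_by_role.items()):
    status = "PASS" if len(zones) > 1 else "FAIL"
    print(f"{status}: {role} nodes span zones {sorted(zones)}")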

sluetze commented Jan 5, 2024

Additionally, we might check whether there are multiple masters/workers; missing masters are quite surely an indicator of missing distribution.

While checking masters might be easy, checking the workers might be difficult, because a user could have several node types. Maybe we could check, for each machineconfigset, whether the number of selected nodes is higher than 1? A sketch of that idea follows below.

I cannot identify any checks for this in the upstream.
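
A rough sketch of the node-count idea, again assuming the Python kubernetes client; counting per machineconfigset/pool would require the OpenShift custom resources and is left out here, so this only counts nodes per node-role label:

# Sketch only: fail if any node role is backed by fewer than two nodes,
# which would indicate missing distribution across sites/zones.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

role_counts = Counter()
for node in v1.list_node().items:
    for label in (node.metadata.labels or {}):
        if label.startswith("node-role.kubernetes.io/"):
            role_counts[label.split("/", 1)[1]] += 1

for role, count in sorted(role_counts.items()):
    status = "PASS" if count > 1 else "FAIL"
    print(f"{status}: {count} node(s) with role '{role}'")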

ermeratos added the new-rules (Issue which requires us to write new rules) label and removed the org-only (This Requirement of BSI is ONLY an organizational Requirement) and not-checkable (Requirement can not be checked with Compliance Operator) labels on Jan 30, 2024
benruland self-assigned this on Mar 6, 2024
benruland moved this from Evaluation to Implementation in sig-bsi-grundschutz tracking on Mar 6, 2024
benruland commented

Ongoing implementation in ComplianceAsCode#11659

sluetze moved this from Implementation to Upstream PR in sig-bsi-grundschutz tracking on Mar 18, 2024
benruland commented Jul 15, 2024

I am unsure whether to include a rule that checks deployments and statefulsets for whether their pods are spread across nodes or zones using anti-affinity and/or topologySpreadConstraints.

While it is technically possible (I have implemented it), it produces a long list of findings, e.g.:

[
  "ansible-automation-platform/aap001-hub-content",
  "ansible-automation-platform/aap001-hub-worker",
  "argocd/argocd-dex-server",
  "argocd/argocd-redis",
  "argocd/argocd-repo-server",
  "argocd/argocd-server",
  "app-x/app-x-worker",
  "iwo-collector/iwo-k8s-collector-cisco-intersight",
  "nextcloud/nextcloud-operator-controller-manager",
  "openshift-apiserver-operator/openshift-apiserver-operator",
  "openshift-authentication-operator/authentication-operator",
  "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator",
  "openshift-cloud-credential-operator/cloud-credential-operator",
  "openshift-cluster-machine-approver/machine-approver",
  "openshift-cluster-node-tuning-operator/cluster-node-tuning-operator",
  "openshift-cluster-samples-operator/cluster-samples-operator",
  "openshift-cluster-storage-operator/cluster-storage-operator",
  "openshift-cluster-storage-operator/csi-snapshot-controller-operator",
  "openshift-cluster-version/cluster-version-operator",
  "openshift-compliance/compliance-operator",
  "openshift-compliance/ocp4-openshift-compliance-pp",
  "openshift-compliance/rhcos4-openshift-compliance-pp",
  "openshift-compliance/upstream-ocp4-bsi-node-master-rs",
  "openshift-compliance/upstream-ocp4-bsi-node-worker-rs",
  "openshift-compliance/upstream-ocp4-bsi-rs",
  "openshift-compliance/upstream-ocp4-openshift-compliance-pp",
  "openshift-compliance/upstream-rhcos4-bsi-master-rs",
  "openshift-compliance/upstream-rhcos4-bsi-worker-rs",
  "openshift-compliance/upstream-rhcos4-openshift-compliance-pp",
  "openshift-config-operator/openshift-config-operator",
  "openshift-console-operator/console-operator",
  "openshift-controller-manager-operator/openshift-controller-manager-operator",
  "openshift-dns-operator/dns-operator",
  "openshift-etcd-operator/etcd-operator",
  "openshift-gitops/cluster",
  "openshift-gitops/kam",
  "openshift-image-registry/cluster-image-registry-operator",
  "openshift-ingress-operator/ingress-operator",
  "openshift-insights/insights-operator",
  "openshift-kube-apiserver-operator/kube-apiserver-operator",
  "openshift-kube-controller-manager-operator/kube-controller-manager-operator",
  "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator",
  "openshift-kube-storage-version-migrator-operator/kube-storage-version-migrator-operator",
  "openshift-kube-storage-version-migrator/migrator",
  "openshift-machine-api/cluster-autoscaler-operator",
  "openshift-machine-api/cluster-baremetal-operator",
  "openshift-machine-api/control-plane-machine-set-operator",
  "openshift-machine-api/machine-api-operator",
  "openshift-machine-config-operator/machine-config-controller",
  "openshift-machine-config-operator/machine-config-operator",
  "openshift-marketplace/marketplace-operator",
  "openshift-monitoring/cluster-monitoring-operator",
  "openshift-monitoring/kube-state-metrics",
  "openshift-monitoring/openshift-state-metrics",
  "openshift-monitoring/prometheus-operator",
  "openshift-monitoring/telemeter-client",
  "openshift-multus/multus-admission-controller",
  "openshift-network-diagnostics/network-check-source",
  "openshift-operator-lifecycle-manager/catalog-operator",
  "openshift-operator-lifecycle-manager/olm-operator",
  "openshift-operator-lifecycle-manager/package-server-manager",
  "openshift-operators/gitlab-runner-gitlab-runnercontroller-manager",
  "openshift-operators/gitops-operator-controller-manager",
  "openshift-operators/pgo",
  "openshift-service-ca-operator/service-ca-operator",
  "openshift-service-ca/service-ca",
  "redhat-ods-applications/data-science-pipelines-operator-controller-manager",
  "redhat-ods-applications/etcd",
  "redhat-ods-applications/notebook-controller-deployment",
  "redhat-ods-applications/odh-notebook-controller-manager",
  "redhat-ods-operator/rhods-operator",
  "trident/trident-controller",
  "trident/trident-operator"
]

When filtering for deployments that have more than one replica, I get:

[
  "ansible-automation-platform/aap001-hub-content",
  "ansible-automation-platform/aap001-hub-worker",
  "argocd/argocd-repo-server",
  "argocd/argocd-server",
  "app-x/app-x-worker",
  "iwo-collector/iwo-k8s-collector-cisco-intersight",
  "openshift-multus/multus-admission-controller"
]

I believe that for many deployments it is perfectly valid not to configure high availability, because restarts are sufficient...
Making the exclusions configurable is possible but will likely be painful.

Need input @ermeratos @sluetze! Options I see:
a) Do not include a rule at all
b) Only consider deployments that have more than one replica -> those are intended for HA and should hence be spread evenly
c) Consider all deployments and statefulsets, and make the exclusions configurable

-> I have implemented variant b) with configurable exclusions (c) for now; a sketch of that logic follows below.
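
A minimal sketch of the variant b) + c) logic, assuming the Python kubernetes client and a hypothetical EXCLUDED set standing in for the configurable exclusion variable; the real rule is written against the API resources collected by the Compliance Operator, so this only illustrates the filtering:

# Sketch only: report deployments with more than one replica that define
# neither pod anti-affinity nor topologySpreadConstraints.
# EXCLUDED is a hypothetical stand-in for the configurable exclusion list.
from kubernetes import client, config

EXCLUDED = {"openshift-multus/multus-admission-controller"}  # example entry

config.load_kube_config()
apps = client.AppsV1Api()

for deploy in apps.list_deployment_for_all_namespaces().items:
    name = f"{deploy.metadata.namespace}/{deploy.metadata.name}"
    if name in EXCLUDED or (deploy.spec.replicas or 0) <= 1:
        continue
    pod_spec = deploy.spec.template.spec
    has_anti_affinity = bool(pod_spec.affinity and pod_spec.affinity.pod_anti_affinity)
    has_spread = bool(pod_spec.topology_spread_constraints)
    if not (has_anti_affinity or has_spread):
        print(f"FAIL: {name} ({deploy.spec.replicas} replicas) has no "
              "anti-affinity or topologySpreadConstraints")

StatefulSets could be covered analogously via list_stateful_set_for_all_namespaces().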

sluetze commented Jul 15, 2024

As our customers tend to prefer having a rule rather than not having one (they can tailor it out at any time), and you have already done the implementation work, I would go with b + c. Configurable exclusions seem to be necessary for such rules, as we have had several occurrences of hard-coded exclusions that needed to become configurable afterwards.

benruland commented

During rebasing, I accidentally closed the previous PR. For better reviewability, I created a new PR: ComplianceAsCode#12155
