<!--
Copyright The Shipwright Contributors

SPDX-License-Identifier: Apache-2.0
-->

---
title: build-scheduler-options
authors:
  - "@adambkaplan"
reviewers:
  - "@apoorvajagtap"
  - "@HeavyWombat"
approvers:
  - "@qu1queee"
  - "@SaschaSchwarze0"
creation-date: 2024-05-15
last-updated: 2024-06-20
status: Implementable
see-also: []
replaces: []
superseded-by: []
---

# Build Scheduler Options

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [docs](/docs/)

## Open Questions [optional]

- Should this feature always be enabled? Should we consider an alpha -> beta lifecycle for it
  (ex: off by default -> on by default)?

## Summary

Add API options that influence where `BuildRun` pods are scheduled on Kubernetes. This can be
accomplished through the following mechanisms:

- [Node Selectors](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector)
- [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
- [Custom Schedulers](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/)

## Motivation

Today, `BuildRun` pods run on arbitrary nodes - developers, platform engineers, and admins do
not have the ability to control where a specific build pod will be scheduled. Teams may have
several motivations for controlling where a build pod is scheduled:

- Builds can be CPU/memory/storage intensive. Scheduling on larger worker nodes with additional
  memory or compute can help ensure builds succeed.
- Clusters may have multiple worker node architectures and even multiple operating systems
  (ex: Windows nodes). Container images are by their nature specific to the OS and CPU
  architecture, and default to the host operating system and architecture. Builds may need to
  specify OS and architecture through node selectors.
- The default Kubernetes scheduler may not efficiently schedule build workloads - especially
  considering how Tekton implements step containers and sidecars. A custom scheduler optimized for
  Tekton or other batch workloads may lead to better cluster utilization.

### Goals

- Allow build pods to run on specific nodes using node selectors.
- Allow build pods to tolerate node taints.
- Allow build pods to use a custom scheduler.

### Non-Goals

- Primary feature support for multi-arch builds.
- Allow node selection, pod affinity, and taint toleration to be set at the cluster level.
  While this may be desirable, it requires a more sophisticated means of configuring the build
  controller. Setting default values for scheduling options can be considered as a follow-up
  feature.
- Prevent use of build pod scheduling fields. This is best left to an admission controller like
  [OPA Gatekeeper](https://www.openpolicyagent.org/docs/latest/kubernetes-introduction/) or
  [Kyverno](https://kyverno.io/).
- Allow build pods to set node affinity/anti-affinity rules. Affinity/anti-affinity is an
  incredibly rich and complex API (see [docs](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
  for more information). We should strive to provide a simpler interface that is tailored
  specifically to builds. This feature is being dropped to narrow the scope of this SHIP. Build
  affinity rules can/should be addressed in a follow-up feature.

## Proposal

### User Stories

#### Node Selection - platform engineer

As a platform engineer, I want builds to use node selectors to ensure they are scheduled on nodes
optimized for builds, so that builds are more likely to succeed.

#### Node Selection - arch-specific images

As a developer, I want to select the OS and architecture of my build's node so that I can run
builds on clusters with worker nodes of multiple architectures.

#### Taint toleration - cluster admin

As a cluster admin, I want builds to be able to tolerate provided node taints so that they can
be scheduled on nodes that are not suitable/designated for application workloads.

#### Custom Scheduler

As a platform engineer/cluster admin, I want builds to use a custom scheduler so that I can provide
my own scheduler that is optimized for my build workloads.

### Implementation Notes

#### API Updates
The `BuildSpec` API for Build and BuildRun will be updated to add the following fields:

```yaml
spec:
  ...
  nodeSelector: # map[string]string
    <node-label>: "label-value"
  tolerations: # []Toleration
    - key: "taint-key"
      operator: Exists|Equal
      value: "taint-value"
  schedulerName: "custom-scheduler-name" # string
```

The `nodeSelector` and `schedulerName` fields will use golang primitives that match their k8s
equivalents.
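
For illustration, a `Build` that targets arm64 worker nodes and a custom scheduler might look like
the following. The object name, source, strategy, node label, taint key, and scheduler name are all
hypothetical; only the scheduling fields are proposed by this SHIP:

```yaml
apiVersion: shipwright.io/v1beta1
kind: Build
metadata:
  name: sample-go-build
spec:
  source:
    type: Git
    git:
      url: https://github.com/shipwright-io/sample-go
  strategy:
    name: buildah
    kind: ClusterBuildStrategy
  output:
    image: registry.example.com/sample-go:latest
  # Proposed scheduling fields:
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "builds-only"
      operator: "Exists"
  schedulerName: "batch-scheduler"
```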

#### Tolerations

The Tolerations API for Shipwright will support a limited subset of the upstream Kubernetes
Tolerations API. For simplicity, any Shipwright Build or BuildRun with a toleration set will use
the `NoSchedule` [taint effect](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

```yaml
spec:
  tolerations: # Optional array
    - key: "taint-key" # Aligns with upstream k8s taint keys. Required.
      operator: Exists|Equal # Aligns with upstream k8s - key exists, or taint key = value. Required.
      value: "taint-value" # Aligns with upstream k8s taint values. Optional.
```
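
For illustration, a toleration from the snippet above would surface on the build pod with the
`NoSchedule` effect filled in. A sketch of the intended mapping:

```yaml
# Toleration as rendered into the BuildRun pod spec (illustrative):
tolerations:
  - key: "taint-key"
    operator: "Equal"
    value: "taint-value"
    effect: "NoSchedule" # implied for all Shipwright tolerations per this proposal
```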

As with upstream k8s, the Shipwright Tolerations API array should support
[strategic merge JSON patching](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/#notes-on-the-strategic-merge-patch).

#### Precedence Ordering and Value Merging

Values in a `BuildRun` will override those in the referenced `Build` object (if present). Values
for `nodeSelector` and `tolerations` should use strategic merge logic when possible:

- `nodeSelector` merges using map keys. If the same key is present in both the `Build` and the
  `BuildRun`, the `BuildRun` value wins.
- `tolerations` merges using the taint key. If the same taint key is present in both the `Build`
  and the `BuildRun`, the `BuildRun` toleration wins.

This allows the `BuildRun` object to "inherit" values from a parent `Build` object.
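
A sketch of this inheritance, using hypothetical labels and taint keys:

```yaml
# Build
spec:
  nodeSelector:
    kubernetes.io/arch: amd64
    disktype: ssd
  tolerations:
    - key: "builds-only"
      operator: "Exists"
---
# BuildRun referencing the Build above
spec:
  nodeSelector:
    kubernetes.io/arch: arm64 # overrides the Build's value for this key

# Effective scheduling values for the BuildRun pod:
#   nodeSelector: {kubernetes.io/arch: arm64, disktype: ssd}
#   tolerations:  [{key: builds-only, operator: Exists}]
```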

#### Impact on Tekton TaskRun

Tekton supports tuning the pod of a `TaskRun` through the
[podTemplate](https://tekton.dev/docs/pipelines/taskruns/#specifying-a-pod-template) field. When
Shipwright creates the `TaskRun` for a build, the respective node selector, tolerations, and
scheduler name can be passed through.
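
For example, a generated `TaskRun` could carry the values through its pod template as follows
(the object name and values are hypothetical):

```yaml
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: sample-go-buildrun-abc12 # generated by the build controller
spec:
  podTemplate:
    nodeSelector:
      kubernetes.io/arch: arm64
    tolerations:
      - key: "builds-only"
        operator: "Exists"
        effect: "NoSchedule"
    schedulerName: "batch-scheduler"
```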

#### Command Line Enhancements

The `shp` CLI _may_ be enhanced to add flags that set the node selector, tolerations, and custom
scheduler for a `BuildRun`. For example, `shp build run` could gain the following new options:

- `--node=<key>=<value>`: Add the node label key/value pair to the node selector. Can be set more
  than once for multiple key/value pairs.
- `--tolerate=<key>` or `--tolerate=<key>=<value>`: Tolerate the taint key, in one of two ways:
  - First form: the taint key `Exists`.
  - Second form: the taint key must be `Equal` to the provided value.
  - In both cases, this flag can be set more than once.
- `--scheduler=<name>`: Use the custom scheduler with the given name. Can only be set once.
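
Hypothetical usage, assuming the flags above are implemented and a `Build` named `sample-go-build`
exists:

```sh
shp build run sample-go-build \
  --node=kubernetes.io/arch=arm64 \
  --tolerate=builds-only \
  --scheduler=batch-scheduler
```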

#### Hardening Guidelines

Exposing `nodeSelector` and `tolerations` to end developers adds risk with respect to overall
system availability. Some platform teams may not want these Kubernetes internals exposed or
modifiable by end developers at all. To address these concerns, a hardening guideline for
Shipwright Builds should be published alongside the documentation for this feature. This guideline
should recommend the use of third-party admission controllers (ex: OPA Gatekeeper, Kyverno) to
prevent builds from using values that impact system availability and performance. For example:

- Block toleration of `node.kubernetes.io/*` taints. These are reserved for nodes that are not
  ready to receive workloads for scheduling.
- Block node selectors with the `node-role.kubernetes.io/control-plane` label key. This is reserved
  for control plane components (`kube-apiserver`, `kube-controller-manager`, etc.).
- Block toleration of the `node-role.kubernetes.io/control-plane` taint key. Same as above.
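
A minimal Kyverno sketch of the second rule above, assuming the Shipwright `Build`/`BuildRun`
kinds and Kyverno's pattern anchors (illustrative only, not a tested policy):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-build-scheduling
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-control-plane-node-selector
      match:
        any:
          - resources:
              kinds:
                - Build
                - BuildRun
      validate:
        message: "Builds may not target control plane nodes."
        pattern:
          spec:
            # If a nodeSelector is set, it must not contain the control plane key.
            =(nodeSelector):
              X(node-role.kubernetes.io/control-plane): "*"
```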

See the [well-known labels](https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane)
documentation for more information.

### Test Plan

- Unit testing can verify that the generated `TaskRun` object for a build contains the desired pod
  template fields.
- End to end tests using `KinD` are possible for the `nodeSelector` and `tolerations` fields:
  - KinD has support for configuring multiple [nodes](https://kind.sigs.k8s.io/docs/user/configuration/#nodes).
  - Once set up, KinD nodes can simulate real nodes when it comes to pod scheduling, node labeling,
    and node taints. See the sample KinD configuration after this list.
- End to end testing for the `schedulerName` field requires the deployment of a custom scheduler,
  plus code to verify that the given scheduler was used. This is non-trivial (see the
  [upstream example](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#specify-schedulers-for-pods))
  and adds a potential failure point to the test suite. Relying on unit testing alone is our best
  option here.
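
A sketch of a KinD configuration that would support these tests (the node label and taint are
hypothetical):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker # ordinary worker node
  - role: worker # labeled and tainted "build" node
    labels:
      build-node: "true"
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            register-with-taints: "builds-only=true:NoSchedule"
```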

### Release Criteria

TBD

**Note:** *Section not required until targeted at a release.*

#### Removing a deprecated feature [if necessary]

Not applicable.

#### Upgrade Strategy [if necessary]

The top-level API fields will be optional and default to their Go zero values.
On upgrade, these values will remain empty on existing `Build`/`BuildRun` objects.

### Risks and Mitigations

**Risk:** The node selector field allows disruptive workloads (builds) to be scheduled on control
plane nodes.

**Mitigation:** A hardening guideline is added as a requirement for this feature. There may be some
cluster topologies (ex: single node clusters) where scheduling builds on the "control plane" is not
only desirable, but necessary. Hardening guidelines referencing third-party admission controllers
preserve flexibility while giving cluster administrators/platform teams the knowledge needed to
harden their environments as they see fit.

## Drawbacks

Exposing these fields leaks - to a certain extent - our abstraction over Kubernetes. This proposal
places k8s pod scheduling fields up front in the API for `Build` and `BuildRun`, a deviation from
Tekton, which exposes the fields through a `PodTemplate` sub-field. Cluster administrators may not
want end developers to have control over where these pods are scheduled - they may instead wish to
control pod scheduling through Tekton's
[default pod template](https://github.com/tektoncd/pipeline/blob/main/docs/podtemplates.md#supported-fields)
mechanism at the controller level.

Exposing `nodeSelector` may also conflict with future enhancements to support
[multi-architecture image builds](https://github.com/shipwright-io/build/issues/1119). A
hypothetical build that fans out individual image builds to nodes with desired OS/architecture
pairs may need to explicitly set the `kubernetes.io/os` and `kubernetes.io/arch` node selector
labels on generated `TaskRuns`. That said, there is currently no mechanism for Shipwright to
control where builds execute on clusters with multiple worker node architectures and operating
systems.

## Alternatives

An earlier draft of this proposal included `affinity` for setting pod affinity/anti-affinity rules.
This was rejected due to the complexities of Kubernetes pod affinity and anti-affinity. We need
more concrete user stories from the community to understand what - if anything - we should do with
respect to distributing build workloads through affinity rules. This may also conflict with
Tekton's [affinity assistant](https://tekton.dev/docs/pipelines/affinityassistants/) feature - an
optional configuration that is enabled by default in upstream Tekton.

An earlier draft also included the ability to set default values for these fields at the cluster
level. This would be similar to what Tekton provides through its Pipeline controller configuration.
Since this option is available at the Tekton pipeline level, adding nearly identical features to
Shipwright is being deferred. Tuning pod template values with the Tekton pipeline controller may
also be an acceptable alternative to this feature in some circumstances.

## Infrastructure Needed [optional]

No additional infrastructure is anticipated.
Test KinD clusters may need to be deployed with additional nodes where these features can be
verified.

## Implementation History

- 2024-05-15: Created as `provisional`
- 2024-06-20: Draft updated to `implementable`