KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs #2313

akshaychitneni · 2024-10-28T17:51:26Z

What this PR does / why we need it:

Add CEL validations on the runtime CRDs for the v2 APIs.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Docs included if any changes are user facing

coveralls · 2024-10-28T17:56:24Z

Pull Request Test Coverage Report for Build 11560677304

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 11542660312:	0.0%
Covered Lines:	77
Relevant Lines:	77

💛 - Coveralls

andreyvelich · 2024-10-28T17:58:13Z

Fixes: #2219

andreyvelich

Thanks for this @akshaychitneni !
I left a few comments.

pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go

andreyvelich · 2024-11-26T18:58:56Z

pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go

@@ -173,6 +176,8 @@ type TorchMLPolicySource struct {
 	// Supported values: `auto`, `cpu`, `gpu`, or int value.
 	// TODO (andreyvelich): Add kubebuilder validation.
 	// Defaults to `auto`.
+	// +kubebuilder:default="auto"
+	// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be auto,cpu,gpu strings or int value"
 	NumProcPerNode *string `json:"numProcPerNode,omitempty"`


@tenzen-y @akshaychitneni Should we use intstr.IntOrString for NumProcPerNode here?
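
For reference, the intent of the XValidation rule above (accept `auto`, `cpu`, `gpu`, or an integer) can be sketched in plain Go. This is only an illustration: `validateNumProcPerNode` is a hypothetical helper, not code from this PR, and it treats the value as a string, as the current `*string` field does.

```go
package main

import (
	"fmt"
	"strconv"
)

// validateNumProcPerNode is a hypothetical helper that mirrors the intent of
// the CEL rule: the value must be "auto", "cpu", "gpu", or an integer.
func validateNumProcPerNode(v string) error {
	switch v {
	case "auto", "cpu", "gpu":
		return nil
	}
	// Anything else must parse as an integer.
	if _, err := strconv.Atoi(v); err != nil {
		return fmt.Errorf("numProcPerNode must be auto, cpu, gpu, or an int value, got %q", v)
	}
	return nil
}

func main() {
	for _, v := range []string{"auto", "8", "tpu"} {
		fmt.Printf("%q -> %v\n", v, validateNumProcPerNode(v))
	}
}
```

On a `*string` field, `self` is most likely always a string in CEL, so the `type(self) == int` branch of the rule would not match a value such as `"8"`. That is one reason the field type matters here.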

andreyvelich · 2024-11-26T19:16:40Z

pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go

@@ -209,13 +214,15 @@ type MPIMLPolicySource struct {

 	// Implementation name for the MPI to create the appropriate hostfile.
 	// Defaults to OpenMPI.
+	// +kubebuilder:default="OpenMPI"
 	MPIImplementation *MPIImplementation `json:"mpiImplementation,omitempty"`

 	// Directory where SSH keys are mounted.
 	SSHAuthMountPath *string `json:"SSHAuthMountPath,omitempty"`


@tenzen-y @alculquicondor Do we have a default directory for SSH keys?
I can see it here: https://github.com/kubeflow/mpi-operator/blob/master/pkg/apis/kubeflow/v2beta1/types.go#L190-L191

test/integration/controller.v2/trainingruntime_controller_test.go

andreyvelich · 2025-02-12T23:17:59Z

@akshaychitneni You need to run go mod tidy to fix the unit tests.

andreyvelich

Thank you @akshaychitneni!
I think we are almost ready to merge it; just a few comments.

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

andreyvelich · 2025-02-12T23:22:29Z

pkg/runtime/framework/plugins/torch/torch.go

+	if numProcPerNode.Type == intstr.Int {
+		numProcPerNodeVal = strconv.Itoa(int(numProcPerNode.IntVal))
+	} else {
+		numProcPerNodeVal = numProcPerNode.StrVal
+	}


Should we just always take StrVal from numProcPerNode here, since we assign it to an env var afterwards?
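
To illustrate why the branch in the snippet exists, here is a minimal sketch. `Type` and `IntOrString` below are simplified local stand-ins for the types in k8s.io/apimachinery/pkg/util/intstr, declared here only so the example runs without external modules; `envValue` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strconv"
)

// Type and IntOrString are simplified stand-ins for the upstream intstr types.
type Type int

const (
	Int Type = iota
	String
)

type IntOrString struct {
	Type   Type
	IntVal int32
	StrVal string
}

// envValue returns the string that would be written to an env var.
// StrVal is empty when Type is Int, so the integer branch is needed.
func envValue(v IntOrString) string {
	if v.Type == Int {
		return strconv.Itoa(int(v.IntVal))
	}
	return v.StrVal
}

func main() {
	fmt.Println(envValue(IntOrString{Type: Int, IntVal: 4}))         // prints 4
	fmt.Println(envValue(IntOrString{Type: String, StrVal: "auto"})) // prints auto
}
```

The upstream type also provides a `String()` method that performs the same conversion, which might be an alternative to the explicit branch.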

andreyvelich · 2025-02-12T23:25:07Z

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

-	NumProcPerNode *string `json:"numProcPerNode,omitempty"`
+	// +kubebuilder:default="auto"
+	// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be equal to auto, cpu, gpu, or int value"
+	NumProcPerNode *intstr.IntOrString `json:"numProcPerNode,omitempty"`


@akshaychitneni @astefanutti @kubeflow/wg-training-leads @tenzen-y @Electronic-Waste In general, what are the benefits for the end user of having IntOrString over string for the NumProcPerNode API?
E.g. we can always validate that this value is an int if the selected TrainingRuntime doesn't support string values.

cc @kannon92 for feedback here.

default is set to auto though?

Not really, since mpirun doesn't support auto for OpenMPI:
https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php#toc9

E.g. in MPI slot == proc
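
The idea of validating per runtime could look roughly like the sketch below. `validateForMPI` is a hypothetical name, and rejecting values below 1 is an assumption, not something stated in this thread.

```go
package main

import (
	"fmt"
	"strconv"
)

// validateForMPI is a hypothetical runtime-specific check. It accepts only
// integer values, since mpirun has no equivalent of "auto" for OpenMPI.
func validateForMPI(numProcPerNode string) error {
	n, err := strconv.Atoi(numProcPerNode)
	if err != nil {
		return fmt.Errorf("MPI runtime needs an int numProcPerNode, got %q", numProcPerNode)
	}
	// Assumption: at least one slot per node is required.
	if n < 1 {
		return fmt.Errorf("numProcPerNode must be >= 1, got %d", n)
	}
	return nil
}

func main() {
	fmt.Println(validateForMPI("4"))
	fmt.Println(validateForMPI("auto"))
}
```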

From the OpenAPI perspective, do we see any problems with keeping the IntOrString type?

I fully agree with using *intstr.IntOrString, since it is more strongly typed.
@andreyvelich It seems that only you have objections to intstr.IntOrString, maybe?

Actually, I asked about whether this type is ok to use from the API perspective (e.g. be compatible with OpenAPI and other formats).
For example, if someone is building API Server on top of TrainingRuntimes, it might be hard to implement IntOrString type since it is not a default type.

@tenzen-y Don't you see any concerns with that ?
Does SIG Architecture in k8s have any recommendations for it ?
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

For example, if someone is building API Server on top of TrainingRuntimes, it might be hard to implement IntOrString type since it is not a default type.

What exactly does "API Server" mean here? Something like a web backend server, or something else?
In that case, I would typically recommend defining a protocol buffer API and converting the CRD to that internal API.

Basically, it would be better not to expose the CRD directly as the internal API.

@tenzen-y Don't you see any concerns with that ?
Does SIG Architecture in k8s have any recommendations for it ?
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

Actually, IntOrString is maintained by core Kubernetes, so using that type is the recommended way.
If they abandoned IntOrString, Deployment, EndpointSlice, and other objects would break, since they depend on that type.
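
On the concern about consumers outside Kubernetes: an int-or-string field can be decoded with a small custom JSON unmarshaler. The sketch below uses only the standard library; `NumProc`, `torchPolicy`, and `decode` are hypothetical names for illustration, not types from this PR or from apimachinery.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NumProc is a hypothetical wrapper for a JSON value that is either an
// integer or a string.
type NumProc struct {
	IsInt bool
	Int   int32
	Str   string
}

// UnmarshalJSON picks the branch by checking whether the raw value is quoted.
func (n *NumProc) UnmarshalJSON(data []byte) error {
	if len(data) > 0 && data[0] == '"' {
		n.IsInt = false
		return json.Unmarshal(data, &n.Str)
	}
	n.IsInt = true
	return json.Unmarshal(data, &n.Int)
}

// torchPolicy is a hypothetical, cut-down stand-in for the API struct.
type torchPolicy struct {
	NumProcPerNode *NumProc `json:"numProcPerNode,omitempty"`
}

// decode parses a JSON document into torchPolicy.
func decode(raw string) (torchPolicy, error) {
	var p torchPolicy
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

func main() {
	for _, raw := range []string{`{"numProcPerNode": 4}`, `{"numProcPerNode": "auto"}`} {
		p, err := decode(raw)
		fmt.Printf("%+v %v\n", *p.NumProcPerNode, err)
	}
}
```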

test/integration/webhooks/clustertrainingruntime_test.go

test/integration/webhooks/trainingruntime_test.go

tenzen-y · 2025-02-13T11:19:49Z

Sorry for the delayed response. Let me review this PR again.

tenzen-y

Basically, lgtm

pkg/runtime/core/trainingruntime_test.go

pkg/runtime/framework/plugins/torch/torch.go

tenzen-y · 2025-02-14T18:15:50Z

sdk/python/test-requirements.txt

Which changes brought in the SDK changes?

This seems to be autogenerated. cc @andreyvelich

Actually, it should be removed according to these lines: https://github.com/kubeflow/trainer/blob/master/hack/python-sdk/gen-sdk.sh#L54-L55
Interesting that it didn't work for you.

test/integration/webhooks/clustertrainingruntime_webhook_test.go

tenzen-y · 2025-02-14T18:18:15Z

test/integration/webhooks/trainingruntime_webhook_test.go

test/integration/webhooks/trainingruntime_webhook_test.go? The package name already declares that these are webhook tests.

Should we use the original name?

@tenzen-y What do you mean?
Name it like this?

trainingruntime_test.go

I just tried to be consistent with what we have in JobSet and Kueue:
https://github.com/kubernetes-sigs/kueue/tree/main/test/integration/singlecluster/webhook/jobs

Hmm, actually, those are unintended...
Alright, let me drop this request.

test/integration/webhooks/trainingruntime_webhook_test.go

tenzen-y · 2025-02-14T18:26:13Z

@akshaychitneni Running make manifests could resolve the current CI errors.

andreyvelich · 2025-02-14T18:52:41Z

/lgtm
We can follow up on the remaining changes in subsequent PRs.
Thank you for the contribution @akshaychitneni!

andreyvelich · 2025-02-14T18:52:47Z

/assign @tenzen-y

Signed-off-by: Akshay Chitneni <[email protected]>

tenzen-y

Thank you, and sorry for the delayed review!
I also added my first comments to your other webhook validation PR.
If you can address those, I would appreciate it!

/lgtm
/approve

google-oss-prow · 2025-02-14T19:15:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y · 2025-02-14T19:15:33Z

/hold for CI

tenzen-y · 2025-02-14T19:16:01Z

Once all CI checks have completed, feel free to merge this PR with a /hold cancel comment.

tenzen-y · 2025-02-14T19:40:12Z

Ready now.
Thank you, again!
/hold cancel

google-oss-prow bot requested review from jinchihe and kuizhiqing October 28, 2024 17:51

google-oss-prow bot added the size/L label Oct 28, 2024

akshaychitneni force-pushed the runtimecel branch from 51408e5 to 0e9654d Compare October 28, 2024 17:52

akshaychitneni force-pushed the runtimecel branch from 0e9654d to 1c323fb Compare October 28, 2024 18:51

akshaychitneni mentioned this pull request Nov 11, 2024

KEP-2170: Adding validation webhook for v2 trainjob #2307

Open

1 task

akshaychitneni force-pushed the runtimecel branch from 1c323fb to d023258 Compare November 11, 2024 19:18

andreyvelich reviewed Nov 26, 2024

andreyvelich reviewed Jan 2, 2025

test/integration/controller.v2/trainingruntime_controller_test.go (outdated)

akshaychitneni force-pushed the runtimecel branch 4 times, most recently from b1a8e9b to 1736070 Compare February 3, 2025 21:49

akshaychitneni force-pushed the runtimecel branch from 1736070 to 519c251 Compare February 12, 2025 15:47

andreyvelich reviewed Feb 12, 2025

andreyvelich mentioned this pull request Feb 13, 2025

KEP-2170: Add validation to Torch numProcPerNode field #2409

Merged

1 task

akshaychitneni force-pushed the runtimecel branch 5 times, most recently from b9ec5ff to 7dc735c Compare February 14, 2025 18:17

tenzen-y reviewed Feb 14, 2025

akshaychitneni force-pushed the runtimecel branch from 7dc735c to 96092d3 Compare February 14, 2025 18:37

google-oss-prow bot assigned andreyvelich Feb 14, 2025

google-oss-prow bot added the lgtm label Feb 14, 2025

google-oss-prow bot assigned tenzen-y Feb 14, 2025

akshaychitneni force-pushed the runtimecel branch from 96092d3 to a263f57 Compare February 14, 2025 19:00

google-oss-prow bot removed the lgtm label Feb 14, 2025

Adding cel validation on trainingRuntime CRD

4836eb5

Signed-off-by: Akshay Chitneni <[email protected]>

akshaychitneni force-pushed the runtimecel branch from a263f57 to 4836eb5 Compare February 14, 2025 19:08

tenzen-y reviewed Feb 14, 2025

google-oss-prow bot added the lgtm label Feb 14, 2025

google-oss-prow bot added the approved label Feb 14, 2025

google-oss-prow bot added the do-not-merge/hold label Feb 14, 2025

google-oss-prow bot removed the do-not-merge/hold label Feb 14, 2025

google-oss-prow bot merged commit 9b3b1de into kubeflow:master Feb 14, 2025
14 checks passed

andreyvelich mentioned this pull request Feb 14, 2025

KEP-2170: Implement validations for TrainingRuntime and ClusterTrainingRuntime #2219

Open
