KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs #2313

Merged
merged 1 commit into from
Feb 14, 2025

Conversation

akshaychitneni
Contributor

What this PR does / why we need it:

Add CEL validations on the runtime CRDs for the v2 APIs.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@coveralls
coveralls commented Oct 28, 2024

Pull Request Test Coverage Report for Build 11560677304

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11542660312: 0.0%
Covered Lines: 77
Relevant Lines: 77

💛 - Coveralls

@andreyvelich
Member

Fixes: #2219

Member

@andreyvelich andreyvelich left a comment

Thanks for this @akshaychitneni !
I left a few comments.

pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go (outdated, resolved)
@@ -173,6 +176,8 @@ type TorchMLPolicySource struct {
// Supported values: `auto`, `cpu`, `gpu`, or int value.
// TODO (andreyvelich): Add kubebuilder validation.
// Defaults to `auto`.
// +kubebuilder:default="auto"
// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be auto,cpu,gpu strings or int value"
NumProcPerNode *string `json:"numProcPerNode,omitempty"`
Member

@tenzen-y @akshaychitneni Should we use the intstr type for NumProcPerNode here?

Contributor Author

Updated
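
As a minimal sketch of what the rule in the diff above is meant to accept once the field is an int-or-string value, the Go snippet below mirrors `self in ['auto', 'cpu', 'gpu'] || type(self) == int`. The `IntOrString` type here is a simplified local stand-in for `k8s.io/apimachinery/pkg/util/intstr.IntOrString`, and `validNumProcPerNode` is a hypothetical helper, not code from this PR.

```go
package main

import "fmt"

// Type tags which field of IntOrString is populated.
type Type int

const (
	Int Type = iota
	String
)

// IntOrString is a simplified local stand-in for the apimachinery type
// (assumption: only Type, IntVal and StrVal matter for this sketch).
type IntOrString struct {
	Type   Type
	IntVal int32
	StrVal string
}

// validNumProcPerNode is a hypothetical helper that mirrors the CEL rule:
// any integer is accepted, and strings must be one of the three keywords.
func validNumProcPerNode(v IntOrString) bool {
	if v.Type == Int {
		return true
	}
	switch v.StrVal {
	case "auto", "cpu", "gpu":
		return true
	}
	return false
}

func main() {
	fmt.Println(validNumProcPerNode(IntOrString{Type: Int, IntVal: 4}))
	fmt.Println(validNumProcPerNode(IntOrString{Type: String, StrVal: "gpu"}))
	fmt.Println(validNumProcPerNode(IntOrString{Type: String, StrVal: "tpu"}))
}
```

With a plain string field, the `type(self) == int` branch of the rule could never match, which is one reason an int-or-string type fits this field better.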

@@ -209,13 +214,15 @@ type MPIMLPolicySource struct {

// Implementation name for the MPI to create the appropriate hostfile.
// Defaults to OpenMPI.
// +kubebuilder:default="OpenMPI"
MPIImplementation *MPIImplementation `json:"mpiImplementation,omitempty"`

// Directory where SSH keys are mounted.
SSHAuthMountPath *string `json:"SSHAuthMountPath,omitempty"`

@andreyvelich
Member

@akshaychitneni You need to run go mod tidy to fix the unit tests.

Member

@andreyvelich andreyvelich left a comment

Thank you @akshaychitneni!
I think we are almost ready to merge it; just a few comments.

pkg/apis/trainer/v1alpha1/trainingruntime_types.go (outdated, resolved)
Comment on lines 77 to 81
if numProcPerNode.Type == intstr.Int {
	numProcPerNodeVal = strconv.Itoa(int(numProcPerNode.IntVal))
} else {
	numProcPerNodeVal = numProcPerNode.StrVal
}
Member

Should we just always take StrVal from numProcPerNode here, since we assign it to an env variable afterwards?
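
To illustrate the question above: in an int-or-string value only one of the two fields is populated, so reading `StrVal` alone yields an empty string when the user supplied an integer. The sketch below uses a simplified local stand-in for `intstr.IntOrString` and a hypothetical `envValue` helper that mirrors the branch under review. The real type also has a `String()` method that performs the same conversion.

```go
package main

import (
	"fmt"
	"strconv"
)

// Type tags which field of IntOrString is populated.
type Type int

const (
	Int Type = iota
	String
)

// IntOrString is a simplified local stand-in for the apimachinery type
// (assumption: only Type, IntVal and StrVal matter for this sketch).
type IntOrString struct {
	Type   Type
	IntVal int32
	StrVal string
}

// envValue is a hypothetical helper that renders the value as the string
// that would later be assigned to an environment variable.
func envValue(v IntOrString) string {
	if v.Type == Int {
		return strconv.Itoa(int(v.IntVal))
	}
	return v.StrVal
}

func main() {
	n := IntOrString{Type: Int, IntVal: 8}
	// StrVal is empty for an integer value, so it cannot be used on its own.
	fmt.Printf("StrVal=%q env=%q\n", n.StrVal, envValue(n))

	s := IntOrString{Type: String, StrVal: "auto"}
	fmt.Printf("StrVal=%q env=%q\n", s.StrVal, envValue(s))
}
```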

NumProcPerNode *string `json:"numProcPerNode,omitempty"`
// +kubebuilder:default="auto"
// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be equal to auto, cpu, gpu, or int value"
NumProcPerNode *intstr.IntOrString `json:"numProcPerNode,omitempty"`
Member

@andreyvelich andreyvelich Feb 12, 2025

@akshaychitneni @astefanutti @kubeflow/wg-training-leads @tenzen-y @Electronic-Waste In general, what are the benefits for the end user of having IntOrString over string for the NumProcPerNode API?
E.g. we can always validate that this value is an int if the selected TrainingRuntime doesn't support string values.

Member

cc @kannon92 for feedback here.

Contributor

The default is set to auto, though?

Member

Not really, since mpirun doesn't support auto for OpenMPI:
https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php#toc9

E.g. in MPI slot == proc
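
The runtime-dependent check suggested earlier in this thread could look roughly like the sketch below: integers are always accepted, while the `auto`, `cpu`, and `gpu` keywords are accepted only when the runtime understands them. `validateNumProc` and its `allowKeywords` flag are hypothetical names; how the real webhook decides this is not shown in the conversation.

```go
package main

import "fmt"

// validateNumProc is a hypothetical helper. v is the user-supplied value,
// modelled here as either an int or a string. allowKeywords says whether
// the selected runtime supports the "auto", "cpu" and "gpu" keywords.
func validateNumProc(v any, allowKeywords bool) error {
	switch val := v.(type) {
	case int:
		// Integers are valid for every runtime in this sketch.
		return nil
	case string:
		if allowKeywords {
			switch val {
			case "auto", "cpu", "gpu":
				return nil
			}
		}
		return fmt.Errorf("numProcPerNode %q is not supported by this runtime", val)
	default:
		return fmt.Errorf("numProcPerNode has unsupported type %T", v)
	}
}

func main() {
	fmt.Println(validateNumProc(4, false))      // int, keywords not supported
	fmt.Println(validateNumProc("auto", false)) // keyword, keywords not supported
	fmt.Println(validateNumProc("auto", true))  // keyword, keywords supported
}
```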

Member

From the OpenAPI perspective, do we see any problems with keeping the IntOrString type?

Member

I fully agree with using *intstr.IntOrString for its stronger typing.
@andreyvelich It seems that only you have objections to intstr.IntOrString, maybe?

Member

Actually, I was asking whether this type is OK to use from the API perspective (e.g. whether it is compatible with OpenAPI and other formats).
For example, if someone is building an API server on top of TrainingRuntimes, it might be hard to implement the IntOrString type since it is not a default type.

Member

@tenzen-y Don't you see any concerns with that?
Does SIG Architecture in k8s have any recommendations for it?
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

Member

> For example, if someone is building API Server on top of TrainingRuntimes, it might be hard to implement IntOrString type since it is not a default type.

What exactly does "API Server" mean here? Something like a web backend server, or something else?
In that case, I would typically recommend defining a protocol buffer API and converting the CRD to that internal API.

Basically, it would be better not to expose the CRD directly as an internal API.

Member

> @tenzen-y Don't you see any concerns with that?
> Does SIG Architecture in k8s have any recommendations for it?
> https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

Actually, IntOrString is maintained by core Kubernetes, so using this type is the recommended way.
If they abandoned IntOrString, Deployment, EndpointSlice, and other objects would break, since they depend on this type.

test/integration/webhooks/clustertrainingruntime_test.go (outdated, resolved)
test/integration/webhooks/trainingruntime_test.go (outdated, resolved)
@tenzen-y
Member

Sorry for the delayed response. Let me try to review this PR again.

Member

@tenzen-y tenzen-y left a comment

Basically, lgtm

pkg/runtime/core/trainingruntime_test.go (resolved)
pkg/runtime/framework/plugins/torch/torch.go (outdated, resolved)
Member

Which changes brought in the SDK changes?

Contributor Author

This seems to be autogenerated. cc @andreyvelich

Member

Actually, it should be removed according to these lines: https://github.com/kubeflow/trainer/blob/master/hack/python-sdk/gen-sdk.sh#L54-L55
Interesting that it didn't work for you.

Member

test/integration/webhooks/trainingruntime_webhook_test.go? The package name already declares that these are webhook tests.

Should we use the original name?

Member

@tenzen-y What do you mean?
Should we name it like this?

trainingruntime_test.go

Member

I just tried to be consistent with what we have in JobSet and Kueue:
https://github.com/kubernetes-sigs/kueue/tree/main/test/integration/singlecluster/webhook/jobs

Member

Uhm, actually, those are unintended...
Alright, here let me drop this request.

test/integration/webhooks/trainingruntime_webhook_test.go (outdated, resolved)
@tenzen-y
Member

@akshaychitneni Running make manifests could resolve the current CI errors.

@andreyvelich
Member

/lgtm
We can follow up on the remaining changes in the next PRs.
Thank you for the contribution @akshaychitneni!

@google-oss-prow google-oss-prow bot added the lgtm label Feb 14, 2025
@andreyvelich
Member

/assign @tenzen-y

Member

@tenzen-y tenzen-y left a comment

Thank you, and sorry for the delayed review!
I added my first comments to your other webhook validation PR as well.
If you can address those, I would appreciate it!

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Feb 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member

/hold for CI

@tenzen-y
Member

Once all CI checks have completed, feel free to merge this PR with a /hold cancel comment.

@tenzen-y
Member

Get ready
Thank you, again!
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 9b3b1de into kubeflow:master Feb 14, 2025
14 checks passed