-
Notifications
You must be signed in to change notification settings - Fork 727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs #2313
Conversation
51408e5
to
0e9654d
Compare
Pull Request Test Coverage Report for Build 11560677304Details
💛 - Coveralls |
Fixes: #2219 |
0e9654d
to
1c323fb
Compare
1c323fb
to
d023258
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @akshaychitneni !
I left a few comments.
@@ -173,6 +176,8 @@ type TorchMLPolicySource struct { | |||
// Supported values: `auto`, `cpu`, `gpu`, or int value. | |||
// TODO (andreyvelich): Add kubebuilder validation. | |||
// Defaults to `auto`. | |||
// +kubebuilder:default="auto" | |||
// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be auto,cpu,gpu strings or int value" | |||
NumProcPerNode *string `json:"numProcPerNode,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @akshaychitneni Should we use the intstr
for NumProcPerNode here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -209,13 +214,15 @@ type MPIMLPolicySource struct { | |||
|
|||
// Implementation name for the MPI to create the appropriate hostfile. | |||
// Defaults to OpenMPI. | |||
// +kubebuilder:default="OpenMPI" | |||
MPIImplementation *MPIImplementation `json:"mpiImplementation,omitempty"` | |||
|
|||
// Directory where SSH keys are mounted. | |||
SSHAuthMountPath *string `json:"SSHAuthMountPath,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @alculquicondor Do we have the default directory for SSH.
I can see it here: https://github.com/kubeflow/mpi-operator/blob/master/pkg/apis/kubeflow/v2beta1/types.go#L190-L191
test/integration/controller.v2/trainingruntime_controller_test.go
Outdated
Show resolved
Hide resolved
test/integration/controller.v2/trainingruntime_controller_test.go
Outdated
Show resolved
Hide resolved
b1a8e9b
to
1736070
Compare
1736070
to
519c251
Compare
@akshaychitneni You need to run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @akshaychitneni!
I think, we are almost ready to merge it, just a few comments.
if numProcPerNode.Type == intstr.Int { | ||
numProcPerNodeVal = strconv.Itoa(int(numProcPerNode.IntVal)) | ||
} else { | ||
numProcPerNodeVal = numProcPerNode.StrVal | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should just always take StrVal from numProcPerNode here since we assign it to env after ?
NumProcPerNode *string `json:"numProcPerNode,omitempty"` | ||
// +kubebuilder:default="auto" | ||
// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be equal to auto, cpu, gpu, or int value" | ||
NumProcPerNode *intstr.IntOrString `json:"numProcPerNode,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akshaychitneni @astefanutti @kubeflow/wg-training-leads @tenzen-y @Electronic-Waste In general, what are the benefits for the end-user to have IntOrString
over string
for NumProcPerNode
API ?
E.g. we can always validate that this value is int
if selected TrainingRuntime doesn't support string
values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @kannon92 for feedback here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default is set to auto though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, since mpirun
doesn't support auto for OpenMPI:
https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php#toc9
E.g. in MPI slot == proc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the OpenAPI perspective do we see any problems to keep intOrString
type ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree with using *intstr.IntOrString
better typed power.
@andreyvelich It seems that only you have objections for intstr.IntOrString
, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I asked about whether this type is ok to use from the API perspective (e.g. be compatible with OpenAPI and other formats).
For example, if someone is building API Server on top of TrainingRuntimes, it might be hard to implement IntOrString
type since it is not a default type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Don't you see any concerns with that ?
Does SIG Architecture in k8s have any recommendations for it ?
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if someone is building API Server on top of TrainingRuntimes, it might be hard to implement IntOrString type since it is not a default type.
Here, what exactly mean "APIServer"? something like web backendserver or some else?
In that case, typically I would recommend defining the protocol buffer API to convert CRD to internal API.
Basically, it would be better not to directly expose CRD as internal API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Don't you see any concerns with that ?
Does SIG Architecture in k8s have any recommendations for it ?
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
Actually, IntOrString
is maintained by core kube. So using the typed is recommended way.
If they abondant IntOrString
, Deployment and EndPointSlice or someelse will be broken since the objects depend on the typed.
Sorry for the delayed response. Let me try to review this PR, again |
b9ec5ff
to
7dc735c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, lgtm
sdk/python/test-requirements.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which changes did bring SDK changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seem to be autogenerated. cc @andreyvelich
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it should be removed according to these: https://github.com/kubeflow/trainer/blob/master/hack/python-sdk/gen-sdk.sh#L54-L55
Interesting, why it didn't work for you.
test/integration/webhooks/clustertrainingruntime_webhook_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test/integration/webhooks/trainingruntime_webhook_test.go
? We already declare that this is webhook testings as package name.
Should we use the original name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y What do you mean ?
Name it as this ?
trainingruntime_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried to be consistent with what we have in JobSet and Kueue:
https://github.com/kubernetes-sigs/kueue/tree/main/test/integration/singlecluster/webhook/jobs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, actually, those are unintended...
Alright, here let me drop this request.
@akshaychitneni |
7dc735c
to
96092d3
Compare
/lgtm |
/assign @tenzen-y |
96092d3
to
a263f57
Compare
Signed-off-by: Akshay Chitneni <[email protected]>
a263f57
to
4836eb5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you and sorry for delayed review!
I added my first comments to your another webhook validation PR as well.
If you can address those, I would appreciate that!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold for CI |
If you find all CI completed, feel free to merge this PR with |
Get ready |
What this PR does / why we need it:
Add cel validations on runtime crds for v2 apis
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: