
Bump to k8s 1.31.1 #4759

Open: jcaamano wants to merge 3 commits into master from bump-1.31

Conversation

jcaamano (Contributor) commented Oct 7, 2024

For go-controller:

go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get sigs.k8s.io/[email protected]
go mod vendor && go mod tidy
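
Each go get line above pins one Kubernetes staging module for the bump; the staging repos are tagged v0.31.1 alongside the v1.31.1 release, so each line follows the usual module@version pattern, e.g. (module names shown only as familiar examples, the authoritative list is the one above):

```sh
go get k8s.io/api@v0.31.1
go get k8s.io/client-go@v0.31.1
```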

Fixed API changes
Fixed linting
Updated codegen

For e2e tests:

go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get github.com/ovn-org/ovn-kubernetes/go-controller
go mod edit -replace github.com/coreos/go-iptables=github.com/trozet/[email protected]
go mod tidy

(konnectivity-client is not at 0.31 yet)
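
konnectivity-client here is sigs.k8s.io/apiserver-network-proxy/konnectivity-client, pulled in via k8s.io/apiserver; since it has no v0.31.x tag yet, go mod tidy keeps it on its v0.30 line, leaving a require along these lines (version hypothetical):

```
require sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.30.3 // hypothetical v0.30.x pin
```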

Fixed API changes

jcaamano requested a review from a team as a code owner on October 7, 2024 16:32
github-actions bot added labels kind/documentation, area/unit-testing, area/e2e-testing, feature/services&endpoints, feature/kubevirt-live-migration, feature/egress-qos on Oct 7, 2024
github-actions bot added labels feature/admin-network-policy, feature/egress-service, feature/egress-gateway, feature/egress-firewall on Oct 7, 2024
jcaamano force-pushed the bump-1.31 branch 3 times, most recently from 87df54b to 085bc2e on October 8, 2024 16:44
npinaeva (Member) commented Oct 9, 2024

Not sure if you would like to include this, but we were going to add some CEL rules for the UDN CRD; there are TODOs like this one: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/crd/userdefinednetwork/v1/types.go#L213 (they just need to be un-commented).
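
For context, these CRD CEL rules are kubebuilder validation markers on the Go API types; un-commenting one of the TODOs yields something shaped like this sketch (the field and rule here are illustrative, not the actual TODO at the link above):

```go
// Illustrative only: a kubebuilder CEL validation marker of the kind the
// TODOs refer to. controller-gen turns the marker into an
// x-kubernetes-validations rule in the generated CRD, and the API server
// then enforces it with CEL at admission time.
type ExampleSpec struct {
	// Topology is a hypothetical field; the rule below makes it immutable.
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="topology is immutable"
	Topology string `json:"topology"`
}
```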

jcaamano force-pushed the bump-1.31 branch 2 times, most recently from 9b0e1e3 to e5d9182 on October 9, 2024 11:48
jcaamano (Contributor, Author) commented Oct 9, 2024

> Not sure if you would like to include this, but we were going to add some CEL rules for the UDN CRD; there are TODOs like this one: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/crd/userdefinednetwork/v1/types.go#L213 (they just need to be un-commented).

Are there no tests for those CELs?

@@ -0,0 +1 @@
vendor

(review comment from a Contributor) nit: add a newline at the end of the file

jcaamano force-pushed the bump-1.31 branch 2 times, most recently from 01e2cf0 to 0c4b1a2 on October 10, 2024 18:13
flavio-fernandes added a commit to flavio-fernandes/ovn-kubernetes that referenced this pull request Oct 10, 2024
jcaamano force-pushed the bump-1.31 branch 2 times, most recently from 1a78431 to 28554da on October 11, 2024 14:16
npinaeva (Member) commented

> Are there no tests for those CELs?

Not really :(
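
(As an aside: one lightweight way to exercise CRD CEL rules without dedicated unit tests is a server-side dry-run against a cluster that has the CRD installed; it runs full validation, including x-kubernetes-validations, without persisting the object. The manifest name below is hypothetical:)

```sh
# Hypothetical manifest; --dry-run=server sends the object through full
# server-side validation (including CEL rules) without creating it.
kubectl create --dry-run=server -f udn.yaml
```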

jcaamano force-pushed the bump-1.31 branch 3 times, most recently from 4842690 to 4878de8 on October 14, 2024 10:21
flavio-fernandes (Contributor) commented

Failed job: e2e (control-plane, noHA, local, ipv6, noSnatGW, 1br, ic-single-node-zones)

https://github.com/ovn-org/ovn-kubernetes/actions/runs/11325594330/job/31494087027?pr=4759

Summarizing 7 Failures:
  [FAIL] EgressService Multiple Networks, external clients sharing ip [LGW] Should validate pods on different networks can reach different clients with same ip without SNAT [It] ipv6 pods
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/egress_services.go:1234
  [FAIL] ACL Logging for EgressFirewall [BeforeEach] when the namespace is brought up with the initial ACL log severity when the denied destination is poked the logs should have the expected log level
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/acl_logging.go:542
  [FAIL] ACL Logging for EgressFirewall [BeforeEach] when the namespace is brought up with the initial ACL log severity when the allowed destination is poked the logs should have the expected log level
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/acl_logging.go:542
  [FAIL] ACL Logging for EgressFirewall [BeforeEach] when the namespace's ACL logging annotation is updated when the denied destination is poked the logs should have the expected log level
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/acl_logging.go:542
  [FAIL] ACL Logging for EgressFirewall [BeforeEach] when the namespace's ACL logging annotation is updated when the allowed destination is poked the logs should have the expected log level
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/acl_logging.go:542
  [FAIL] ACL Logging for EgressFirewall [BeforeEach] when the namespace's ACL logging allow annotation is removed when the denied destination is poked the logs should have the expected log level
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/acl_logging.go:542
  [FAIL] ACL Logging for EgressFirewall [BeforeEach] when the namespace's ACL logging allow annotation is removed when the allowed destination is poked there should be no trace in the ACL logs
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/acl_logging.go:542

jcaamano (Contributor, Author) commented

> Failed job: e2e (control-plane, noHA, local, ipv6, noSnatGW, 1br, ic-single-node-zones)
>
> https://github.com/ovn-org/ovn-kubernetes/actions/runs/11325594330/job/31494087027?pr=4759
>
> Summarizing 7 Failures: [failure summary quoted in full above]

The problem is a crash in kube-controller-manager, which considerably delays the creation of the default service account in new namespaces:

2024-10-14T11:27:20.573127782Z stderr F 	goroutine 735 [running]:
2024-10-14T11:27:20.573132561Z stderr F 	k8s.io/apimachinery/pkg/util/runtime.logPanic({0x38432a0, 0x5545b00}, {0x2d99000, 0x5480790})
2024-10-14T11:27:20.573137951Z stderr F 		k8s.io/apimachinery/pkg/util/runtime/runtime.go:107 +0xbc
2024-10-14T11:27:20.573154772Z stderr F 	k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x38432a0, 0x5545b00}, {0x2d99000, 0x5480790}, {0x5545b00, 0x0, 0x43d945?})
2024-10-14T11:27:20.573159501Z stderr F 		k8s.io/apimachinery/pkg/util/runtime/runtime.go:82 +0x5e
2024-10-14T11:27:20.573187343Z stderr F 	k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc001e5ec40?})
2024-10-14T11:27:20.573192182Z stderr F 		k8s.io/apimachinery/pkg/util/runtime/runtime.go:59 +0x108
2024-10-14T11:27:20.57319655Z stderr F 	panic({0x2d99000?, 0x5480790?})
2024-10-14T11:27:20.573200548Z stderr F 		runtime/panic.go:770 +0x132
2024-10-14T11:27:20.573204836Z stderr F 	k8s.io/cloud-provider/controllers/service.(*Controller).needsUpdate(0xc000978000, 0xc00324c288, 0xc000cf3688)
2024-10-14T11:27:20.573210016Z stderr F 		k8s.io/cloud-provider/controllers/service/controller.go:606 +0xcbb
2024-10-14T11:27:20.573214644Z stderr F 	k8s.io/cloud-provider/controllers/service.New.func2({0x326c8e0?, 0xc00324c288?}, {0x326c8e0, 0xc000cf3688?})
2024-10-14T11:27:20.573219133Z stderr F 		k8s.io/cloud-provider/controllers/service/controller.go:144 +0x74
2024-10-14T11:27:20.573224062Z stderr F 	k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
2024-10-14T11:27:20.573228831Z stderr F 		k8s.io/client-go/tools/cache/controller.go:253
2024-10-14T11:27:20.573233339Z stderr F 	k8s.io/client-go/tools/cache.(*processorListener).run.func1()
2024-10-14T11:27:20.573237758Z stderr F 		k8s.io/client-go/tools/cache/shared_informer.go:976 +0xea
2024-10-14T11:27:20.573241875Z stderr F 	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
2024-10-14T11:27:20.57326562Z stderr F 		k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-10-14T11:27:20.573270419Z stderr F 	k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00311cf70, {0x380e700, 0xc001e3a8a0}, 0x1, 0xc001e3e780)
2024-10-14T11:27:20.573274767Z stderr F 		k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-10-14T11:27:20.573279556Z stderr F 	k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001e5a770, 0x3b9aca00, 0x0, 0x1, 0xc001e3e780)
2024-10-14T11:27:20.573284085Z stderr F 		k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-10-14T11:27:20.573288433Z stderr F 	k8s.io/apimachinery/pkg/util/wait.Until(...)
2024-10-14T11:27:20.573292821Z stderr F 		k8s.io/apimachinery/pkg/util/wait/backoff.go:161
2024-10-14T11:27:20.573297059Z stderr F 	k8s.io/client-go/tools/cache.(*processorListener).run(0xc000892d80)
2024-10-14T11:27:20.573302339Z stderr F 		k8s.io/client-go/tools/cache/shared_informer.go:972 +0x69
2024-10-14T11:27:20.573306677Z stderr F 	k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
2024-10-14T11:27:20.573310975Z stderr F 		k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52
2024-10-14T11:27:20.573315874Z stderr F 	created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 690
2024-10-14T11:27:20.573320913Z stderr F 		k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

The problem is a race condition: the cloud provider service controller may initialize its event recorder too late, after its informer handlers have already started receiving events
https://github.com/kubernetes/kubernetes/blob/948afe5ca072329a73c8e79ed5938717a5cb3d21/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L224

introduced with
kubernetes/kubernetes@50c1243

We will most likely see more of this. I'm thinking about what to do, but we will probably have to stay on 1.30.2 at runtime for a bit.
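
A minimal sketch of the failure mode (not the actual cloud-provider code; all names are illustrative): handlers are wired up in the constructor, but a field they use is only set later in Run(), so an event delivered in the gap dereferences nil and panics.

```go
package main

import "fmt"

type recorder struct{}

func (r *recorder) Event(msg string) { fmt.Println("event:", msg) }

// controller mimics the shape of the bug: eventRecorder is populated in
// Run(), but the update handler registered in newController can fire first.
type controller struct {
	eventRecorder *recorder
	onUpdate      func()
}

func newController() *controller {
	c := &controller{}
	// Handler registered before eventRecorder is set: this is the race window.
	c.onUpdate = func() {
		c.eventRecorder.Event("service updated") // nil dereference if Run() hasn't happened
	}
	return c
}

func (c *controller) Run() {
	c.eventRecorder = &recorder{} // too late for events delivered before this point
}

func main() {
	// Mirrors k8s.io/apimachinery's HandleCrash: report the panic instead of dying silently.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("observed panic:", r)
		}
	}()
	c := newController()
	c.onUpdate() // an informer event arriving before Run() reproduces the crash
	c.Run()      // never reached here; in the real controller, Run may simply come too late
}
```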

flavio-fernandes (Contributor) commented

> We will most likely see more of this. I'm thinking about what to do, but we will probably have to stay on 1.30.2 at runtime for a bit.

TY for the info @jcaamano! Assuming we are not planning on merging this any time soon, we may rebase the network QoS PR onto the master branch instead. It would be nice to address the codegen changes, though. Do you think that could be done in an interim PR (ref: #4759 (comment) <= #4748)?

jcaamano (Contributor, Author) commented

> > We will most likely see more of this. I'm thinking about what to do, but we will probably have to stay on 1.30.2 at runtime for a bit.
>
> TY for the info @jcaamano! Assuming we are not planning on merging this any time soon, we may rebase the network QoS PR onto the master branch instead. It would be nice to address the codegen changes, though. Do you think that could be done in an interim PR (ref: #4759 (comment) <= #4748)?

I'm trying to understand the problem a bit more. It doesn't look like that is the first crash; there had to be a different one before it, because the manager should not be starting at the same time the tests are running.

Also, one other option would be to disable the service-lb-controller for the time being, as long as that doesn't break something else.
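
For reference, kube-controller-manager selects which controllers to run via its --controllers flag: "*" selects the defaults, an explicit name enables a non-default controller, and a leading "-" disables one. A sketch of the override being discussed, and implemented in the commit below (controller names as listed by kube-controller-manager --help):

```sh
# Keep the defaults, re-add the non-default controllers kind relies on,
# and drop the crashing service LB controller.
kube-controller-manager --controllers='*,bootstrap-signer-controller,token-cleaner-controller,-service-lb-controller'
```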

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
(Commit message: the same module bumps and fixes as in the PR description above, plus:)

Fixed skips for some upstream e2e tests that were added and that we don't support

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
flavio-fernandes (Contributor) commented

known flake: #4480

tssurya added this to the v1.1.0 milestone Oct 16, 2024
It seems that v1.31.1 introduced a bug in kube-controller-manager's
service-lb-controller. Since we don't use a cloud provider, the
controller is never fully initialized and started. However, its handlers
are still added to the informer and they do run; when they do, the
controller crashes because it is not fully initialized.

Probably introduced through:
kubernetes/kubernetes@50c1243

Disable service-lb-controller, since it is not used anyway.

bootstrap-signer-controller and token-cleaner-controller need to be
added explicitly: they are not part of the default set, so kind would
normally add them itself, but not when we are overriding the controller list.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Labels
area/e2e-testing, area/unit-testing, feature/admin-network-policy, feature/egress-firewall, feature/egress-gateway, feature/egress-qos, feature/egress-service, feature/kubevirt-live-migration, feature/services&endpoints, kind/documentation
Projects
Status: Todo

4 participants