Endpointslice sync latency regression starting from v0.16.0 #1623
Comments
Thanks for researching and reporting the issue. The problem is not the processing of the …
For the rate limiting reduction, are you referring to this change, submariner-io/admiral#904? Did it make it into v0.17.2? We tested with v0.17.2 and experienced the endpointslice sync latency on this version too.
Yeah, that reduced the latency. With rate limiting there will always be some artificial latency - that's the purpose of it, and that's OK. But it's maximized when a lot of items are queued in a very short amount of time, which is what happens on a resync: the overall bucket rate limiter kicks in. So we want to avoid that, or it should be infrequent.
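To make that concrete, here is a minimal, self-contained Go sketch - not Lighthouse/Admiral code - that feeds 1500 items into a rate limiter configured like client-go's default controller rate limiter (per-item exponential backoff combined with a 10 qps / burst-100 token bucket; the limits Admiral actually uses may differ) and prints the artificial delay the bucket assigns when everything is enqueued in one burst, as happens on a resync:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Combined limiter in the style of client-go's DefaultControllerRateLimiter:
	// per-item exponential backoff plus an overall token bucket (10 qps, burst 100).
	// The exact numbers used by Admiral may differ; these are the client-go defaults.
	rl := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Simulate a resync enqueueing 1500 endpointslices in one burst and print
	// the artificial delay the bucket limiter assigns to some of them.
	for i := 1; i <= 1500; i++ {
		delay := rl.When(i)
		switch i {
		case 100, 500, 1000, 1500:
			fmt.Printf("item %4d delayed by ~%v\n", i, delay.Round(time.Second))
		}
	}
}
```

With those numbers, the last of 1500 items queued in one burst waits a bit over two minutes before a worker even sees it, which is in the same ballpark as the latencies reported in this issue.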
This has proven to be problematic at larger scale, significantly increasing latencies due to the rate limiting. It was put in to easily handle a hypothetical, unlikely edge case where a service namespace is deleted then recreated but the potential performance hit to handle it in this manner isn't worth it. Now that the resource syncer can watch namespaces, we can handle it there. Fixes submariner-io#1623 Signed-off-by: Tom Pantelis <[email protected]>
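The fix itself lives in Admiral's resource syncer; as a rough illustration of the idea only - plain client-go informers, a hypothetical handler body, and not the actual Submariner code - watching namespaces with resync disabled might look something like this:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Resync period 0: no periodic re-delivery of every cached object.
	factory := informers.NewSharedInformerFactory(client, 0)

	nsInformer := factory.Core().V1().Namespaces().Informer()
	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ns := obj.(*corev1.Namespace)
			// A namespace appearing (including one that was deleted and then
			// recreated) is the trigger to re-enqueue the exported services in
			// it, instead of periodically resyncing every endpointslice.
			klog.Infof("namespace %q added; re-enqueue its ServiceExports here", ns.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// A real controller would run until shutdown; the sleep just keeps the sketch alive.
	time.Sleep(time.Hour)
	close(stop)
}
```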
What happened:
We are upgrading submariner-operator from v0.15.2 to v0.17.2 in our fleet of clusters. After the upgrade we noticed an endpointslice sync latency regression. The regression is particularly bad in fleets that have a large number of endpointslices in the broker namespace (>1500 endpointslices). We are seeing 5-10 minute latency, whereas before the upgrade syncs were almost instant.
We looked at endpointslice update logs in the broker and local clusters. Local-to-broker sync is still fast; the delay is during broker-to-local syncing.
Looking at lighthouse-agent debug logs, we see all endpointslices constantly getting reprocessed, even though most of them have no changes to sync.
It looks like the regression was introduced in v0.16.0 with the following change, #1308, which set the endpointslice resync period to 2 minutes:
lighthouse/pkg/agent/controller/types.go
Line 45 in 1aaa9dc
We forked lighthouse-agent with the resync disabled (var BrokerResyncPeriod = time.Duration(0)). We deployed the fork to our fleet and were able to confirm the sync latency went back to normal.
What you expected to happen:
We would like to be able to disable the resync through the Helm chart rather than through a custom build. We do not require the endpointslice resync, since we do not have a use case where namespaces are deleted and immediately recreated.
Also, it seems the endpointslice sync workqueue is only processed by a single worker? Can we increase sync throughput and improve latency by implementing a worker pool? Otherwise a resync can block new events from being processed for >10 minutes, as seen in our case.
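For reference, the client-go workqueue already supports multiple consumers, so fan-out is largely a matter of starting more worker goroutines. The following is a generic, hedged sketch of the usual controller worker-pool pattern - it is not the actual Lighthouse code, and whether Lighthouse/Admiral can safely run several workers depends on its syncer's ordering assumptions:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// processNextItem pulls one key off the queue and "syncs" it. In a real
// controller this would be the endpointslice sync handler.
func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	// ... sync the endpointslice identified by key ...
	fmt.Printf("synced %v\n", key)
	queue.Forget(key)

	return true
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	stop := make(chan struct{})

	// Instead of a single worker goroutine, start a small pool. Each worker
	// pulls from the same queue.
	const workers = 8
	for i := 0; i < workers; i++ {
		go wait.Until(func() {
			for processNextItem(queue) {
			}
		}, time.Second, stop)
	}

	// Demo: enqueue a few keys, let the workers drain them, then shut down.
	for i := 0; i < 20; i++ {
		queue.Add(fmt.Sprintf("cluster1/nginx-%d", i))
	}

	time.Sleep(2 * time.Second)
	queue.ShutDown()
	close(stop)
}
```

The queue deduplicates keys and never hands the same key to two workers at once, so a pool mainly helps when many distinct endpointslices are pending.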
Also, are there any metrics to track sync latency? If not, is it possible to add them?
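For illustration, a sync-latency metric could be as simple as a Prometheus histogram observed around the sync handler. This is a hedged sketch only - the metric name (endpointslice_sync_duration_seconds) and the observeSync wrapper are hypothetical, not existing Lighthouse code:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical histogram for broker-to-local sync latency.
var syncLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "endpointslice_sync_duration_seconds",
	Help:    "Time from receiving a broker update to applying it locally.",
	Buckets: prometheus.ExponentialBuckets(0.1, 2, 14), // 0.1s up to roughly 13 minutes
}, []string{"direction"})

// observeSync wraps a sync handler and records how long it took.
func observeSync(direction string, sync func() error) error {
	start := time.Now()
	err := sync()
	syncLatency.WithLabelValues(direction).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	// Expose the metric for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())

	go func() {
		// Demo "sync" so the histogram has something to show.
		_ = observeSync("broker_to_local", func() error {
			time.Sleep(250 * time.Millisecond)
			return nil
		})
	}()

	_ = http.ListenAndServe(":8080", nil)
}
```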
How to reproduce it (as minimally and precisely as possible):
Create 2 clusters and join them to the broker.
Create 1500 headless services/ServiceExports in cluster A. This should create 1500 endpointslices in the broker (a scripted sketch of this step follows the list).
Restart a deployment for one of the headless services in cluster A to trigger a pod IP change in its endpointslice.
Observe that the endpointslice in the broker cluster gets updated immediately, but the endpointslice in cluster B takes some time to receive the latest changes.
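As an illustration of the "create 1500 headless services/ServiceExports" step, here is a rough Go sketch using client-go. The namespace (scale-test), labels, and ports are arbitrary, and it assumes matching Deployments/pods exist (or are created separately) so the services actually get endpoints; it is a sketch of the setup, not a tested script:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	// ServiceExport from the MCS API, which Lighthouse implements.
	exportGVR := schema.GroupVersionResource{
		Group: "multicluster.x-k8s.io", Version: "v1alpha1", Resource: "serviceexports",
	}

	const ns = "scale-test" // assumed to already exist
	for i := 0; i < 1500; i++ {
		name := fmt.Sprintf("hs-%d", i)

		// Headless service (ClusterIP: None) selecting a hypothetical app label.
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				ClusterIP: corev1.ClusterIPNone,
				Selector:  map[string]string{"app": name},
				Ports:     []corev1.ServicePort{{Name: "http", Port: 80}},
			},
		}
		if _, err := client.CoreV1().Services(ns).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}

		// Matching ServiceExport so Lighthouse syncs the service to the broker.
		export := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "multicluster.x-k8s.io/v1alpha1",
			"kind":       "ServiceExport",
			"metadata":   map[string]interface{}{"name": name, "namespace": ns},
		}}
		if _, err := dyn.Resource(exportGVR).Namespace(ns).Create(context.TODO(), export, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```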
Anything else we need to know?:
Environment: Linux
Diagnose information (use subctl diagnose all):
Gather information (use subctl gather):