Endpointslice sync latency regression starting from v0.16.0 #1623

Closed
t0lya opened this issue Aug 21, 2024 · 3 comments · Fixed by #1625
Labels: bug (Something isn't working)


t0lya commented Aug 21, 2024

What happened:

We are upgrading submariner-operator from v0.15.2 to v0.17.2 across our fleet of clusters. After the upgrade we noticed an endpointslice sync latency regression. The regression is particularly bad in fleets with a large number of endpointslices in the broker namespace (>1500 endpointslices). We are seeing 5-10 minutes of latency, whereas before the upgrade syncs were almost instant.

We looked at endpointslice update logs in the broker and local clusters. Local-to-broker sync is still fast; the delay is in broker-to-local syncing.

Looking at lighthouse-agent debug logs, we see all endpointslices constantly being reprocessed, even though most of them have no changes to sync.

It looks like the regression was introduced in v0.16.0 by #1308, which set the endpointslice resync period to 2 minutes:

var BrokerResyncPeriod = time.Minute * 2

We forked lighthouse-agent with the resync disabled (var BrokerResyncPeriod = time.Duration(0)). We deployed the fork to our fleet and confirmed that sync latency went back to normal.
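
For context, here is a minimal client-go sketch (not lighthouse's actual wiring, which goes through admiral; the client setup and handler are placeholders) of why a non-zero resync period causes every cached endpointslice to be replayed through the event handlers, and therefore re-queued, on each resync interval:

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder client setup; lighthouse builds its clients differently.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// A non-zero resync period makes the informer periodically replay every
	// object in its cache through UpdateFunc, even when nothing has changed.
	// With >1500 EndpointSlices in the broker namespace, that is >1500 queue
	// additions every 2 minutes; 0 disables periodic resync entirely.
	resyncPeriod := 2 * time.Minute

	factory := informers.NewSharedInformerFactory(client, resyncPeriod)
	informer := factory.Discovery().V1().EndpointSlices().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// On a resync, oldObj and newObj are the same cached object,
			// so a "no-op" update still costs a trip through the workqueue.
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// Run long enough to observe a couple of resync passes, then stop.
	time.Sleep(5 * time.Minute)
	close(stopCh)
}
```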

What you expected to happen:

We would like to be able to disable the resync through the Helm chart rather than through a custom build. We do not require endpointslice resync, since we do not have a use case where namespaces are deleted and immediately recreated.

Also, it seems the endpointslice sync workqueue is only processed by a single thread? Could sync throughput and latency be improved by implementing a worker thread pool? Otherwise a resync can block new events from being processed for >10 minutes, as seen in our case.
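
As an illustration of the worker-pool idea (a sketch only, not lighthouse's or admiral's actual queue code; the queue setup and reconcile function are placeholders), client-go's workqueue can safely be drained by several goroutines:

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// runWorkers drains the queue with n concurrent workers instead of one.
func runWorkers(queue workqueue.RateLimitingInterface, n int, reconcile func(key string) error) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				item, shutdown := queue.Get()
				if shutdown {
					return
				}
				key := item.(string)
				if err := reconcile(key); err != nil {
					queue.AddRateLimited(key) // retry with backoff
				} else {
					queue.Forget(key)
				}
				queue.Done(item)
			}
		}()
	}
	wg.Wait()
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// In a real controller these Adds come from informer event handlers.
	for i := 0; i < 10; i++ {
		queue.Add(fmt.Sprintf("endpointslice-%d", i))
	}
	// ShutDown lets the workers drain what is already queued, then exit.
	queue.ShutDown()

	runWorkers(queue, 4, func(key string) error {
		fmt.Println("synced", key)
		return nil
	})
}
```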

Also, are there any metrics to track sync latency? If not, would it be possible to add them?
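
On the metrics question: if nothing suitable exists today, here is a minimal sketch of what such a metric could look like with Prometheus client_golang (the metric name, label, and helper are hypothetical, not an existing lighthouse metric):

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// SyncLatency is a hypothetical histogram recording how long each
// endpointslice takes from being queued until its sync completes.
var SyncLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "lighthouse_endpointslice_sync_duration_seconds",
		Help:    "Time from enqueue to completed sync of an EndpointSlice.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // 0.1s .. ~200s
	},
	[]string{"direction"}, // e.g. "broker_to_local" or "local_to_broker"
)

func init() {
	prometheus.MustRegister(SyncLatency)
}

// ObserveSync would be called by the sync loop once an item finishes.
func ObserveSync(direction string, enqueuedAt time.Time) {
	SyncLatency.WithLabelValues(direction).Observe(time.Since(enqueuedAt).Seconds())
}
```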

How to reproduce it (as minimally and precisely as possible):
1. Create 2 clusters and join them to the broker.
2. Create 1500 headless services/service exports in cluster A. This should create 1500 endpointslices in the broker.
3. Restart a deployment for a headless service in cluster A to trigger a pod IP change in its endpointslice.
4. Observe that the endpointslice in the broker cluster gets updated immediately, but the endpointslice in cluster B takes some time to receive the latest changes.

Anything else we need to know?:

Environment: Linux

  • Diagnose information (use subctl diagnose all):
  • Gather information (use subctl gather):
  • Cloud provider or hardware configuration: Azure Kubernetes
  • Install tools: Helm chart
  • Others:
t0lya added the bug (Something isn't working) label Aug 21, 2024
tpantelis (Contributor) commented Aug 21, 2024

Thanks for researching and reporting the issue.

The problem is not the processing of the EndpointSlices from the queue but the rate limiting of the queue. We've seen issues with processing a lot of items in a short period of time and reduced the rate limiting to improve it, but the 2-minute resync period can definitely be problematic at larger scale. It was put in to easily handle a hypothetical edge case that isn't likely to happen in practice. We could bump the resync period to something much longer but, if the scenario actually did occur, that probably wouldn't be satisfactory anyway. I think we should set the resync period back to 0 - the potential performance hit of handling the edge case in this manner isn't worth it. We could look into handling it in a more performant manner later. @vthapar @aswinsuryan WDYT

t0lya (Author) commented Aug 22, 2024

For the rate limiting reduction, are you referring to this change: submariner-io/admiral#904? Did it make it into v0.17.2? We tested with v0.17.2 and experienced endpointslice sync latency on that version too.

tpantelis (Contributor) commented Aug 22, 2024

> For the rate limiting reduction, are you referring to this change: submariner-io/admiral#904? Did it make it into v0.17.2? We tested with v0.17.2 and experienced endpointslice sync latency on that version too.

Yeah, that reduced the latency. With rate limiting there will always be some artificial latency - that's its purpose, and that's OK. But the latency is maximized when a lot of items are queued in a very short amount of time, which is what happens on a resync: the overall bucket rate limiter kicks in. So we want to avoid doing that, or at least make it infrequent.
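
For illustration only: a sketch using client-go's default controller rate limiter settings (10 items/sec with a burst of 100; the exact limiter admiral configures may differ) showing how a resync burst of ~1500 keys drives the overall bucket rate limiter's delay into the minutes:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Roughly the shape of client-go's default controller rate limiter:
	// per-item exponential backoff combined with an overall token bucket.
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Simulate a resync dumping 1500 EndpointSlice keys into the queue at once.
	var last time.Duration
	for i := 0; i < 1500; i++ {
		last = limiter.When(fmt.Sprintf("endpointslice-%d", i))
	}

	// Once the 100-item burst is spent, each item waits behind the 10/sec
	// bucket, so the last item is delayed by roughly (1500-100)/10 = 140s,
	// and every resync interval refills the queue before it can catch up.
	fmt.Printf("delay imposed on the last queued item: %v\n", last)
}
```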

tpantelis self-assigned this Aug 22, 2024
tpantelis moved this to In Progress in Submariner 0.19 Aug 22, 2024
tpantelis added a commit to tpantelis/lighthouse that referenced this issue Aug 22, 2024
This has proved to be problematic at larger scale, significantly
increasing latencies due to the rate limiting. It was put in to
easily handle a hypothetical, unlikely edge case where a service
namespace is deleted then recreated but the potential performance
hit to handle it in this manner isn't worth it. Now that the
resource syncer can watch namespaces, we can handle it there.

Fixes submariner-io#1623

Signed-off-by: Tom Pantelis <[email protected]>
tpantelis added a commit to tpantelis/lighthouse that referenced this issue Aug 22, 2024 (same commit message as above)
tpantelis moved this from In Progress to In Review in Submariner 0.19 Aug 22, 2024
skitt pushed a commit that referenced this issue Aug 26, 2024 (same commit message; Fixes #1623)
github-project-automation bot moved this from In Review to Done in Submariner 0.19 Aug 26, 2024
tpantelis added four commits to tpantelis/lighthouse that referenced this issue Aug 26, 2024 (same commit message)
tpantelis added three commits that referenced this issue Aug 27, 2024 (same commit message; Fixes #1623)