Endpointslice sync latency regression starting from v0.16.0 #1623

Closed
t0lya opened this issue Aug 21, 2024 · 3 comments · Fixed by #1625
Labels: bug (Something isn't working)


t0lya commented Aug 21, 2024

What happened:

We are upgrading submariner-operator from v0.15.2 to v0.17.2 across our fleet of clusters. After the upgrade we noticed an endpointslice sync latency regression. The regression is particularly bad in fleets with a large number of endpointslices in the broker namespace (>1500 endpointslices). We are seeing 5-10 minutes of latency, whereas before the upgrade syncs were almost instant.

We looked at endpointslice update logs in the broker and local clusters. Local-to-broker sync is still fast; the delay is in broker-to-local syncing.

Looking at lighthouse-agent debug logs, we see all endpointslices constantly being reprocessed, even though most of them have no changes to sync.

It looks like the regression was introduced in v0.16.0 by #1308, which set the endpointslice resync period to 2 minutes:

var BrokerResyncPeriod = time.Minute * 2

We forked lighthouse-agent with the resync disabled (var BrokerResyncPeriod = time.Duration(0)). We deployed the fork to our fleet and confirmed that sync latency went back to normal.
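
For context, here is a minimal client-go sketch (not lighthouse's actual wiring, which goes through admiral; the client setup and handler are placeholders) of why a non-zero resync period causes every cached endpointslice to be replayed through the event handlers, and therefore re-queued, on each resync interval:

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder client setup; lighthouse builds its clients differently.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// A non-zero resync period makes the informer periodically replay every
	// object in its cache through UpdateFunc, even when nothing has changed.
	// With >1500 EndpointSlices in the broker namespace, that is >1500 queue
	// additions every 2 minutes; 0 disables periodic resync entirely.
	resyncPeriod := 2 * time.Minute

	factory := informers.NewSharedInformerFactory(client, resyncPeriod)
	informer := factory.Discovery().V1().EndpointSlices().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// On a resync, oldObj and newObj are the same cached object,
			// so a "no-op" update still costs a trip through the workqueue.
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// Run long enough to observe a couple of resync passes, then stop.
	time.Sleep(5 * time.Minute)
	close(stopCh)
}
```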

What you expected to happen:

We would like to be able to disable the resync through the Helm chart rather than through a custom build. We do not require endpointslice resync, since we do not have a use case where namespaces are deleted and immediately recreated.

Also, it seems the endpointslice sync workqueue is only processed by a single thread? Could sync throughput and latency be improved by implementing a worker thread pool? Otherwise a resync can block new events from being processed for >10 minutes, as seen in our case.
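
As an illustration of the worker-pool idea (a sketch only, not lighthouse's or admiral's actual queue code; the queue setup and reconcile function are placeholders), client-go's workqueue can safely be drained by several goroutines:

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// runWorkers drains the queue with n concurrent workers instead of one.
func runWorkers(queue workqueue.RateLimitingInterface, n int, reconcile func(key string) error) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				item, shutdown := queue.Get()
				if shutdown {
					return
				}
				key := item.(string)
				if err := reconcile(key); err != nil {
					queue.AddRateLimited(key) // retry with backoff
				} else {
					queue.Forget(key)
				}
				queue.Done(item)
			}
		}()
	}
	wg.Wait()
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// In a real controller these Adds come from informer event handlers.
	for i := 0; i < 10; i++ {
		queue.Add(fmt.Sprintf("endpointslice-%d", i))
	}
	// ShutDown lets the workers drain what is already queued, then exit.
	queue.ShutDown()

	runWorkers(queue, 4, func(key string) error {
		fmt.Println("synced", key)
		return nil
	})
}
```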

Also, are there any metrics to track sync latency? If not, would it be possible to add them?
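
On the metrics question: if nothing suitable exists today, here is a minimal sketch of what such a metric could look like with Prometheus client_golang (the metric name, label, and helper are hypothetical, not an existing lighthouse metric):

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// SyncLatency is a hypothetical histogram recording how long each
// endpointslice takes from being queued until its sync completes.
var SyncLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "lighthouse_endpointslice_sync_duration_seconds",
		Help:    "Time from enqueue to completed sync of an EndpointSlice.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // 0.1s .. ~200s
	},
	[]string{"direction"}, // e.g. "broker_to_local" or "local_to_broker"
)

func init() {
	prometheus.MustRegister(SyncLatency)
}

// ObserveSync would be called by the sync loop once an item finishes.
func ObserveSync(direction string, enqueuedAt time.Time) {
	SyncLatency.WithLabelValues(direction).Observe(time.Since(enqueuedAt).Seconds())
}
```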

How to reproduce it (as minimally and precisely as possible):
1. Create 2 clusters and join them to the broker.
2. Create 1500 headless services/service exports in cluster A. This should create 1500 endpointslices in the broker.
3. Restart a deployment for a headless service in cluster A to trigger a pod IP change in its endpointslice.
4. Observe that the endpointslice in the broker cluster gets updated immediately, but the endpointslice in cluster B takes some time to receive the latest changes.

Anything else we need to know?:

Environment: Linux

  • Diagnose information (use subctl diagnose all):
  • Gather information (use subctl gather):
  • Cloud provider or hardware configuration: Azure Kubernetes
  • Install tools: Helm chart
  • Others:
t0lya added the bug (Something isn't working) label Aug 21, 2024
tpantelis (Contributor) commented Aug 21, 2024

Thanks for researching and reporting the issue.

The problem is not the processing of the EndpointSlices from the queue but the rate limiting of the queue. We've seen issues with processing a lot of items in a short period of time and reduced the rate limiting to improve it, but the 2-minute resync period can definitely be problematic at larger scale. It was put in to easily handle a hypothetical edge case that isn't likely to happen in practice. We could bump the resync period to something much longer but, if the scenario actually did occur, that probably wouldn't be satisfactory anyway. I think we should set the resync period back to 0 - the potential performance hit of handling the edge case in this manner isn't worth it. We could look into handling it in a more performant manner later. @vthapar @aswinsuryan WDYT

t0lya (Author) commented Aug 22, 2024

For the rate limiting reduction, are you referring to this change: submariner-io/admiral#904? Did it make it into v0.17.2? We tested with v0.17.2 and experienced endpointslice sync latency on that version too.

tpantelis (Contributor) commented Aug 22, 2024

> For the rate limiting reduction, are you referring to this change: submariner-io/admiral#904? Did it make it into v0.17.2? We tested with v0.17.2 and experienced endpointslice sync latency on that version too.

Yeah, that reduced the latency. With rate limiting there will always be some artificial latency - that's its purpose, and that's OK. But the latency is maximized when a lot of items are queued in a very short amount of time, which is what happens on a resync: the overall bucket rate limiter kicks in. So we want to avoid doing that, or at least make it infrequent.
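
For illustration only: a sketch using client-go's default controller rate limiter settings (10 items/sec with a burst of 100; the exact limiter admiral configures may differ) showing how a resync burst of ~1500 keys drives the overall bucket rate limiter's delay into the minutes:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Roughly the shape of client-go's default controller rate limiter:
	// per-item exponential backoff combined with an overall token bucket.
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Simulate a resync dumping 1500 EndpointSlice keys into the queue at once.
	var last time.Duration
	for i := 0; i < 1500; i++ {
		last = limiter.When(fmt.Sprintf("endpointslice-%d", i))
	}

	// Once the 100-item burst is spent, each item waits behind the 10/sec
	// bucket, so the last item is delayed by roughly (1500-100)/10 = 140s,
	// and every resync interval refills the queue before it can catch up.
	fmt.Printf("delay imposed on the last queued item: %v\n", last)
}
```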

tpantelis self-assigned this Aug 22, 2024
tpantelis moved this to In Progress in Submariner 0.19 Aug 22, 2024
tpantelis added a commit to tpantelis/lighthouse that referenced this issue Aug 22, 2024
This has proved to be problematic at larger scale, significantly
increasing latencies due to the rate limiting. It was put in to
easily handle a hypothetical, unlikely edge case where a service
namespace is deleted then recreated but the potential performance
hit to handle it in this manner isn't worth it. Now that the
resource syncer can watch namespaces, we can handle it there.

Fixes submariner-io#1623

Signed-off-by: Tom Pantelis <[email protected]>
tpantelis added a commit to tpantelis/lighthouse that referenced this issue Aug 22, 2024 (same commit message as above)
tpantelis moved this from In Progress to In Review in Submariner 0.19 Aug 22, 2024
skitt pushed a commit that referenced this issue Aug 26, 2024 (same commit message; Fixes #1623)
github-project-automation bot moved this from In Review to Done in Submariner 0.19 Aug 26, 2024
tpantelis added four commits to tpantelis/lighthouse that referenced this issue Aug 26, 2024 (same commit message)
tpantelis added three commits that referenced this issue Aug 27, 2024 (same commit message; Fixes #1623)