Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quic: increase timeout and keep alive #4585

Merged
merged 4 commits into from
Jan 24, 2025
Merged

Conversation

0x0ece
Copy link

@0x0ece 0x0ece commented Jan 22, 2025

Problem

Quic timeout and keep alive are very low.
When a validator is not leader, it still has thousands of connections open that are not expected to send any data.
Because keep alive is 1s, each client sends a quic ping every second, which is a lot of unnecessary traffic.

Summary of Changes

Increased timeout to 60s and keep alive to 45s.

@mergify mergify bot requested a review from a team January 22, 2025 22:03
alessandrod
alessandrod previously approved these changes Jan 22, 2025
@lijunwangs
Copy link

Maybe we should different timeout values: stream timeout and idle connection timeout (controlled by the ping)

@lijunwangs
Copy link

We need to investigate why the CI is consistently failing for this PR. I am updating the branch again.

@lijunwangs lijunwangs merged commit a1d26ad into anza-xyz:master Jan 24, 2025
57 checks passed
@behzadnouri
Copy link

@0x0ece Looks like this change has made test_snapshots_restart_validity very difficult to pass.
See "6/10 local-cluster" which takes 9 retries to pass: https://buildkite.com/anza/agave/builds/17947
(on another PR I didn't have any luck with many retries but the CI passed once I reverted this change).

The CI for this change is green but I am guessing the combination of this change with another PR (maybe #4586 ) is causing the test to fail. I have confirmed though reverting this change resolves the issue: https://buildkite.com/anza/agave/builds/17958

This is currently blocking work because the CI is not usable. Can we please revert this and test the change again?

@lijunwangs
Copy link

@0x0ece Looks like this change has made test_snapshots_restart_validity very difficult to pass. See "6/10 local-cluster" which takes 9 retries to pass: https://buildkite.com/anza/agave/builds/17947 (on another PR I didn't have any luck with many retries but the CI passed once I reverted this change).

The CI for this change is green but I am guessing the combination of this change with another PR (maybe #4586 ) is causing the test to fail. I have confirmed though reverting this change resolves the issue: https://buildkite.com/anza/agave/builds/17958

This is currently blocking work because the CI is not usable. Can we please revert this and test the change again?

okay. We can revert it. I will revert immediately @behzadnouri

@0x0ece
Copy link
Author

0x0ece commented Jan 26, 2025

Ok, definitely let’s investigate.
FYI, after this I collected more data. In a TPU session we’re getting something like 1m packets with txs and 2.3m pings from peers that are just idle, not sending txs.

@lijunwangs
Copy link

Ok, definitely let’s investigate. FYI, after this I collected more data. In a TPU session we’re getting something like 1m packets with txs and 2.3m pings from peers that are just idle, not sending txs.

PR for revert: #4637

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants