Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2.1: excludes deprecated values from pull responses (backport of #3653) #4981

Open
wants to merge 2 commits into
base: v2.1
Choose a base branch
from

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented Feb 14, 2025

Problem

Shouldn't waste bandwidth on propagating deperecated values.

Summary of Changes

The commit excludes deprecated values from pull responses.


This is an automatic backport of pull request #3653 done by Mergify.

(cherry picked from commit cba5d47)

# Conflicts:
#	gossip/src/crds_data.rs
@mergify mergify bot added the conflicts label Feb 14, 2025
@mergify mergify bot requested a review from a team as a code owner February 14, 2025 01:37
Copy link
Author

mergify bot commented Feb 14, 2025

Cherry-pick of cba5d47 has failed:

On branch mergify/bp/v2.1/pr-3653
Your branch is up to date with 'origin/v2.1'.

You are currently cherry-picking commit cba5d472a.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   gossip/src/crds.rs

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   gossip/src/crds_data.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@behzadnouri behzadnouri force-pushed the mergify/bp/v2.1/pr-3653 branch from d0c4563 to 5c32f11 Compare February 14, 2025 02:03
Copy link

@gregcusack gregcusack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@@ -435,7 +454,7 @@ impl Version {

#[cfg_attr(feature = "frozen-abi", derive(AbiExample))]
#[derive(Clone, Debug, PartialEq, Eq, Deserialize, Serialize)]
pub struct NodeInstance {
pub(crate) struct NodeInstance {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we still need NodeInstance::new(). Although looks like needed for tests...

@t-nelson
Copy link

can we get bp justification?

we're skipping testnet here no?

do we have metrics for frequency at which these messages are being propagated?

@gregcusack
Copy link

justification for removal of NodeInstance is that we need to speed up the removal of NodeInstance in gossip. Vast majority of gossip traffic spikes are from NodeInstance. This value is deprecated. TBH tho, we don't know yet why we are seeing huge spikes in NodeInstance yet.

@alexpyattaev
Copy link

I will re-check all the assumptions leading up to this just in case. It may be that this is actually part of the cause and not of the solution.

@behzadnouri
Copy link

can we get bp justification?

Regardless of what is causing the spikes, we need to get the deprecated values out of gossip asap because they are maintenance burden and a waste of bandwidth and can be a source of bugs/issues.
Additionally firedancer is forced to implement/support the deprecated values if they stay there any longer.

I will also assure you a change like this is a lot more impactful and effective than a bounded channel which indiscriminately drops packets.

@behzadnouri
Copy link

I will re-check all the assumptions leading up to this just in case. It may be that this is actually part of the cause and not of the solution.

@alexpyattaev can you also please help with debugging gossip spikes as well?
It is always good to have more eyes on the metrics, and it also helps you get to know how to debug mainnet issues based on the metrics/logs.

@alexpyattaev
Copy link

@alexpyattaev can you also please help with debugging gossip spikes as well? It is always good to have more eyes on the metrics, and it also helps you get to know how to debug mainnet issues based on the metrics/logs.

Collected gossip during a few spikes, investigating.

Copy link

@alexpyattaev alexpyattaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks appropriate for removing deprecated stuff from PullResponses, but I think it will do nothing for the current spikes. Spikes in load are 90% push messages, not pull responses, so my understanding is that these things are unrelated. Though I would prefer similar logic to be applied to push messages first to mitigate ongoing problems.

@behzadnouri
Copy link

This looks appropriate for removing deprecated stuff from PullResponses, but I think it will do nothing for the current spikes. Spikes in load are 90% push messages, not pull responses, so my understanding is that these things are unrelated. Though I would prefer similar logic to be applied to push messages first to mitigate ongoing problems.

There are 2 steps we need to take:

  1. This patch removes deprecated types from pull responses.
  2. Once the cluster has upgraded to this patch, then we can also remove deprecated types from push messages.

We cannot do (2) before the cluster has upgraded to (1) because then whatever is not pushed is returned in pull-responses anyways which adds even more overhead since pull requests are more expensive to process.

Also we cannot upgrade the cluster atomically to the next release, and that is why we need to do (1) and (2) in separate releases, and wait for the cluster to adequately upgrade to (1) first. Otherwise during the upgrade process where half of the cluster is running the old code, there will be a lot more packets because of nodes running old version sending the deprecated types to other nodes through pull-responses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants