[dag] refactor/improve dag health monitoring #11362

Merged
merged 9 commits into from
Feb 6, 2024

Conversation

@ibalajiarun ibalajiarun commented Dec 15, 2023

Description

  • Refactors chain health backoff by decoupling it from LeaderReputationAdapter.
  • Adds a PipelineBackpressure mechanism.
  • Organizes both chain health and pipeline backpressure into a health module.
  • Adds on-chain configs for the leader reputation adapter.

Test Plan

trunk-io bot commented Dec 15, 2023

⏱️ 10h total CI duration on this PR
Job | Cumulative Duration | Recent Runs
rust-unit-tests | 4h 56m | 🟩🟩🟩🟩 (+5 more)
windows-build | 2h 29m | 🟩🟩🟩🟩🟩 (+5 more)
run-tests-main-branch | 51m | 🟩🟩🟩🟩🟩 (+6 more)
rust-lints | 38m | 🟥🟥🟥🟥🟥 (+5 more)
general-lints | 27m | 🟩🟩🟩🟩🟩 (+5 more)
check-dynamic-deps | 19m | 🟩🟩🟩🟩🟩 (+5 more)
check | 11m | 🟥🟥🟥🟥🟩 (+6 more)
semgrep/ci | 3m | 🟩🟩🟩🟩🟩 (+5 more)
file_change_determinator | 2m | 🟩🟩🟩🟩🟩 (+6 more)
file_change_determinator | 2m | 🟩🟩🟩🟩🟩 (+5 more)
permission-check | 39s | 🟩🟩🟩🟩🟩 (+6 more)
permission-check | 31s | 🟩🟩🟩🟩🟩 (+6 more)
permission-check | 29s | 🟩🟩🟩🟩🟩 (+6 more)
permission-check | 23s | 🟩🟩🟩🟩🟩 (+5 more)
permission-check | 2s | 🟩

🚨 1 job on the last run was significantly faster/slower than expected

Job | Duration | vs 7d avg | Delta
windows-build | 11m | 18m | -39%

@ibalajiarun ibalajiarun changed the base branch from main to balaji/dag-event-fix December 18, 2023 20:13
@ibalajiarun ibalajiarun changed the title from "[dag][bugfix] properly read events after restart" to "[dag] refactor dag health monitoring" Dec 18, 2023
@ibalajiarun ibalajiarun changed the title from "[dag] refactor dag health monitoring" to "[dag] refactor/improve dag health monitoring" Dec 18, 2023
@ibalajiarun ibalajiarun marked this pull request as ready for review December 18, 2023 20:56
Comment on lines 437 to 439
Arc<dyn AnchorElection>,
Option<Arc<LeaderReputationAdapter>>,
Option<Vec<CommitEvent>>,
Contributor Author

@ibalajiarun ibalajiarun Dec 18, 2023

This is possibly ugly. I couldn't cleanly separate them in this PR.

Contributor

we probably should always populate the commit events (for the sake of completeness)

Contributor

To make it look better, we can create another trait for getting the voting power ratio, with a no-op implementation that always returns 1, so this function can return Arc<dyn AnchorElection>, Arc<dyn CommitHistory>, Vec<CommitEvent>.

Contributor Author

I am not sure we can always populate them. How many events would we read? For AnchorElectionMode::LeaderReputation we have a config with the number of events to read, but not for the RoundRobin leader. Also, reading them seems unnecessary in the latter case?

Contributor Author

Also, introduced the CommitHistory trait.
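
The trait suggested in this thread could look roughly like the sketch below. It is only an illustration: the `CommitHistory` name and the `voting_power_ratio(round)` call appear in this PR's excerpts, but the `f64` return type, the `Round` alias, and the `NoopCommitHistory`, `FixedCommitHistory`, and `build_commit_history` names are assumptions, not the merged code.

```rust
use std::sync::Arc;

/// Round number; assumed to be a plain u64 for this sketch.
type Round = u64;

/// Hypothetical trait: reports the fraction of voting power that
/// participated in recent commits, as seen at `round`.
trait CommitHistory: Send + Sync {
    fn voting_power_ratio(&self, round: Round) -> f64;
}

/// No-op implementation for modes without reputation data (e.g. round robin):
/// always reports full participation, so no backoff is triggered.
struct NoopCommitHistory;

impl CommitHistory for NoopCommitHistory {
    fn voting_power_ratio(&self, _round: Round) -> f64 {
        1.0
    }
}

/// Stand-in for a reputation-backed implementation; returns a fixed ratio.
struct FixedCommitHistory {
    ratio: f64,
}

impl CommitHistory for FixedCommitHistory {
    fn voting_power_ratio(&self, _round: Round) -> f64 {
        self.ratio
    }
}

/// Callers always receive a trait object, so no Option is needed.
fn build_commit_history(use_reputation: bool) -> Arc<dyn CommitHistory> {
    if use_reputation {
        Arc::new(FixedCommitHistory { ratio: 0.8 })
    } else {
        Arc::new(NoopCommitHistory)
    }
}
```

With this shape, the caller can drop the `Option<Arc<LeaderReputationAdapter>>` from the return tuple and treat both election modes uniformly.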

@ibalajiarun ibalajiarun force-pushed the balaji/dag-event-fix branch 2 times, most recently from 89b8b17 to 82f79bd December 19, 2023 00:33
Base automatically changed from balaji/dag-event-fix to main December 19, 2023 01:06
@ibalajiarun ibalajiarun mentioned this pull request Dec 20, 2023

.unwrap_or((u64::MAX, u64::MAX));
let voting_power_ratio = self.chain_health.voting_power_ratio(round);

let max_txns_per_round = vec![
Contributor

nit: it's probably better to use [] than vec![] here
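
A minimal sketch of the nit, assuming the candidates are a small, fixed set of `u64` limits. The function and parameter names are hypothetical; only the `.min().expect("must not be empty")` pattern is taken from the diff excerpts in this PR.

```rust
/// Picks the tightest of several per-round transaction limits.
/// A fixed-size array lives on the stack, so it avoids the heap
/// allocation that `vec![]` would make for the same three values.
fn tightest_limit(chain_limit: u64, pipeline_limit: u64, configured_limit: u64) -> u64 {
    [chain_limit, pipeline_limit, configured_limit]
        .iter()
        .copied()
        .min()
        .expect("must not be empty")
}
```

Because the array length is known at compile time, the `expect` can never fire here.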

ordered_notifier.clone(),
);
let health_backoff =
HealthBackoff::new(self.epoch_state.clone(), chain_health, pipeline_health);
Contributor

None of these actually stops progress fully, right?

Contributor Author

correct, next PR stops voting.
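
To illustrate the point that the combined backoff only slows rounds down, here is a rough sketch. The `HealthBackoff` name and its `chain_health` and `pipeline_health` inputs come from the excerpt above; the `HealthSignal` trait, the `FixedSignal` type, and the "take the longer delay" rule are assumptions made for the example.

```rust
use std::time::Duration;

/// Hypothetical interface: each health source may ask for extra delay.
trait HealthSignal {
    fn backoff_duration(&self) -> Option<Duration>;
}

/// Stand-in signal that returns a preconfigured delay.
struct FixedSignal(Option<Duration>);

impl HealthSignal for FixedSignal {
    fn backoff_duration(&self) -> Option<Duration> {
        self.0
    }
}

/// Combines the two signals; the struct name mirrors the PR, the body is assumed.
struct HealthBackoff {
    chain_health: Box<dyn HealthSignal>,
    pipeline_health: Box<dyn HealthSignal>,
}

impl HealthBackoff {
    /// Returns the longer of the two requested delays, or zero if neither
    /// signal asks for one. The result is always a finite delay, so it
    /// slows rounds down but never blocks them outright.
    fn backoff_duration(&self) -> Duration {
        let chain = self.chain_health.backoff_duration().unwrap_or(Duration::ZERO);
        let pipeline = self.pipeline_health.backoff_duration().unwrap_or(Duration::ZERO);
        chain.max(pipeline)
    }
}
```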

};

let wait_time = self.minimal_wait_time.max(minimum_delay);

Contributor

Line 163:
I think even if voting power == 3f+1 we need to wait for minimum_delay.
Currently this is not the case.

Contributor Author

yes, fixed it.
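
One possible reading of the fix, sketched under assumptions: `minimal_wait_time` and `minimum_delay` are the names from the excerpt, while the function itself and the `has_all_voting_power` flag are hypothetical and do not reproduce the merged logic.

```rust
use std::time::Duration;

/// Computes how long a round should wait before advancing.
/// `minimum_delay` (the health backoff) applies unconditionally,
/// including when all voting power has already been observed.
fn round_wait_time(
    minimal_wait_time: Duration,
    minimum_delay: Duration,
    has_all_voting_power: bool,
) -> Duration {
    if has_all_voting_power {
        // Nothing more to collect, but the health backoff still applies.
        minimum_delay
    } else {
        // Otherwise wait for the longer of the two durations.
        minimal_wait_time.max(minimum_delay)
    }
}
```

The behaviour the reviewer flagged corresponds to returning zero in the first branch, which would let a fully connected validator set skip the backoff entirely.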

@@ -89,6 +93,7 @@ pub(super) struct OrderedNotifierAdapter {
parent_block_info: Arc<RwLock<BlockInfo>>,
epoch_state: Arc<EpochState>,
ledger_info_provider: Arc<RwLock<LedgerInfoProvider>>,
block_created_ts: Arc<RwLock<BTreeMap<Round, Instant>>>,
Contributor

The name is confusing; I think you record the block's ordered time rather than its creation time?

Contributor Author

The block is created here for the first time, so I named it that way.

Contributor

Oh, because the DAG doesn't have blocks. Still, it's easy to confuse with Jolteon block creation; I'd just call it block_ordered_ts.
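
A sketch of how a per-round timestamp map like this one might be used, with the field renamed as the reviewer suggests. Everything beyond the `Arc<RwLock<BTreeMap<Round, Instant>>>` field type is an assumption: the `OrderedTimestamps` wrapper, its methods, the use of `std::sync::RwLock`, and the idea that the map feeds a latency measurement are illustrative only.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};
use std::time::{Duration, Instant};

/// Round number; assumed to be a plain u64 for this sketch.
type Round = u64;

/// Tracks when each round was ordered so pipeline latency can be measured.
#[derive(Clone, Default)]
struct OrderedTimestamps {
    block_ordered_ts: Arc<RwLock<BTreeMap<Round, Instant>>>,
}

impl OrderedTimestamps {
    /// Records the ordering time for `round`.
    fn record_ordered(&self, round: Round, now: Instant) {
        self.block_ordered_ts.write().unwrap().insert(round, now);
    }

    /// Returns the time between ordering and `now` for `round`, then
    /// drops the entries for that round and all earlier rounds.
    fn on_commit(&self, round: Round, now: Instant) -> Option<Duration> {
        let mut map = self.block_ordered_ts.write().unwrap();
        let latency = map.get(&round).map(|ordered| now.duration_since(*ordered));
        // split_off returns the entries with keys > round; older ones are discarded.
        let newer = map.split_off(&round.saturating_add(1));
        *map = newer;
        latency
    }

    /// Number of rounds that are ordered but not yet committed.
    fn pending(&self) -> usize {
        self.block_ordered_ts.read().unwrap().len()
    }
}
```

Keying by round in a `BTreeMap` keeps the entries sorted, which makes pruning everything up to a committed round a single `split_off` call.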

) -> (
Arc<dyn AnchorElection>,
Arc<dyn CommitHistory>,
Option<Vec<CommitEvent>>,
Contributor

I roughly remember we discussed this; the abstraction here looks weird. I'd imagine we have a concrete CommitHistory struct that can return CommitEvents, and AnchorElection can use CommitHistory (impl MetadataBackend) internally for reputation.

Contributor Author

discussed offline, will follow-up
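
The alternative the reviewer describes might be shaped like the sketch below. It is not the follow-up that was implemented: the `CommitEvent` fields, the bounded window, the `ReputationElection` type, and its toy selection rule are all assumptions, and the real `MetadataBackend` trait is deliberately left out.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, RwLock};

type Round = u64;

/// Simplified stand-in for the real CommitEvent; the fields are assumed.
#[derive(Clone, Debug, PartialEq)]
struct CommitEvent {
    round: Round,
    failed_authors: Vec<u64>,
}

/// Concrete history that keeps the most recent `window` commit events.
struct CommitHistory {
    window: usize,
    events: RwLock<VecDeque<CommitEvent>>,
}

impl CommitHistory {
    fn new(window: usize) -> Self {
        Self { window, events: RwLock::new(VecDeque::new()) }
    }

    /// Appends an event and evicts the oldest ones beyond the window.
    fn push(&self, event: CommitEvent) {
        let mut events = self.events.write().unwrap();
        events.push_back(event);
        while events.len() > self.window {
            events.pop_front();
        }
    }

    /// Returns a snapshot of the retained events, oldest first.
    fn recent_events(&self) -> Vec<CommitEvent> {
        self.events.read().unwrap().iter().cloned().collect()
    }
}

/// Election that reads the shared history internally.
struct ReputationElection {
    history: Arc<CommitHistory>,
    validators: Vec<u64>,
}

impl ReputationElection {
    /// Toy rule: rotate round-robin, skipping validators that appear as
    /// failed authors in the recent history. Falls back to plain
    /// round-robin if every validator has failed.
    fn get_anchor(&self, round: Round) -> Option<u64> {
        if self.validators.is_empty() {
            return None;
        }
        let failed: Vec<u64> = self
            .history
            .recent_events()
            .into_iter()
            .flat_map(|e| e.failed_authors)
            .collect();
        let n = self.validators.len() as u64;
        let start = round % n;
        (0..n)
            .map(|offset| self.validators[((start + offset) % n) as usize])
            .find(|v| !failed.contains(v))
            .or(Some(self.validators[start as usize]))
    }
}
```

Here the history is a single concrete owner of commit events, and the election depends on it rather than the two being returned side by side.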

.min()
.expect("must not be empty");

let max_txns = max_txns_per_round.saturating_div(
Contributor

Hmm, do we verify the payload limit on receivers today? This seems to make it impossible to verify, because receivers have different views?

Contributor Author

added a TODO
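
The concern can be made concrete with a small sketch. The divisor passed to `saturating_div` is not visible in the excerpt, so `divisor` below is a hypothetical parameter, and checking against a static configured ceiling is only one possible approach, not what the TODO commits to.

```rust
/// Limit a proposer applies, derived from its local health view.
/// `divisor` is a hypothetical stand-in; the PR excerpt does not show the real one.
fn local_max_txns(max_txns_per_round: u64, divisor: u64) -> u64 {
    // saturating_div panics on a zero divisor, so clamp it to at least 1.
    max_txns_per_round.saturating_div(divisor.max(1))
}

/// A receiver cannot recompute the proposer's local limit, because its own
/// health view may differ. This sketch checks only the static configured
/// ceiling, which every node agrees on.
fn receiver_accepts(payload_txns: u64, configured_max_txns: u64) -> bool {
    payload_txns <= configured_max_txns
}
```

Two nodes with different divisors would compute different values from `local_max_txns`, which is why a receiver-side check against the dynamic limit could reject honest proposals.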

@ibalajiarun ibalajiarun enabled auto-merge (squash) February 6, 2024 03:59

github-actions bot commented Feb 6, 2024

✅ Forge suite compat success on aptos-node-v1.8.3 ==> bf1d625b09968169673c5da6799d8c27c239c204

Compatibility test results for aptos-node-v1.8.3 ==> bf1d625b09968169673c5da6799d8c27c239c204 (PR)
1. Check liveness of validators at old version: aptos-node-v1.8.3
compatibility::simple-validator-upgrade::liveness-check : committed: 4823 txn/s, latency: 6650 ms, (p50: 5700 ms, p90: 10400 ms, p99: 20100 ms), latency samples: 183280
2. Upgrading first Validator to new version: bf1d625b09968169673c5da6799d8c27c239c204
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1772 txn/s, latency: 15665 ms, (p50: 18900 ms, p90: 21800 ms, p99: 22500 ms), latency samples: 92180
3. Upgrading rest of first batch to new version: bf1d625b09968169673c5da6799d8c27c239c204
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1409 txn/s, latency: 20505 ms, (p50: 20800 ms, p90: 27600 ms, p99: 28800 ms), latency samples: 67640
4. upgrading second batch to new version: bf1d625b09968169673c5da6799d8c27c239c204
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 2985 txn/s, latency: 9647 ms, (p50: 9800 ms, p90: 13700 ms, p99: 17500 ms), latency samples: 140320
5. check swarm health
Compatibility test for aptos-node-v1.8.3 ==> bf1d625b09968169673c5da6799d8c27c239c204 passed
Test Ok

github-actions bot commented Feb 6, 2024

✅ Forge suite realistic_env_max_load success on bf1d625b09968169673c5da6799d8c27c239c204

two traffics test: inner traffic : committed: 6281 txn/s, latency: 6081 ms, (p50: 5700 ms, p90: 8100 ms, p99: 15000 ms), latency samples: 2720060
two traffics test : committed: 100 txn/s, latency: 2269 ms, (p50: 2200 ms, p90: 2600 ms, p99: 5200 ms), latency samples: 1820
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.280, avg: 0.213", "QsPosToProposal: max: 0.176, avg: 0.165", "ConsensusProposalToOrdered: max: 0.621, avg: 0.595", "ConsensusOrderedToCommit: max: 0.536, avg: 0.499", "ConsensusProposalToCommit: max: 1.138, avg: 1.094"]
Max round gap was 1 [limit 4] at version 1258237. Max no progress secs was 4.987812 [limit 15] at version 1258237.
Test Ok

@ibalajiarun ibalajiarun merged commit 7c0ada5 into main Feb 6, 2024
42 checks passed
@ibalajiarun ibalajiarun deleted the balaji/backpressure branch February 6, 2024 05:52