Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[storage] fix the replay-verify stuck #10930

Merged
merged 1 commit into from
Nov 28, 2023
Merged

Conversation

areshand
Copy link
Contributor

@areshand areshand commented Nov 15, 2023

Description

The problem is the 2nd retry was the same process with old cache or lock on DB. Now, do the retry outside and can still leverage the resumable replay without redundant work。

using old versioned node cache will cause all non-exe thread panic and stuck forever.
also the lock on ledger db cannot be dropped even when the AptosDb goes out of scope in the 2nd iteration.

Test Plan

https://github.com/aptos-labs/aptos-core/actions/runs/6937303262

@areshand areshand force-pushed the bowu/handle_stuck_partition branch 2 times, most recently from 3fde956 to 4e9d32c Compare November 15, 2023 23:02
@areshand areshand requested a review from grao1991 November 15, 2023 23:56
@areshand areshand force-pushed the bowu/handle_stuck_partition branch from 4e9d32c to cbea735 Compare November 16, 2023 00:03
@areshand areshand marked this pull request as ready for review November 16, 2023 00:04
@areshand areshand force-pushed the bowu/handle_stuck_partition branch 11 times, most recently from 44e36f2 to 30a54ae Compare November 21, 2023 20:28
@areshand areshand enabled auto-merge (rebase) November 22, 2023 19:30

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

testsuite/replay_verify.py Outdated Show resolved Hide resolved
storage/db-tool/src/replay_verify.rs Outdated Show resolved Hide resolved
@areshand areshand force-pushed the bowu/handle_stuck_partition branch 2 times, most recently from fbaa0ce to e342092 Compare November 27, 2023 23:28

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@areshand areshand force-pushed the bowu/handle_stuck_partition branch from e342092 to e61afc9 Compare November 28, 2023 00:32

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@areshand areshand requested a review from msmouse November 28, 2023 01:09
while i < NUM_OF_RETRIES:
i += 1
(partition_number, code, msg) = func(*args, **kwargs)
print(f"[partition {partition_number}] trying {i}th time")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first print will be "2th"? 😂

And I mean.. res doesn't need to be kept across iterations if we add the return under the code == 1 branch. (updated my previous comment to use while True:, which makes more sense.

Copy link
Contributor Author

@areshand areshand Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some poetry code inspection in CI/CD and enforce there is a return from this function, since it doesn't know if code == 1 would eventually return. updated i = 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay.. then

  1. print("try {i}") should be before func(...)
  2. looks nicer to do:
(partition_number, code, msg) = (None, None, None)

for i in range(1, NUM_TRIES+1):
    print("try {i}")
    (partition_number, code, msg) = func(*args, **kwargs)
    if code != 1:
        break

return (partition_number, code, msg)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks beautify

Copy link
Contributor

@msmouse msmouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental approval, sorry.

@areshand areshand force-pushed the bowu/handle_stuck_partition branch from e61afc9 to 95f060e Compare November 28, 2023 01:28
@areshand areshand requested a review from msmouse November 28, 2023 01:30

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@areshand areshand force-pushed the bowu/handle_stuck_partition branch from 95f060e to 90bc3c6 Compare November 28, 2023 02:01

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on aptos-node-v1.7.3 ==> 90bc3c610554c4dad55b9b81a09f717c1d89580c

Compatibility test results for aptos-node-v1.7.3 ==> 90bc3c610554c4dad55b9b81a09f717c1d89580c (PR)
1. Check liveness of validators at old version: aptos-node-v1.7.3
compatibility::simple-validator-upgrade::liveness-check : committed: 4994 txn/s, latency: 6474 ms, (p50: 6300 ms, p90: 9800 ms, p99: 13900 ms), latency samples: 189780
2. Upgrading first Validator to new version: 90bc3c610554c4dad55b9b81a09f717c1d89580c
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1786 txn/s, latency: 15808 ms, (p50: 19300 ms, p90: 21800 ms, p99: 22500 ms), latency samples: 92880
3. Upgrading rest of first batch to new version: 90bc3c610554c4dad55b9b81a09f717c1d89580c
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1837 txn/s, latency: 15899 ms, (p50: 19200 ms, p90: 22000 ms, p99: 22300 ms), latency samples: 91880
4. upgrading second batch to new version: 90bc3c610554c4dad55b9b81a09f717c1d89580c
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3605 txn/s, latency: 8704 ms, (p50: 9900 ms, p90: 11400 ms, p99: 12100 ms), latency samples: 144200
5. check swarm health
Compatibility test for aptos-node-v1.7.3 ==> 90bc3c610554c4dad55b9b81a09f717c1d89580c passed
Test Ok

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 90bc3c610554c4dad55b9b81a09f717c1d89580c

two traffics test: inner traffic : committed: 7500 txn/s, latency: 5209 ms, (p50: 5100 ms, p90: 6300 ms, p99: 11400 ms), latency samples: 3255200
two traffics test : committed: 100 txn/s, latency: 2403 ms, (p50: 2300 ms, p90: 2800 ms, p99: 4000 ms), latency samples: 1800
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.204, avg: 0.196", "QsPosToProposal: max: 0.208, avg: 0.177", "ConsensusProposalToOrdered: max: 0.691, avg: 0.649", "ConsensusOrderedToCommit: max: 0.577, avg: 0.545", "ConsensusProposalToCommit: max: 1.266, avg: 1.194"]
Max round gap was 1 [limit 4] at version 1562403. Max no progress secs was 3.987601 [limit 10] at version 1562403.
Test Ok

@areshand areshand merged commit 2905166 into main Nov 28, 2023
62 of 65 checks passed
@areshand areshand deleted the bowu/handle_stuck_partition branch November 28, 2023 02:30
Copy link
Contributor

✅ Forge suite framework_upgrade success on aptos-node-v1.7.3 ==> 90bc3c610554c4dad55b9b81a09f717c1d89580c

Compatibility test results for aptos-node-v1.7.3 ==> 90bc3c610554c4dad55b9b81a09f717c1d89580c (PR)
Upgrade the nodes to version: 90bc3c610554c4dad55b9b81a09f717c1d89580c
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 4215 txn/s, latency: 4818 ms, (p50: 4800 ms, p90: 7800 ms, p99: 8400 ms), latency samples: 244520
5. check swarm health
Compatibility test for aptos-node-v1.7.3 ==> 90bc3c610554c4dad55b9b81a09f717c1d89580c passed
Test Ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants