Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Aptos CLI node stops generating blocks randomly #15630

Open
fabio-lameira-anchorage opened this issue Dec 18, 2024 · 10 comments
Open

[Bug] Aptos CLI node stops generating blocks randomly #15630

fabio-lameira-anchorage opened this issue Dec 18, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@fabio-lameira-anchorage
Copy link

fabio-lameira-anchorage commented Dec 18, 2024

🐛 Bug

I am experiencing an issue where the aptos-cli node intermittently stops generating blocks during CI testing, causing the tests to fail. This issue occurs randomly and without any obvious error messages in the logs.

To reproduce

Code snippet to reproduce
I am currently running the latest Aptos-cli v5.1.0 on Ubuntu 24.10 within a Docker container.

Node Initialization Command:

aptos node run-local-testnet --log-to-stdout --test-dir /data/regnet --bind-to 0.0.0.0

I have tried using the --performance flag, but the issue persists.
Also added the following env variables to see more logs and disable the telemetry:

- 'RUST_LOG=trace'
- 'APTOS_DISABLE_TELEMETRY=true'

Stack trace/error message
The logs from one of the times it stopped generating blocks are attached below.
logs.txt

As you can see from the logs, the node stopped generating blocks at 2024-12-18T04:36:28.072056812Z (last version: 1217).

2024-12-18T04:36:28.072056812Z {"timestamp":"2024-12-18T04:36:28.071704Z","level":"INFO","message":"[Indexer Table Info] Processed successfully","start_version":1216,"end_version":1217,"num_transactions":2,"duration_in_secs":1.022302945,"service_type":"table_info_service","step":"2","filename":"ecosystem/indexer-grpc/indexer-grpc-utils/src/counters.rs","line_number":263,"threadName":"table-info-1","threadId":"ThreadId(193)"}

There are no apparent errors or warnings in the logs that indicate the cause of this issue.

Expected Behavior

The expected behavior is that it never stops generating blocks.

System information

Please complete the following information:

Additional context

In the dockerfile I am downloading the aptos-cli-5.1.0-Ubuntu-x86_64.zip file and extracting it with the following line.

RUN curl --retry 3 --retry-connrefused -sL -o $APTOS_CLI_ZIP "https://github.com/aptos-labs/aptos-core/releases/download/aptos-cli-v${APTOS_CLI_RELEASE}/aptos-cli-${APTOS_CLI_RELEASE}-Ubuntu-x86_64.zip" && \
    echo "${APTOS_CLI_DIGEST} ${APTOS_CLI_ZIP}" | sha256sum -c --status && \
    unzip -d $BIN $APTOS_CLI_ZIP && \
    rm $APTOS_CLI_ZIP

I can provide more info about the current setup if needed.
I would appreciate any assistance in troubleshooting this issue.
Thank you for your time 🙏

@fabio-lameira-anchorage fabio-lameira-anchorage added the bug Something isn't working label Dec 18, 2024
@vineethk
Copy link
Contributor

@fabio-lameira-anchorage Is there a CLI version where you get the expected behavior? (If I understand correctly, you are observing the same behavior with aptos CLI versions: 5.1.0, 5.0.0, 4.7.0, 4.6.0, 4.5.0).

@fabio-lameira-anchorage
Copy link
Author

fabio-lameira-anchorage commented Dec 18, 2024

We only started using aptos-cli in v4.1.0 and although in that version it happened from time to time, it was not as frequent as in most recent versions. It seems to have gotten worse in most recent versions.

I saw some PRs related to deadlocks and I thought the solution would be in the latest versions but it seems not so as our CI remains flaky due to this problem.

@vusirikala
Copy link
Contributor

Thank you @fabio-lameira-anchorage for raising this issue. I don't have a linux. I tried to reproduce the issue on my mac. In the above logs, the CLI failed after 1200 transactions. In my mac, it ran over 10k transactions so far, but still no issue. I will investigate more. In the meanwhile, if it is possible, do you mind try using mac instead of linux in your CI?

@JoshLind
Copy link
Contributor

Thanks for reporting @fabio-lameira-anchorage. 😄 Couple questions:

  1. Could you also upload the logs of the validator node during a failed run? Looking through the logs you provided, the validator log location should be displayed like so:
2024-12-18T04:25:48.138994308Z 	Log file: "/data/regnet/validator.log"
  1. When blocks stop being generated, can you ping the API service on the validator node? If you ping the endpoint (e.g., localhost:8080/v1), you can see if the latest version is increasing or not. If the validator endpoint continues to make progress, it indicates the issue is elsewhere. The endpoint location is usually displayed like so:
2024-12-18T04:25:48.139010691Z 	REST API endpoint: http://0.0.0.0:8080

@fabio-lameira-anchorage
Copy link
Author

fabio-lameira-anchorage commented Dec 19, 2024

Thank you @fabio-lameira-anchorage for raising this issue. I don't have a linux. I tried to reproduce the issue on my mac. In the above logs, the CLI failed after 1200 transactions. In my mac, it ran over 10k transactions so far, but still no issue. I will investigate more. In the meanwhile, if it is possible, do you mind try using mac instead of linux in your CI?

Hello @vusirikala 👋 Thanks for looking into this issue.

Running locally I was also never able to reproduce the problem. I even left the node running for more than 10 hours several times and it never stopped.

The problem becomes apparent when we run different docker containers in a Pipeline, each container running its own node, and the CI tests calling the nodes to do funds, transfers and scan transactions.

In this Pipeline we launch 6 instances of the node, each one in its own docker container and normally of these 6 there is always 1 or 2 that stops generating blocks and causes the pipeline to fail.

Furthermore, we see no relationship between time or number of transactions. Sometimes it stops generating blocks at version 900, other times it stops after having generated 2000 versions. I didn't notice any specific pattern.

As for the Linux question, I don't know if it will be simple for us to configure and run a docker container in our CI that has macos as the base image. Anyway, I've already tested using the two Ubuntu images you make available (aptos-cli-5.1.0-Ubuntu-22.04-x86_64.zip and aptos-cli-5.1.0-Ubuntu-x86_64.zip) with Ubuntu 22.04, Ubuntu 20.04 and Debian 12.6 and I think it's not OS related.

We also thought it could be a resource problem, but by monitoring the agents that run the containers we didn't find anything abnormal in the consumption of RAM, CPU or disk. For context, we saw the same problem running the containers on machines with 16GB and 32GB of RAM.

@fabio-lameira-anchorage
Copy link
Author

Thanks for reporting @fabio-lameira-anchorage. 😄 Couple questions:

  1. Could you also upload the logs of the validator node during a failed run? Looking through the logs you provided, the validator log location should be displayed like so:
2024-12-18T04:25:48.138994308Z 	Log file: "/data/regnet/validator.log"
  1. When blocks stop being generated, can you ping the API service on the validator node? If you ping the endpoint (e.g., localhost:8080/v1), you can see if the latest version is increasing or not. If the validator endpoint continues to make progress, it indicates the issue is elsewhere. The endpoint location is usually displayed like so:
2024-12-18T04:25:48.139010691Z 	REST API endpoint: http://0.0.0.0:8080

Hello @JoshLind 👋

I will not be able to get the validator logs for the same failed run as the logs I uploaded above, as we are not exporting that file in the build artifacts. But I can try to get the validator logs for a new run which will fail eventually.

I don't see a way to ping the node when it stops generating blocks because what I do is launch multiple docker containers in CI and only notice that one of them has stopped when the pipeline fails.
At this point it is too late because when the pipeline fails, the container shuts down. If I could reproduce the error locally, it would be easier to ping the node when I realized that it had stopped generating blocks.

I'll try to get the validator logs from a new run and then I will upload them here.

Thanks for looking into this 🙏

@fabio-lameira-anchorage
Copy link
Author

fabio-lameira-anchorage commented Dec 19, 2024

Hello again @JoshLind 👋

The only way I was able to get the logs was to write validator.log to the container's stdout.
Below I attach the new log file from a new failed run that contains both the node and validator logs, mixed in the same file.

I had to break the file into 2 because github won't let me upload files larger than 25mb:
logs_with_validator_1.txt
logs_with_validator_2.txt

It is possible to see in the logs (searching for [Indexer Table Info]) that the last block generated was at 2024-12-19T15:45:46.060561028Z (version 379) and from then on a new block was never generated again.

Despite that, the node continued running for another 20 minutes without generating a new block.

2024-12-19T15:45:46.060561028Z {"timestamp":"2024-12-19T15:45:46.060431Z","level":"INFO","message":"[Indexer Table Info] Processed successfully","start_version":378,"end_version":379,"num_transactions":2,"duration_in_secs":1.028596506,"service_type":"table_info_service","step":"2","filename":"ecosystem/indexer-grpc/indexer-grpc-utils/src/counters.rs","line_number":263,"threadName":"table-info-0","threadId":"ThreadId(256)"}

After this last block is generated we see several validator logs referring to the most recent version, but we never see logs referring to a new block again. I also noticed this validator log which I found interesting:

2024-12-19T15:45:47.122442108Z 2024-12-19T15:45:47.122281Z [consensus-3] WARN consensus/src/round_manager.rs:427 Error generating and sending proposal: Fail to retrieve payload

Hope this helps debug the issue.
If you need anything else, please let me know.

Thank you for your time 🙏

@JoshLind
Copy link
Contributor

JoshLind commented Dec 24, 2024

Thanks @fabio-lameira-anchorage!

Tagging @zekun000 to take a look at the logs (specifically, logs_with_validator_2.txt). I do see some weird traces -- looks like consensus falls back to state sync, which seems like a bug, as this is a single validator deployment, so state sync shouldn't be invoked. 🤔

@fabio-lameira-anchorage
Copy link
Author

Hi @JoshLind 👋 Happy New Year!

Just checking in on this thread for any updates.
Let me know if I can help with anything.

Thanks for your help 🙏

@JoshLind
Copy link
Contributor

JoshLind commented Jan 14, 2025

Hey @fabio-lameira-anchorage 😄 Apologies for the delay. Folks are catching up and looking at internally. Will let you know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants
@vineethk @JoshLind @vusirikala @fabio-lameira-anchorage and others