[Bug] Aptos CLI node stops generating blocks randomly #15630
Comments
@fabio-lameira-anchorage Is there a CLI version where you get the expected behavior? (If I understand correctly, you are observing the same behavior with aptos CLI versions: 5.1.0, 5.0.0, 4.7.0, 4.6.0, 4.5.0). |
We only started using aptos-cli in v4.1.0. It happened from time to time in that version, but not as frequently as in the most recent versions, where it seems to have gotten worse. I saw some PRs related to deadlocks and expected the fix to be in the latest versions, but that does not seem to be the case: our CI remains flaky because of this problem. |
Thank you @fabio-lameira-anchorage for raising this issue. I don't have a Linux machine, so I tried to reproduce the issue on my Mac. In the above logs, the CLI failed after 1200 transactions. On my Mac, it has run over 10k transactions so far with no issue. I will investigate more. In the meantime, if it is possible, would you mind trying macOS instead of Linux in your CI? |
Thanks for reporting @fabio-lameira-anchorage. 😄 Couple questions:
|
Hello @vusirikala 👋 Thanks for looking into this issue. Running locally, I was also never able to reproduce the problem. I even left the node running for more than 10 hours several times and it never stopped. The problem becomes apparent when we run different Docker containers in a pipeline, each container running its own node, with the CI tests calling the nodes to fund accounts, make transfers, and scan transactions. In this pipeline we launch 6 instances of the node, each one in its own Docker container, and of these 6 there are normally 1 or 2 that stop generating blocks and cause the pipeline to fail. Furthermore, we see no relationship with elapsed time or number of transactions. Sometimes it stops generating blocks at version 900, other times it stops after having generated 2000 versions. I didn't notice any specific pattern. As for the Linux question, I don't know if it will be simple for us to configure and run a Docker container in our CI that has macOS as the base image. In any case, I've already tested the two Ubuntu builds you make available (aptos-cli-5.1.0-Ubuntu-22.04-x86_64.zip and aptos-cli-5.1.0-Ubuntu-x86_64.zip) on Ubuntu 22.04, Ubuntu 20.04, and Debian 12.6, and I don't think it is OS related. We also thought it could be a resource problem, but monitoring the agents that run the containers showed nothing abnormal in RAM, CPU, or disk consumption. For context, we saw the same problem running the containers on machines with 16GB and 32GB of RAM. |
Hello @JoshLind 👋 I will not be able to get the validator logs for the same failed run as the logs I uploaded above, as we are not exporting that file in the build artifacts. But I can try to get the validator logs for a new run which will fail eventually. I don't see a way to ping the node when it stops generating blocks because what I do is launch multiple docker containers in CI and only notice that one of them has stopped when the pipeline fails. I'll try to get the validator logs from a new run and then I will upload them here. Thanks for looking into this 🙏 |
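One way to notice a stalled node before the pipeline fails is to poll its ledger version and flag the node when the version stops advancing. The following is a minimal sketch, not something taken from this thread. It assumes the local node exposes its REST API at `http://127.0.0.1:8080/v1` and that the root endpoint returns a JSON object with a `ledger_version` field encoded as a string. The helper names `fetch_ledger_version` and `wait_for_stall`, and the 60-second threshold, are hypothetical choices.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_ledger_version(base_url: str = "http://127.0.0.1:8080/v1",
                         timeout: float = 5.0) -> int:
    """Read the current ledger version from the node's REST API root.

    Assumption: the URL and the `ledger_version` field match the local node's API.
    """
    with urllib.request.urlopen(base_url, timeout=timeout) as resp:
        info = json.load(resp)
    # Numeric fields are assumed to be reported as strings.
    return int(info["ledger_version"])


def wait_for_stall(fetch: Callable[[], int],
                   stall_after: float = 60.0,
                   poll_every: float = 5.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep,
                   max_polls: Optional[int] = None) -> Optional[int]:
    """Poll `fetch` until the version has not changed for `stall_after` seconds.

    Returns the version at which the node stalled, or None if `max_polls`
    polls completed without detecting a stall.
    """
    last_version = fetch()
    last_change = clock()
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(poll_every)
        polls += 1
        version = fetch()
        now = clock()
        if version != last_version:
            # The ledger advanced, so restart the stall timer.
            last_version, last_change = version, now
        elif now - last_change >= stall_after:
            return last_version
    return None
```

Running `wait_for_stall(fetch_ledger_version)` as a sidecar process in each container would let the CI job dump validator.log at the moment the stall is detected, rather than only after the tests time out. The clock and sleep functions are injectable so the detector can be exercised without a live node.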
Hello again @JoshLind 👋 The only way I was able to get the logs was to write validator.log to the container's stdout. I had to break the file into 2 because GitHub won't let me upload files larger than 25 MB: It is possible to see in the logs (searching for Despite that, the node continued running for another 20 minutes without generating a new block.
After this last block is generated we see several validator logs referring to the most recent version, but we never see logs referring to a new block again. I also noticed this validator log which I found interesting:
Hope this helps debug the issue. Thank you for your time 🙏 |
Thanks @fabio-lameira-anchorage! Tagging @zekun000 to take a look at the logs (specifically, |
Hi @JoshLind 👋 Happy New Year! Just checking in on this thread for any updates. Thanks for your help 🙏 |
Hey @fabio-lameira-anchorage 😄 Apologies for the delay. Folks are catching up and looking at it internally. Will let you know! |
🐛 Bug
I am experiencing an issue where the aptos-cli node intermittently stops generating blocks during CI testing, causing the tests to fail. This issue occurs randomly and without any obvious error messages in the logs.
To reproduce
Code snippet to reproduce
I am currently running the latest Aptos-cli v5.1.0 on Ubuntu 24.10 within a Docker container.
Node Initialization Command:
I have tried using the --performance flag, but the issue persists. I also added the following environment variables to see more logs and disable telemetry:
Stack trace/error message
The logs from one of the times it stopped generating blocks are attached below.
logs.txt
As you can see from the logs, the node stopped generating blocks at 2024-12-18T04:36:28.072056812Z (last version: 1217). There are no apparent errors or warnings in the logs that indicate the cause of this issue.
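To measure how long a node kept logging after its last block, the timestamps in the log can be compared directly. The sketch below is only an illustration under stated assumptions. It assumes each log line carries an RFC 3339 UTC timestamp with nanosecond precision, like the one quoted above. The `marker` argument is the substring that identifies a block-related line; its real value depends on the validator's log format and is not given here. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

# Matches timestamps such as 2024-12-18T04:36:28.072056812Z
_TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_timestamp(text: str) -> Optional[datetime]:
    """Extract the first UTC timestamp from a log line, or None if absent."""
    match = _TIMESTAMP.search(text)
    if match is None:
        return None
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    base = base.replace(tzinfo=timezone.utc)
    # datetime only stores microseconds, so drop digits beyond the sixth.
    fraction = (match.group(2) or "")[:6].ljust(6, "0")
    return base + timedelta(microseconds=int(fraction))


def silence_after_last_match(lines: Iterable[str],
                             marker: str) -> Optional[Tuple[datetime, timedelta]]:
    """Return (time of the last line containing `marker`,
    time elapsed between that line and the last timestamped line).

    Returns None if no timestamped line contains `marker`.
    """
    last_match = None
    last_seen = None
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue  # skip lines without a timestamp
        last_seen = stamp
        if marker in line:
            last_match = stamp
    if last_match is None:
        return None
    return last_match, last_seen - last_match
```

Feeding the lines of logs.txt to `silence_after_last_match` with a suitable marker would report when block production stopped and how long the node stayed up afterwards, which makes runs easier to compare across containers.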
Expected Behavior
The node should keep generating blocks for as long as it is running.
System information
Please complete the following information:
Additional context
In the Dockerfile I am downloading the aptos-cli-5.1.0-Ubuntu-x86_64.zip file and extracting it with the following line.
I can provide more info about the current setup if needed.
I would appreciate any assistance in troubleshooting this issue.
Thank you for your time 🙏