[Bug]: Mainnet state exported localnet requires second node after upgrading to SDK 47 #17078

Closed
sampocs opened this issue Jul 20, 2023 · 3 comments
@sampocs
Contributor

sampocs commented Jul 20, 2023

Context

For each Stride upgrade, the upgrade is tested on a mainnet-state-exported local network in docker (docs). Osmosis does the same (which is where localstride is borrowed from), and I assume there are many other chains that follow a similar testing process.

Summary of Bug

When testing the upgrade from SDK 46 to 47, the upgrade passed successfully, but then the network was halted immediately after. The solution wound up being that we needed to start up a 2nd node and peer it together with the first, which was able to jump start the first node. After blocks started churning again, we were able to turn off the 2nd node.

Considering we've run this mainnet-state-exported upgrade process on all prior upgrades without seeing this issue, I'm led to believe it's something related to SDK 47.
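For anyone running a similar single-node test, a halt like the one described above can be spotted by polling the node's CometBFT RPC `/status` endpoint and checking whether `latest_block_height` advances. The following is only a rough sketch: the RPC address `http://localhost:26657`, the sample count, the polling interval, and the helper names are assumptions for illustration and are not taken from the localstride setup.

```python
import json
import time
import urllib.request


def latest_height(status: dict) -> int:
    """Extract the latest block height from a parsed /status response.

    The RPC returns the height as a string, so it is converted to int.
    """
    return int(status["result"]["sync_info"]["latest_block_height"])


def fetch_status(rpc_url: str = "http://localhost:26657") -> dict:
    """Fetch and parse /status from the node (RPC address is an assumption)."""
    with urllib.request.urlopen(f"{rpc_url}/status", timeout=5) as resp:
        return json.load(resp)


def is_halted(heights: list) -> bool:
    """True if at least two samples exist and the height never advanced."""
    return len(heights) >= 2 and heights[-1] <= heights[0]


def watch(rpc_url: str = "http://localhost:26657",
          samples: int = 5, interval: float = 6.0) -> bool:
    """Sample the height a few times and report whether the chain looks halted."""
    heights = []
    for _ in range(samples):
        heights.append(latest_height(fetch_status(rpc_url)))
        time.sleep(interval)
    return is_halted(heights)


if __name__ == "__main__":
    print("halted" if watch() else "producing blocks")
```

This only reports that the height is stuck. It says nothing about the cause, which in this case was never identified.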

Version

Upgrading from v0.46.7 to v0.47.3

Steps to Reproduce

The steps to reproduce are a bit complex. They involve following this guide, starting with stride binary version v9.2.1, and upgrading to version v10.0.0.

I'm mostly posting this for awareness to other teams that test their upgrades with mainnet state exported testnets. That said, if you would like to debug this, I'm happy to hop on a call and walk through the setup to reproduce!

sampocs added the T:Bug label Jul 20, 2023
@alexanderbez
Contributor

Hi @sampocs, thanks for posting! Curious, what makes you think a 2nd node was needed? Obviously it worked after the 2nd node was started, but what indicated that this was necessary?

What were the logs from the 1st node? Was it stuck? Was it trying to produce a block?

@sampocs
Contributor Author

sampocs commented Jul 20, 2023

Unfortunately I don't have the logs handy anymore 😞. But it was producing endless p2p logs (which was the first hint).

The logs did show that the upgrade was successful and I added logs to the begin/end blocker that showed it completed the block that corresponded to the upgrade height. So this gave me the hunch that the issue was not related to any specific upgrade handler code and was likely a networking issue (this was later confirmed as this upgrade was successful on mainnet).

Truth be told, the real idea for adding a 2nd node came from an Osmosis engineer who I was debugging this with. Back in the day, he always had to add a 2nd node to these mainnet upgrade tests, before he realized that he could just disable fast sync instead. But in our case, fast sync was already disabled, and adding a 2nd node was more of a last-ditch guess since nothing else we had tried could get it to run 😂. And neither of us has any guess as to why this solved it.

Apologies, I know that's not super helpful!

@tac0turtle
Member

This is an open issue on CometBFT; we don't have anything we can do in the SDK here.
