parallel calls to blob.Submit sometimes trigger broadcast tx error: account sequence mismatch #3986

Open
zvolin opened this issue Dec 6, 2024 · 10 comments
Labels
bug Something isn't working external Issues created by non node team members

Comments

@zvolin

zvolin commented Dec 6, 2024

Celestia Node version

0.20.4

OS

nixos unstable with asahi

Install tools

docker

Others

No response

Steps to reproduce it

I'm reposting #3164 (comment) as a new issue per request from @Bidon15. Related to #3164 and #3980.

In lumina for a long time we had a lock in rpc tests that sequenced calls to blob.Submit. I tried removing it and it still quite reliably results in:

---- blob_submit_and_get stdout ----
thread 'blob_submit_and_get' panicked at rpc/tests/blob.rs:24:72:
called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: ServerError(1), message: "broadcast tx error: account sequence mismatch, expected 5, got 6: incorrect account sequence", data: None })

For reproduction

git clone git@github.com:eigerco/lumina && cd lumina
docker compose -f ci/docker-compose.yml up --build --force-recreate -d
./tools/gen_auth_tokens.sh

# remove the mentioned lock
patch -p1 <<"EOF"
diff --git a/rpc/tests/utils/client.rs b/rpc/tests/utils/client.rs
index 55beb7d..1e7681c 100644
--- a/rpc/tests/utils/client.rs
+++ b/rpc/tests/utils/client.rs
@@ -55,6 +55,6 @@ pub async fn blob_submit<C>(client: &C, blobs: &[Blob]) -> Result<u64, ClientErr
 where
     C: SubscriptionClientT + Sync,
 {
-    let _guard = write_lock().await;
+    // let _guard = write_lock().await;
     client.blob_submit(blobs, TxConfig::default()).await
 }
EOF

# run rpc tests
cargo test -p celestia-rpc

After it reproduces, all subsequent calls fail with this error
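The guard removed by the patch above serialized submissions. The following is a minimal sketch of why such a guard can help, assuming a client that reads the account sequence and then broadcasts as two separate steps. `MockChain`, `query_sequence`, `broadcast` and `submit_serialized` are hypothetical names for illustration only; they are not lumina or celestia-node APIs, and the real cause of the mismatch may differ.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy stand-in for the chain's view of one account (hypothetical, not a real API).
struct MockChain {
    expected_sequence: Mutex<u64>,
}

impl MockChain {
    /// Returns the sequence the next transaction must carry.
    fn query_sequence(&self) -> u64 {
        *self.expected_sequence.lock().unwrap()
    }

    /// Accepts the transaction only if it carries the expected sequence.
    fn broadcast(&self, sequence: u64) -> Result<u64, String> {
        let mut expected = self.expected_sequence.lock().unwrap();
        if sequence != *expected {
            return Err(format!(
                "account sequence mismatch, expected {}, got {}",
                *expected, sequence
            ));
        }
        *expected += 1;
        Ok(sequence)
    }
}

/// Submits `n` transactions from `n` threads. Each thread holds `write_lock`
/// across the whole query-then-broadcast step, so two submitters never interleave.
fn submit_serialized(n: usize) -> Vec<Result<u64, String>> {
    let chain = Arc::new(MockChain { expected_sequence: Mutex::new(0) });
    let write_lock = Arc::new(Mutex::new(()));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let chain = Arc::clone(&chain);
            let write_lock = Arc::clone(&write_lock);
            thread::spawn(move || {
                // Without this guard, two threads could read the same sequence
                // and one of the two broadcasts would be rejected.
                let _guard = write_lock.lock().unwrap();
                let sequence = chain.query_sequence();
                chain.broadcast(sequence)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let results = submit_serialized(8);
    assert!(results.iter().all(|r| r.is_ok()));
    println!("all {} submissions accepted", results.len());
}
```

In this toy model, dropping the `_guard` line makes the outcome depend on thread scheduling, which is consistent with the error appearing only sometimes.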

Expected result

no errors

Actual result

---- blob_submit_and_get stdout ----
thread 'blob_submit_and_get' panicked at rpc/tests/blob.rs:24:72:
called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: ServerError(1), message: "broadcast tx error: account sequence mismatch, expected 5, got 6: incorrect account sequence", data: None })
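
For anyone who wants to tell this failure apart from other broadcast errors in test code, here is a small sketch that pulls the two numbers out of the message. `parse_sequence_mismatch` is a hypothetical helper, not part of lumina or celestia-node, and it assumes the message keeps the exact wording shown above.

```rust
/// Extracts `(expected, got)` from a message such as
/// "account sequence mismatch, expected 5, got 6: incorrect account sequence".
/// Hypothetical helper; the format is taken from the errors quoted in this issue.
fn parse_sequence_mismatch(message: &str) -> Option<(u64, u64)> {
    // Everything after the fixed prefix, e.g. "5, got 6: incorrect account sequence".
    let rest = message
        .split("account sequence mismatch, expected ")
        .nth(1)?;
    let (expected, rest) = rest.split_once(", got ")?;
    // Keep only the leading digits of the "got" value.
    let got: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    let expected: u64 = expected.trim().parse().ok()?;
    let got: u64 = got.parse().ok()?;
    Some((expected, got))
}

fn main() {
    let message = "broadcast tx error: account sequence mismatch, expected 5, got 6: incorrect account sequence";
    println!("{:?}", parse_sequence_mismatch(message));
}
```

Messages that do not contain the mismatch wording, such as the "not found; it was likely rejected" error, yield `None`.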

Relevant log output

No response

Is the node "stuck"? Has it stopped syncing?

No response

Notes

No response

@zvolin zvolin added the bug Something isn't working label Dec 6, 2024
@github-actions github-actions bot added the external Issues created by non node team members label Dec 6, 2024
@Wondertan
Member

@zvolin, do you mind testing whether #3980 indeed fixes the issue, or whether this is a separate one?

@zvolin
Author

zvolin commented Dec 6, 2024

Sure. It mostly fixes parallel submission: the sequence error is gone. It uncovers another issue though, which seems to happen much less frequently:

called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: ServerError(1), message: "transaction with hash AC0CD98DCDB5101CE4DA24983448BC5A9850339A16CD472243D9AC08F36EDB26 not found; it was likely rejected", data: None })

@zvolin
Author

zvolin commented Dec 6, 2024

Looking at the code there, I think it's more of a workaround than a fix. Will it still fire up when using gRPC to communicate with the app node directly?

@cmwaters
Contributor

cmwaters commented Dec 6, 2024

Do you know how the other transactions on other threads fail? If they are all nonce mismatches, that's odd, because there is a lock on the server side. However, if one fails for some other reason (e.g. not enough gas), then all subsequent transactions will fail with a nonce error.
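
One possible reading of that cascade, as a toy model: transactions are signed with consecutive sequences up front, the first is rejected for a non-nonce reason, and the ones behind it then carry sequences the chain does not expect yet. `Account`, `Tx`, `TxError`, `check_tx` and the gas numbers are hypothetical and do not reflect celestia-app's actual checks or their order.

```rust
/// Toy model of a chain-side check (hypothetical; not celestia-app code).
struct Account {
    expected_sequence: u64,
}

#[derive(Debug, PartialEq)]
enum TxError {
    OutOfGas,
    SequenceMismatch { expected: u64, got: u64 },
}

struct Tx {
    sequence: u64,
    gas_limit: u64,
}

impl Account {
    /// The expected sequence only advances when a transaction is accepted.
    fn check_tx(&mut self, tx: &Tx, gas_needed: u64) -> Result<(), TxError> {
        if tx.sequence != self.expected_sequence {
            return Err(TxError::SequenceMismatch {
                expected: self.expected_sequence,
                got: tx.sequence,
            });
        }
        if tx.gas_limit < gas_needed {
            return Err(TxError::OutOfGas);
        }
        self.expected_sequence += 1;
        Ok(())
    }
}

/// Signs three transactions with sequences 5, 6 and 7 up front, then checks them in order.
fn run_pipeline() -> Vec<Result<(), TxError>> {
    let mut account = Account { expected_sequence: 5 };
    let txs = [
        Tx { sequence: 5, gas_limit: 10 },  // rejected: too little gas
        Tx { sequence: 6, gas_limit: 100 }, // rejected: chain still expects 5
        Tx { sequence: 7, gas_limit: 100 }, // rejected: chain still expects 5
    ];
    txs.iter().map(|tx| account.check_tx(tx, 50)).collect()
}

fn main() {
    for result in run_pipeline() {
        println!("{:?}", result);
    }
}
```

In this model only the first failure has a different cause; every later transaction reports a sequence mismatch against the same expected value.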

@i1i1

i1i1 commented Dec 6, 2024

I've also hit the same issue. Here is the code to reproduce: https://gist.github.com/i1i1/334739168b97242e1ba3812872ecff00

I also sometimes hit the error:

transaction with hash AC0CD98DCDB5101CE4DA24983448BC5A9850339A16CD472243D9AC08F36EDB26 not found; it was likely rejected

I also tried the version with the fix from #3980, and now it always hits the error about the transaction being rejected.

@zvolin
Author

zvolin commented Dec 6, 2024

> However, if one fails for some other reason (not enough gas) then all subsequent transactions will fail for nonce error

Honestly, this sounds like an issue on its own. I will check that tomorrow in my repro, but it doesn't seem to be the case in what @i1i1 has shown.

@zvolin
Author

zvolin commented Dec 9, 2024

Another error that occurred when running with #3980:

---- blob_submit_and_get_all stdout ----
thread 'blob_submit_and_get_all' panicked at rpc/tests/blob.rs:65:67:
called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: ServerError(1), message: "broadcast tx error: signature verification failed; please verify account number (7) and chain-id (private): unauthorized", data: None })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@zvolin
Author

zvolin commented Dec 9, 2024

> Do you know how the other transactions on other threads fail?

So far I haven't seen any issue with regard to gas. Here is the output including the other threads:

failures:

---- blob_subscribe stdout ----
thread 'blob_subscribe' panicked at rpc/tests/blob.rs:134:70:
called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: ServerError(1), message: "broadcast tx error: account sequence mismatch, expected 11, got 13: incorrect account sequence", data: None })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- blob_get_get_proof_wrong_ns stdout ----
thread 'blob_get_get_proof_wrong_ns' panicked at rpc/tests/blob.rs:178:72:
called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: ServerError(1), message: "transaction with hash 71951AADE0670847FEB01D2242EB568E881E0678757CC001867FBDCF96CF6661 not found; it was likely rejected", data: None })

---- blob_submit_and_get_large stdout ----
thread 'blob_submit_and_get_large' panicked at rpc/tests/blob.rs:97:72:
called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: ServerError(1), message: "transaction with hash 3ABF61428FAE28C7816A6852CF559915B159C2542F78CBD4089808127E406B31 not found; it was likely rejected", data: None })


failures:
    blob_get_get_proof_wrong_ns
    blob_submit_and_get_large
    blob_subscribe

@cmwaters
Contributor

So it looks like there is a concurrency-related issue, not in the TxClient on the client side but within the mempool of the consensus node (i.e. in the server itself). More info can be found here: celestiaorg/celestia-core#1552.

I will look to provide a patch

@zvolin
Author

zvolin commented Dec 19, 2024

> but not with the TxClient on the client side but within the mempool of the consensus node

So I think there is still some issue on the client side; I'm not sure whether it is in celestia-app or in celestia-node (likely in the app though). I just wrote an initial TxClient for Rust, and it has no issues when submitting hundreds of transactions in parallel, while I can easily reproduce the error with RPC calls into celestia-node. Both clients are running with celestia-app 3.0.2.
