Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protobuf: buffer underflow #729

Closed
activenodes opened this issue May 26, 2023 · 72 comments · Fixed by #903
Closed

Protobuf: buffer underflow #729

activenodes opened this issue May 26, 2023 · 72 comments · Fixed by #903

Comments

@activenodes
Copy link

activenodes commented May 26, 2023

Chains with:

Every ~40 sigratures (softsign due to low time blocks) connection go down with this error:

  • protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow

Let me know if you need any more details.

Thanks @tony-iqlusion

@tony-iqlusion
Copy link
Member

What is on the other side of the privval connection? You're the first to report an error like this in many years of TMKMS existing, so it seems like the other side is the likely culprit.

@activenodes
Copy link
Author

activenodes commented May 26, 2023

It's a testnet (Sei Network).
Other 3/4 guys can confirm the same issue (like @mkaczanowski)

@imperator-co
Copy link

Hi, I can confirm that we have the same problem, afaik "buffer underflow" indicates an attempt to read a message that is larger than the amount of data available in the buffer (on tmkms side)

@tony-iqlusion
Copy link
Member

It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.

@activenodes
Copy link
Author

It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.

I'm not a Horcrux user, but it works fine in this chain too.
What can we provide you to analyze the cause?

@tony-iqlusion
Copy link
Member

I need a much better description of the problem than this, and instructions on how to reproduce it. Ideally it would be nice to have a dump of the incoming Protobuf which TMKMS can't parse.

What chain are you running?

How did you install it?

How are you running it?

Which message is causing the problem? Can you link to the Protobuf Schema for it?

@qezz
Copy link

qezz commented May 30, 2023

Expanding a bit on this issue, we, too, faced this issue on Sei testnet (atlantic-2).

From what I understand, Sei has its own fork of tendermint - https://github.com/sei-protocol/sei-tendermint. So I believe the schema is https://github.com/sei-protocol/sei-tendermint/tree/main/proto, but I'm not 100% sure.

As for us, we were running tmkms as usual, as a separate process, and pointing seid (https://github.com/sei-protocol/sei-chain) to it via config parameter (which sets unix socket). The mode was softsign.

@tony-iqlusion
Copy link
Member

I'm curious if they're expecting to use gRPC for the privval connection: https://github.com/sei-protocol/sei-tendermint/tree/main/privval

We don't support that yet: #73. We've been waiting for it to be added to upstream Tendermint, since it was removed in v0.35.

What version of Horcrux is it working with?

@activenodes
Copy link
Author

My idea is to prepare you an environment with everything configured (chain/validator/tmkms) and give you access.
Can you be interested?

@tony-iqlusion
Copy link
Member

That would be very helpful, thank you

@activenodes
Copy link
Author

That would be very helpful, thank you

I'm getting that server ready for you.
Where I can send you logins (SSH)?
If you want to contact me on Discord (Simo | Active Nodes#3233) or give me an email.
Thanks!

@tony-iqlusion
Copy link
Member

I will attempt to contact you via Discord

@activenodes
Copy link
Author

I will attempt to contact you via Discord

Changed permissions... Could you try again with Discord (Simo | Active Nodes#3233)?
I'm not using Signal...
Thanks

@tony-iqlusion
Copy link
Member

Request sent

@IbrarMakaveli
Copy link

IbrarMakaveli commented Aug 14, 2023

Hey @tony-iqlusion , Any update on this issue ? Let me know how I can help to deal with, thanks

@activenodes
Copy link
Author

Note. regarding previous tests and SSH access, now devnet is suspended. So we have to use testnet or mainnet chain

@tony-iqlusion
Copy link
Member

tony-iqlusion commented Aug 16, 2023

I haven't had time to look, sorry. Note that there's a Hub upgrade today and a Stride upgrade tomorrow so I don't have much time this week either.

What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.

@activenodes
Copy link
Author

What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.

Should we increase the debug level?
Or how to collect as many logs as possible?

Thanks

@tony-iqlusion
Copy link
Member

You can try adding a dbg!(&msg) here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36

Out of curiosity, do you have protocol_version = "0.34" in the config?

@activenodes
Copy link
Author

activenodes commented Aug 17, 2023

You can try adding a dbg!(&msg) here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36

Thanks, I will try later in testnet

Out of curiosity, do you have protocol_version = "0.34" in the config?

Yes protocol_version = "v0.34" (with the "v") ... I think I have all the chains like this. is it right?

@activenodes
Copy link
Author

Here: https://pastebin.com/4wDMbwP3

ERROR tmkms::client: [atlantic-2@tcp://127.0.0.1:26658] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow`

@tony-iqlusion
Copy link
Member

Thanks, I believe this should be the message which is causing the issue encoded in hex:

ba2b2ab72b0aa82b08201091a6cc0d20ffffffffffffffffff012a480a20a33e4740e0b974cf1e010e6794ff620487e52ad63718d77e46ba3a62e1adc406122408011220cf19b022e389f5cd701b17b800caf505171bb98ae967349da72513b2138c1820320c08f1fbf8a60610a7bab0920242220a20bb985439aebaf357fad861849528d82e763031abd79fcc87c3a305bc1485388442220a2039825ada697b2d6959d75fa711636c7627c450c87ed8fe1aeaba4f20e28d435042220a20d9c22318e69463191372f92f66e5bdbf859c9b0e52020753785df5f53ea4bacb42220a20fc7ae857c36f4ea2f56285281e283ccdd600f4934183574bf5fd0ebb4c278cf942220a20ab49231edaa3fec0921d42b85cf90ba61d5251144980fa16f62a842a201755ad42220a20a7a17a651bfc02228490723a752a96e9f8ac3d64d261377726e2cdb0ccc646e342220a203178335d0ebe7e0ce2905df5b23aa05b9295e09caf80ba45472d2f61ccc9fb2f42220a20887cddac11b90e7a71888fefec183600dd79b94e9c8c43fb24d66038970eb91242220a202bfaf4b02410e8ba736e0443d63543e855cb4fb1eb99522a66d764802392700e42220a209a3e9332f29019d03097bcf290f377939b6bf8ee6a9d1952920cb9433515ca9542220a20dd148712e5a43ca35a0ed3560d6e753598da4e2ff71bd16d9dd553eddc4adcde42220a207e1df69adb7f32259154e375e20e98cf0a1c95b9f78ef1349df3ab539ab615be42220a2070a1e82287b3775ab287f7fe4e803c9bce2b03cbb4d0dba83b86104a114cb6e542220a207783770f1b6f8bdb347a8a8f113ecc67842783b4afbebb608aced8a51bb29bce42220a2008f7c95794aec249d2eba33161199c28574bc77f13accff28e5d70c23c3ebbd642220a20b53a7d779ebd7d0778aa9bb66e4565fe80903cfac23268ec016f2cf00818be9042220a20fa2f7fe3daa69c0cdd44058a44734bb6d44fce3e0c4db585e77a5190869137a142220a20f26cd81b78fa41bd46df409302c976f3451b3f0065a0e1e3e8dcb9947fd7d07c42220a20d002698eebbfd05a20492eaae6ed769068fac6cf0f3a55b6869ae8cacf5b92bf42220a20c1a848f44dfc893511262c49dd48072eb50472d49398c44b27044a028c76113b42220a2089c0affd2d4c2151cc654eebec6f0d29302ed611664965616e534e3b854eb43f42220a2084df4847e6d5998ac091ff0d6905400c3c7264855277e88cf06a184b81d2676b42220a2010e5cb10974a839fdd1584f1efc256d20fbf9d710c40ac4b968e23cb4927125b42220a20b12308f12fac8ecdc13b84feba49bf617ae0fb084b8d5fa8e8c975ee7f38c7954a0052ad200890a6cc0d1a480a203dee68aaa35b98433eba6c0416ec13403e50e3a2afb6ed1beaca515bbd7a98fc

I'll try to take a look and see if I can figure out what it's supposed to be

@tony-iqlusion
Copy link
Member

If I plug that string into: https://protobuf-decoder.netlify.app/

...it doesn't appear to have a valid field number, which would need to match one of these field types (1 - 8):

https://docs.rs/tendermint-proto/latest/src/tendermint_proto/prost/v0_38/tendermint.privval.rs.html#79-96

...but the decoder says if that is a protobuf, its field number is 695, which would be a very unusual field number for a proto (they start at 1 and generally don't much higher than 30 typically)

So I have no idea what that message is supposed to be or how/why Horcrux would support it because it doesn't seem to be encoded correctly as a proto as far as I can tell.

It's possible Horcrux simply ignores messages it doesn't understand, whereas TMKMS considers it an error.

@tony-iqlusion
Copy link
Member

Does anyone happen to know what that message might be, or know where to ask about it?

@IbrarMakaveli
Copy link

I don't really know too, if you could ignore this message, and deploy the code on a new branch, I could test and compile the tmkms on a machine and do the test on the testnet.
We could ask Sei dev, but I'm not sure they know what it means either.

@mkaczanowski
Copy link
Contributor

mkaczanowski commented Sep 17, 2023

I found the issue and our (Chorus One) testnet node runs with the patched / hacked tmkms version and everything seems to be fine.

TL;DR

The issue is that the serialized block proposal (SIGNED_MSG_TYPE_PROPOSAL) for weights from 4000 bytes to 6000 (eyeballed logs with additional log messages). The buffer defined below is 1024 bytes only:
https://github.com/iqlusioninc/tmkms/blob/main/src/rpc.rs#L197

resulting in message being always cut, thus malformed.

Proof of concept fix

I simply bumped the buffer size to:

let mut buf = vec![0; DATA_MAX_SIZE * 10];

Proper fix

Now I wonder whether this limit is actually correct in this context:
https://docs.rs/tendermint-p2p/latest/tendermint_p2p/secret_connection/constant.DATA_MAX_SIZE.html

because this is not strictly p2p message (between peers in tendermint network), but rather a node <-> signer relationship.

Now the question is how to fix it, shall I make the buffer len configurable or hardcode higher value, or does anyone has a better idea?

another option to fix it, could be simply using dynamic allocation via Vec::new():
https://doc.rust-lang.org/std/io/trait.Read.html#method.read_to_end

@tony-iqlusion could you share your thoughts here?

@tony-iqlusion
Copy link
Member

@mkaczanowski sounds like a bug/issue in the tendermint-p2p crate. I would suggest opening an upstream issue on tendermint-rs

@mkaczanowski
Copy link
Contributor

issue created: informalsystems/tendermint-rs#1356

@mkaczanowski
Copy link
Contributor

funny enough this value comes from tmkms:
https://github.com/informalsystems/tendermint-rs/pull/696/files

(git blame)

@IbrarMakaveli
Copy link

@tony-iqlusion I've just tried again with this tag, I still have this error, has anyone been able to test with SoftSign ?
Is there a Linux dependency ? A parameter to change ?

@qezz
Copy link

qezz commented Jan 16, 2024

Getting back to this issue. Today I discovered that

  • using unix socket works just fine
  • using (local) tcp connection produces buffer underflow

more specifically,

2024-01-16T18:18:31.875640Z DEBUG tmkms::session: [atlantic-2@tcp://...:26759] sending response: PublicKey(PubKeyResponse { pub_key: Some(PublicKey { sum: Some(Ed25519([ ... < cut >
2024-01-16T18:18:32.056127Z ERROR tmkms::client: [atlantic-2@tcp://...:26759] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
2024-01-16T18:18:33.056223Z DEBUG tmkms::session: [atlantic-2@tcp://...:26759] connecting to validator...
2024-01-16T18:18:33.056288Z  INFO tmkms::connection::tcp: KMS node ID: <nodeid>
2024-01-16T18:18:33.056706Z  INFO tmkms::session: [atlantic-2@tcp://...:26759] connected to validator successfully
2024-01-16T18:18:33.056712Z  WARN tmkms::session: [atlantic-2@tcp://...:26759]: unverified validator peer ID! (<id>)
2024-01-16T18:18:33.156651Z ERROR tmkms::client: [atlantic-2@tcp://...:26759] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
2024-01-16T18:18:34.156745Z DEBUG tmkms::session: [atlantic-2@tcp://...:26759] connecting to validator...
2024-01-16T18:18:34.156817Z  INFO tmkms::connection::tcp: KMS node ID: <nodeid>

then it manages to connect to validator for several blocks, and then fails again

@IbrarMakaveli
Copy link

Thanks for the feedback, I was thinking that I'm not the only one with this problem, how do you use unix socket on tmkms ?

@qezz
Copy link

qezz commented Jan 17, 2024

how do you use unix socket on tmkms?

we use it for local connection only, so tmkms should be running on the same host (n.b. there are cons to this approach, as you may guess).

so for the cosmos chain

# config.toml
[priv-validator]
laddr = "unix://path/to/somewhere/kms.sock"

and in tmkms config you set the same address

[[validator]]
addr = "unix://path/to/somewhere/kms.sock"

(it's pretty late for me, so I hope I copypasted the correct thing)

anyway will look into it shortly (on Wed or Thu) and try to make it work

UPD: IIRC, the cosmos chain will create the socket, so tmkms will try connect to it. I would also recommend cleaning up (i.e. removing) the socket on every (re)start

@tony-iqlusion
Copy link
Member

I can reopen this issue, however really issues should be filed against tendermint-p2p:, similar to this one: informalsystems/tendermint-rs#1356

@tony-iqlusion tony-iqlusion reopened this Jan 17, 2024
@IbrarMakaveli
Copy link

Do you have any news on this, if I can help with anything, unfortunately we're using tmkms on a remote machine, we can't make a Unix Socket connection, in the short term isn't there a workaround ? Thanks

@tony-iqlusion
Copy link
Member

@IbrarMakaveli I filed informalsystems/tendermint-rs#1392 to request upstream help debugging this problem.

What would be extremely helpful here is if someone could add reproduction instructions to that issue, especially if the issue is reproducible directly via the tendermint-p2p crate without involving TMKMS (or isolating TMKMS as the problem)

@tony-iqlusion
Copy link
Member

Can someone attempt to reproduce this on a fresh install, which should use tendermint-p2p v0.34.1?

@qezz
Copy link

qezz commented Mar 5, 2024

Rolled out 0.14.0-pre.1 to our Sei testnet validator, will let you know

Using tcp connection within the same host

@qezz
Copy link

qezz commented Mar 5, 2024

Unfortunately the same error

image

Several signed blocks, then an underflow error

2024-03-05T20:18:43.118481Z DEBUG tmkms::session: [atlantic-2@tcp://...:51759] received request: ShowPublicKey
2024-03-05T20:18:43.118506Z DEBUG tmkms::session: [atlantic-2@tcp://...:51759] sending response: PublicKey(PubKeyResponse { pub_key: Some(PublicKey { sum: Some(Ed25519([162>
2024-03-05T20:18:43.314290Z ERROR tmkms::client: [atlantic-2@tcp://...:51759] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
2024-03-05T20:18:44.314384Z DEBUG tmkms::session: [atlantic-2@tcp://...:51759] connecting to validator...
2024-03-05T20:18:44.314456Z  INFO tmkms::connection::tcp: KMS node ID: ...
2024-03-05T20:18:44.314862Z  INFO tmkms::session: [atlantic-2@tcp://...:51759] connected to validator successfully
2024-03-05T20:18:44.314869Z  WARN tmkms::session: [atlantic-2@tcp://...:51759]: unverified validator peer ID! (a47c7867b3191c93eed4bf0f01a9d4bc95a193ac)
2024-03-05T20:18:44.414468Z ERROR tmkms::client: [atlantic-2@tcp://...:51759] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow

@tony-iqlusion
Copy link
Member

@qezz can you confirm that tendermint-p2p v0.34.1 was used in the build? (I can pin it in the next prerelease)

@qezz
Copy link

qezz commented Mar 5, 2024

let me check

@qezz
Copy link

qezz commented Mar 5, 2024

In the build log it says

...
   Downloaded tendermint-p2p v0.34.1
...
   Compiling tendermint-proto v0.34.1
   Compiling yubihsm v0.42.1
   Compiling tendermint v0.34.1
   Compiling cosmos-sdk-proto v0.20.0
   Compiling tendermint-p2p v0.34.1
   Compiling tendermint-config v0.34.1

@tony-iqlusion
Copy link
Member

Thanks, I reopened this issue: informalsystems/tendermint-rs#1392 (comment)

@datanexus-vincent
Copy link

We're running into the same issue very consistently with Initia's testnet when we turn on the oracle, which I believes adds a significant amount of data to the TMKMS requests.

2024-06-14T01:46:26.861895Z ERROR tmkms::client: [initiation-1@tcp://validator:26658] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
2024-06-14T01:46:27.862227Z DEBUG tmkms::session: [initiation-1@tcp://validator:26658] connecting to validator...
2024-06-14T01:46:27.864663Z  INFO tmkms::session: [initiation-1@tcp://validator:26658] connected to validator successfully

Has anyone attempted to build with @zarkone's upstream PR?

@mkaczanowski
Copy link
Contributor

mkaczanowski commented Jun 14, 2024

@datanexus-vincent did you try to build on top of:
https://github.com/iqlusioninc/tmkms/pull/903/files

I think that shall fix your issue (as it did fix it for SEI). Though we haven't checked the Initia yet.

@tony-iqlusion we'd appreciate your PR review :)

@datanexus-vincent
Copy link

@mkaczanowski I did once I realized it wasn't an upstream PR but a PR for this repo, and it worked! Thanks for the effort to get that working.

@tombeynon
Copy link

We are seeing this issue on dYdX mainnet currently - I suspect related to the number of prices including in the vote extension, but unsure. It manifests as the protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow message for a number of blocks, the node stalls, then catches up and starts signing again. Repeat every few hours, although it seems to be getting more frequent.

@tony-iqlusion
Copy link
Member

@tombeynon can you try this branch and see if it fixes your problem: #903

@tombeynon
Copy link

@tony-iqlusion Just switched over, I'll let you know tomorrow morning if it resolved it. Thanks!

@tombeynon
Copy link

@tony-iqlusion no issues at all since switching to that branch, definitely seems to have solved the problem

@zarkone
Copy link
Contributor

zarkone commented Aug 6, 2024

@tombeynon similarly to Sei, dydx has forked cometbft with custom proto declarations. It's not exactly the same as in case of Sei, but, according to data from you it should be something very similar.

Sei situation described in this PR: sei-protocol/sei-tendermint#240

@tombeynon
Copy link

@zarkone Okay great, I'll communicate this with the dYdX team. That PR is very well detailed, thank you!

@zarkone
Copy link
Contributor

zarkone commented Aug 6, 2024

@tombeynon thanks! Might make sense to confirm that it's the same and find out which .proto msg exactly triggers the underflow error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants