remove extended Proposal payload before sending to remote signer (fixes buffer underflow) #240
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This pull request addresses the problem of buffer underflow problem, which happens when tmkms signer is plugged over tcp to Sei.
Original issue, opened in tmkms repo iqlusioninc/tmkms#729
Even though the issue is opened in tmkms, I believe that fix should be applied in Sei.
What happens
This snippet illustrates the difference between Sei's
Proposal
and originalProposal
structure.As you can see, Tendermint protocol has a slight design flaw:
Proposal
is implicitly assumed to have a fixed length, but it doesn't declare it explicitly. Remote signer implementations assume that protobuf message, which is sent over the wire is always less than 1024 bytes.And this is not an issue for networks where
Proposal
holds the original structure, but not for Sei. When Sei accumulates a certain amount ofTxKeys
,Proposal
becomes bigger than 1024 bytes. When remote signer reads it, it reads 1024 bytes and tries to unmarshal it it fails withbuffer underflow
.Why not to fix it in remote signer?
Yes, at first glance, it feels that remote signer can just aggregate multiple chunks and unmarshal it once it gets the all the chunks. Why not do just do that?
The reason why it is not possible (at least, in efficient way) is hidden in another issue with old Tendermint (and Sei) tcp signer implementation.
When protobuf message is being send over the wire, it suppose to have a length encoded in the beginning of the message. And Sei puts the length, however, it is not the length of the whole message -- this is the length of a chunk.
See here:
sei-tendermint/privval/secret_connection.go
Lines 210 to 219 in 2568028
So instead of sending length in a first chunk, it sends a length of each chunk in each chunk, which makes it impossible to aggregate it on wire level. It becomes possible on protobuf level, in a very ugly way -- see the implementation here iqlusioninc/tmkms#903 .
In short, I've made a loop which aggregates chunks (with removed length) and tries to unmarshall it on each new chunk. Which means, if
Proposal
is e.g. 7kb it will try to unmarshal it 7 times: this solution is pretty fragile.What happens in this PR
While I was debugging the issue, I've noticed that even though tmkms gets Sei's
Proposal
, since it doesn't have Sei's.proto
files it uses original Tendermint.proto
to unmrashal theProposal
. So it signs the original proposal, or, "canonical", if you will, but not the extended Sei'sProposal
. However, it works fine on Sei Testnet without any issues. I dig further and found that local signer also uses stripped down Proposal, doingCanonicalizeProposal
before signing (see privval/file.go#L445).Note
I don't understand why only part of proposal is being signed -- maybe you could point me where to read about this, how does this work?
In any case, I figured that since Sei accepts short
Proposal
we can just remove additional fields. In this PR, I do it right before sending bytes because of the nature of protobuf: I don't see a sane way of how to send a different proposal without changing all the types in function definitions.If you know a better way how to do that please let me know: for now, you can treat this PR as an expression of intention of what I want to happen, rather then complete implementation.
Misc
The described problem of chunk length instead of msg length is out of scope here. As far as I can see, it is the case for cometBFT as well, since the code looks the same: cometbft/p2p/conn/secret_connection.go#L221 -- however, I didn't try to break it, it is hard to do without an extended
Proposal
.As for tmkms, even though I have a PR with workaround open, it is rather a hotfix for us and other validators.
The length problem should be addressed on a wire level, in tendermint-p2p: it doesn't do any aggregation and reads only one chunk now (see tendermint-rs/p2p/secret_connection.rs#L618). But it makes sense to fix only after it will be fixed in producers.
How to test?
Testing could be simple or could be tricky, depending on what tooling you already have. In order to see failure, you just need to plug Sei to tmkms over TCP. You need to have bigger transactions, so that TxKeys will become big enough to cause overflow. You can see observe it on testnet without any efforts, since oracle-feeder puts 30-50 txs almost in every block.
Locally, you can send a lot of transactions manually, something like
-- this works for me.
So you can check that without patch tmkms (and Sei) keep failing with buffer overflow, and with patch it works well.