SIMD 0195 - tpu vote using QUIC #195
---
simd: '0195'
title: TPU Vote using QUIC
authors:
- Lijun Wang <[email protected]>
category: Standard
type: Core
status: Review
created: 2024-11-13
development:
- Anza - WIP
- Firedancer - Not started
---

## Summary

Use QUIC for transporting TPU votes among Solana validators. This requires
supporting the receipt of QUIC-based TPU vote packets on the server side and
the sending of QUIC-based TPU vote packets on the client side.

## Motivation

Because timely vote credits are awarded to validators, they may be
incentivized to increase TPU vote traffic to ensure their votes are received
in a timely manner. This could cause congestion and reduce overall TPU vote
processing effectiveness. The current UDP-based TPU vote path has no flow
control mechanism.

We propose to apply the pattern used for TPU transaction processing to TPU
vote processing: utilize the flow control mechanisms already developed for it,
including built-in QUIC protocol-level flow control and application-level rate
limiting on connections and packets.

## Alternatives Considered

There is no readily available alternative to QUIC that addresses all of the
requirements, such as security (reliability when applying QoS), low latency,
and flow control. TLS over TCP could solve the security and flow control
requirements, but it raises latency and head-of-line blocking concerns. We
could also build a custom rate limiting mechanism directly on top of UDP, but
that is non-trivial and cannot solve the security problem without also relying
on some form of cryptographic handshake.

## New Terminology

None

## Detailed Design

On the server side, the validator will bind to a new QUIC endpoint. Its
corresponding port will be published to the network in the ContactInfo via
Gossip. The client side will use the TPU vote QUIC port published by the
server to connect to the server.

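As an illustration of this port discovery flow, here is a minimal sketch. The struct and names (`tpu_vote_quic_port`, `tpu_vote_addr`) are assumptions chosen for illustration, not the actual Agave `ContactInfo` API:

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

// Hypothetical, simplified stand-in for the gossip ContactInfo record;
// the real struct and its accessors differ.
struct ContactInfo {
    ip: IpAddr,
    tpu_vote_udp_port: u16,
    tpu_vote_quic_port: u16,
}

impl ContactInfo {
    // Pick the advertised TPU vote address; clients select the QUIC port
    // once QUIC voting is enabled.
    fn tpu_vote_addr(&self, use_quic: bool) -> SocketAddr {
        let port = if use_quic {
            self.tpu_vote_quic_port
        } else {
            self.tpu_vote_udp_port
        };
        SocketAddr::new(self.ip, port)
    }
}

fn main() {
    let info = ContactInfo {
        ip: IpAddr::V4(Ipv4Addr::new(203, 0, 113, 7)),
        tpu_vote_udp_port: 8005,
        tpu_vote_quic_port: 8007,
    };
    assert_eq!(info.tpu_vote_addr(true).to_string(), "203.0.113.7:8007");
    assert_eq!(info.tpu_vote_addr(false).port(), 8005);
    println!("ok");
}
```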
TPU votes will use the same QUIC implementation as regular transaction
transport. Both the client and the server use their validator identity key to
sign the TLS certificate, which is used to verify the validator's identity.
This matters especially on the server side, which provides QoS based on the
client's stake (stake-weighted QoS) by checking the client's Pubkey.

Once a QUIC connection is established, the client can send vote transactions
using QUIC unidirectional (UNI) streams. In this design, a stream is used to
send one single vote transaction, after which the stream is closed. Opening
and closing a stream is a lightweight operation that can be done in one shot
over the persistent connection: open the stream, send, and close.

The server only accepts connections from staked nodes that are able to vote.
Connections from unstaked nodes are rejected with the `disallowed` code.

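A minimal sketch of this admission check, assuming a hypothetical `accept_vote_connection` helper and a plain stake map (the real server would consult its current view of network stakes):

```rust
use std::collections::HashMap;

// Hypothetical helper: admit only clients whose Pubkey (here a plain string
// for illustration) carries nonzero stake; reject the rest with the
// `disallowed` close code from the proposal.
fn accept_vote_connection(
    client: &str,
    stakes: &HashMap<String, u64>,
) -> Result<(), &'static str> {
    match stakes.get(client) {
        Some(stake) if *stake > 0 => Ok(()),
        _ => Err("disallowed"), // unstaked node: reject the connection
    }
}

fn main() {
    let mut stakes = HashMap::new();
    stakes.insert("staked-validator".to_string(), 42u64);
    assert!(accept_vote_connection("staked-validator", &stakes).is_ok());
    assert_eq!(accept_vote_connection("unstaked-node", &stakes), Err("disallowed"));
    println!("ok");
}
```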
The following QoS mechanisms are employed:

* Connection rate limiting across all clients.
* Connection rate limiting per source IP address. This limit applies before
  handshake completion; after the handshake, stake validation is based on the
  client's Pubkey.
* Total concurrent connections from all clients -- this is set to 2500, which
  is more than enough for votes on mainnet-beta, and implementations may make
  it configurable.
* Max concurrent connections from a client Pubkey -- this is set to 1 for
  votes.
* Max concurrent streams per connection -- this is allocated based on the
  ratio of the validator's stake to the total stake of the network, giving
  higher-staked validators more bandwidth and a better chance of landing
  their votes.
* Maximum number of vote transactions per unit time, which is also stake
  weighted.

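The stake-weighted stream allocation above can be sketched as simple proportional arithmetic. The stream budget constant below is a made-up illustration, not a mainnet value:

```rust
// Allocate a per-connection concurrent-stream budget in proportion to the
// client's stake relative to total network stake, with a floor of 1 so a
// staked voter can always make progress.
fn max_streams_for(stake: u64, total_stake: u64, stream_pool: u64) -> u64 {
    if total_stake == 0 {
        return 1;
    }
    // Widen to u128 to avoid overflow in the multiplication.
    ((stream_pool as u128 * stake as u128) / total_stake as u128).max(1) as u64
}

fn main() {
    let pool = 2048; // hypothetical total stream budget, not a real constant
    assert_eq!(max_streams_for(100, 1_000, pool), 204); // 10% of stake
    assert_eq!(max_streams_for(1, 1_000_000, pool), 1); // tiny stake gets floor
    assert_eq!(max_streams_for(0, 0, pool), 1);         // degenerate case
    println!("ok");
}
```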
When the server processes a stream and its chunks, it may time out and close
the stream if it does not receive the data within a configured timeout window
(2 seconds).

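A minimal model of this per-stream receive timeout, using a channel as a stand-in for QUIC stream chunks; the 2-second window comes from the proposal, everything else is illustrative:

```rust
use std::sync::mpsc;
use std::time::Duration;

// Timeout window from the proposal; the chunked-read loop is a model, not
// the actual server implementation.
const STREAM_READ_TIMEOUT: Duration = Duration::from_secs(2);

// Accumulate chunks of a single vote transaction. An empty chunk marks the
// end of the stream; a timeout drops the partial data and closes the stream.
fn read_vote(rx: &mpsc::Receiver<Vec<u8>>) -> Option<Vec<u8>> {
    let mut buf = Vec::new();
    loop {
        match rx.recv_timeout(STREAM_READ_TIMEOUT) {
            Ok(chunk) if chunk.is_empty() => return Some(buf), // stream finished
            Ok(chunk) => buf.extend_from_slice(&chunk),
            Err(_) => return None, // timed out: close stream, discard partial vote
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(b"vote-tx-bytes".to_vec()).unwrap();
    tx.send(Vec::new()).unwrap(); // finish marker
    assert_eq!(read_vote(&rx).as_deref(), Some(&b"vote-tx-bytes"[..]));
    println!("ok");
}
```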
The validator also uses gossip to pull votes from other validators. This
proposal does not change that transport, which will remain UDP based. Because
gossip-based votes are pulled by the validator, the concern about increased
vote traffic is lessened.

## Impact

QUIC, unlike UDP, is connection based, so there is extra overhead to establish
a connection when sending a vote. To minimize this, the client side can employ
connection caching and a pre-connection cache warmer driven by the leader
schedule.

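A hedged sketch of the connection caching idea; the types and method names are illustrative and not the Agave connection-cache API:

```rust
use std::collections::HashMap;

// Stand-in for a live QUIC connection; establishing one is the handshake
// cost we want to pay ahead of time.
struct Connection {
    peer: String,
}

struct ConnectionCache {
    conns: HashMap<String, Connection>,
}

impl ConnectionCache {
    fn new() -> Self {
        Self { conns: HashMap::new() }
    }

    // Pre-establish connections to upcoming leaders from the leader schedule
    // so that, at vote time, no handshake is needed.
    fn warm(&mut self, upcoming_leaders: &[String]) {
        for leader in upcoming_leaders {
            self.conns
                .entry(leader.clone())
                .or_insert_with(|| Connection { peer: leader.clone() });
        }
    }

    // Reuse a cached connection if available; otherwise dial (handshake cost).
    fn get_or_dial(&mut self, leader: &str) -> &Connection {
        self.conns
            .entry(leader.to_string())
            .or_insert_with(|| Connection { peer: leader.to_string() })
    }
}

fn main() {
    let mut cache = ConnectionCache::new();
    cache.warm(&["leader-a".to_string(), "leader-b".to_string()]);
    // By vote time the connection is already warm.
    assert!(cache.conns.contains_key("leader-a"));
    let conn = cache.get_or_dial("leader-a");
    assert_eq!(conn.peer, "leader-a");
    println!("ok");
}
```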
## Security Considerations

There are no net-new security vulnerabilities, as QUIC TPU transactions are
already in place. Similar DoS attacks could be targeted against the new QUIC
port used by TPU votes; connection rate limiting is one tool to fend off such
attacks.

## Backwards Compatibility

Care needs to be taken to ensure a smooth transition from UDP-based to
QUIC-based TPU votes.

Phase 1. The server side will support both UDP and QUIC for TPU votes. No
clients send TPU votes via QUIC.

Phase 2. After all staked nodes are upgraded to support receiving TPU votes
via QUIC, restart the validators with configuration to send TPU votes via
QUIC.

Phase 3. Turn off the UDP-based TPU vote listener on the server side once all
staked nodes complete Phase 2.

> **Review note:** The validator-identity-to-vote-account association and the
> stake distribution can change at the epoch boundary. At the boundary, the
> server may need to accept a client that has stake in either the old epoch or
> the new epoch.
>
> **Response:** Stake information is periodically updated and is eventually
> consistent; QoS is applied based on the latest information the server sees.
> There is no need to complicate the design by consulting two sets of stake
> data.