
runtime upgrade: Methods to avoid ever including parachain code in critical-path data #967

Open
rphmeier opened this issue Jun 19, 2021 · 5 comments
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments


rphmeier commented Jun 19, 2021

When performing parachain code upgrades, we currently include the new parachain code.

  1. When announcing the code upgrade, the full code appears in the candidate receipt.
  2. When applying the code upgrade, the full code appears in the PoV, as it moves from one section of the state to the next.

Empirically, the code we see for parachains is quite large, typically in the 500K to 800K range (Sergei: I observed PVFs up to a couple of megabytes).

Avoiding code in the critical path is important because it reduces friction at runtime upgrade points, if backing groups have relatively low bandwidth. It's not unreasonable for upgrade blocks to take a few minutes to get backed in the status quo. It also makes the code size more independent from the PoV size, opening up the opportunity for parachain developers to build more complex runtimes without being affected by restrictions targeting critical-path bandwidth.

This issue will be split into two sections, one for each of these points.

Solving Code in Candidate Receipts: Hash-based announcements

At the moment, PVFs announce code upgrades by returning the full code when it's allowed, according to the state root of the relay chain. This code then appears, in full, in the candidate commitments. These candidate commitments are, in turn, gossiped among all validators so they can be included into the relay chain by the block author, who is most likely not a backer of the parachain doing the code upgrade.

An improvement to this situation would be for the PVF to only output the hash and size of the code, for inclusion in the candidate commitments.

Upon reaching the relay chain, the future-code announcement opens a grace period during which any user of the relay chain can upload the code via an UnsignedTransaction. These uploads are not on the critical path of parachain execution, and parachain code upgrades need to be delayed anyway for other reasons (see paritytech/polkadot#3211). Once the code is actually uploaded, the relay chain is ready for the parachain to upgrade, and after the code_upgrade_delay specified in the HostConfiguration, the code can be applied at any time.
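A minimal sketch of this announce-then-upload flow, under stated assumptions: the struct and method names are hypothetical, and `DefaultHasher` stands in for the real 256-bit code hash (blake2_256 in Polkadot) purely so the example is self-contained.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for blake2_256; used here only to keep the sketch self-contained.
fn code_hash(code: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    code.hash(&mut h);
    h.finish()
}

/// What the PVF would commit to instead of the full code blob.
struct CodeUpgradeAnnouncement {
    code_hash: u64,
    code_size: u32,
}

/// Relay-chain side: an announcement waiting for the matching upload.
struct PendingUpgrade {
    announcement: CodeUpgradeAnnouncement,
    uploaded: Option<Vec<u8>>,
}

impl PendingUpgrade {
    /// Accept an upload only if it matches the announced hash and size;
    /// anyone may submit it, off the critical path of parachain execution.
    fn try_upload(&mut self, code: Vec<u8>) -> Result<(), &'static str> {
        if self.uploaded.is_some() {
            return Err("already uploaded");
        }
        if code.len() as u32 != self.announcement.code_size {
            return Err("size mismatch");
        }
        if code_hash(&code) != self.announcement.code_hash {
            return Err("hash mismatch");
        }
        self.uploaded = Some(code);
        Ok(())
    }
}
```

Once `uploaded` is `Some`, the code-upgrade delay clock can start; the full blob never appeared in a candidate receipt.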

Solving Code in upgrade parablock PoVs: Move code to the PVF parameters and the AvailableData.

When a parachain actually triggers its code upgrade, in practice, it involves the PVF moving the new code from one section of the trie to another. Although this is not strictly necessary within the parachain execution model, Cumulus-based parachains store their code in the state trie.

There are two approaches I considered to solve this problem:

  1. Write PVFs in such a way that the code doesn't need to be moved around in its state. What this means for Cumulus is making :code less special, or giving a way for :code to specify some other trie node which actually holds the real code.
  2. Pass the code, optionally, into the PVF, so it doesn't need to be loaded from the PoV. When it is passed, keep it available in the AvailableData.

The problem with approach 1 is that although we avoid including the code in the PoV at the point of the upgrade, we still have to include the code in the PoV in some other block, where the code was moved into the storage of the parachain. This makes it a non-solution, so we'll ignore it and look at approach 2.

The idea of approach 2 is to make 3 alterations to parachain primitives:

```rust
struct AvailableData {
    // other fields..
    code: Option<ValidationCode>, // new
}

struct CandidateDescriptor {
    // other fields..
    applies_upgrade: bool, // new
}

struct ValidationParams {
    // other fields..
    code: Option<ValidationCode>, // new
}
```

With these changes, we continue to make the code available in the erasure-coding of the AvailableData that is kept by the entire validator-set, but it no longer needs to be sent explicitly between the collator and the backers or between the backers. Instead, if applies_upgrade is true, the backers can draw the code from other sources. At the moment, scheduled validation code is stored on-chain, but even in the future, when validation code is stored off-chain, the backing validators will have it to pass into the PVF.

Since the backing pipeline is the critical path, reducing the bandwidth between these actors will have a huge beneficial effect on the performance of the blocks applying runtime upgrades.

It is illegal for the CandidateDescriptor to contain applies_upgrade == true if the context it is executed in does not have a scheduled code upgrade for the parachain. Honest backers will only place Some into AvailableData::code if applies_upgrade == true. The runtime of the relay chain will reject all such candidates, so it's known that every candidate receipt that appears on-chain, pre-availability, correctly indicates whether AvailableData::code should contain Some.

As an approval checker or a dispute participant, if applies_upgrade == false and AvailableData::code is Some, the candidate is invalid, and vice-versa. This means that any malicious backers which have managed to include a false AvailableData are slashed, and also that the candidate won't be finalized. This check is safe, because anything that has been included has already passed the runtime check in the past. If these checks pass, and AvailableData::code is Some, then it should be passed into the PVF during the approval check.
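The consistency rule above can be stated as a one-line predicate. This is a sketch with a hypothetical function name; `Vec<u8>` stands in for ValidationCode.

```rust
/// Rule applied by approval checkers and dispute participants:
/// `AvailableData::code` must be `Some` exactly when the descriptor
/// claims the candidate applies a code upgrade. Any mismatch makes the
/// candidate invalid and slashes the backers that included it.
fn code_field_consistent(applies_upgrade: bool, code: &Option<Vec<u8>>) -> bool {
    applies_upgrade == code.is_some()
}
```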

Lastly on the core protocol side, the only thing that the parachain storage needs to store is the hash of the upcoming code. When the PVF accepts the new code, it can check that the code passed in hashes to the correct value, and then write it to its state. Writes to state don't affect the PoV size significantly as a general rule, and especially not when a trie node with the given key (:code in practice) is already present.
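A sketch of that PVF-side check, under assumptions: the function name is hypothetical, state is modelled as a plain key-value map, and `DefaultHasher` stands in for the real code hash.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the real code hash (blake2_256 in practice).
fn sketch_hash(code: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    code.hash(&mut h);
    h.finish()
}

/// PVF side: parachain storage holds only the hash of the upcoming
/// code. When the upgrade is applied, verify the code passed in as a
/// validation parameter against that hash, then write it to state.
fn apply_code_upgrade(
    scheduled_code_hash: u64,
    supplied_code: &[u8],
    state: &mut HashMap<Vec<u8>, Vec<u8>>,
) -> Result<(), &'static str> {
    if sketch_hash(supplied_code) != scheduled_code_hash {
        return Err("supplied code does not hash to the scheduled value");
    }
    // Overwriting `:code` is a plain state write: neither the old nor
    // the new code blob needs to appear in the PoV.
    state.insert(b":code".to_vec(), supplied_code.to_vec());
    Ok(())
}
```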

Implementing this new PVF for Cumulus nodes poses a small additional challenge because of its requirements:

  1. The head-data from the PVF needs to match the header of the actual block in the associated Substrate chain.
  2. The entire Cumulus chain should be executable from the genesis with nothing other than a state DB.

From these requirements, it's clear that the actual blocks that Cumulus nodes synchronize, store, and execute, need to contain the new code at some point in the chain. So the challenge is to find a way to do this in a way where the PoV never does.

The solution that I propose is to have a special inherent, something like this:

```rust
ApplyNewCode(ValidationCode),
```

This is what appears in the full block outside of the PVF. However, what appears in the PVF is a slightly modified version:

```rust
ApplyNewCode, // Parameter is implicit.
```

In the initial stages of PVF execution, if this inherent is found, then the PVF must have accepted Some(validation_code) as its argument, or the inputs are invalid. It can replace the stub inherent with the full version. This achieves all three goals: the produced Cumulus blockchain contains the new code, the PVF produces head data that matches the Cumulus blockchain, and the PoV never contains any code.
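The stub-expansion step can be sketched as follows. The enum and function names are hypothetical, and `Vec<u8>` stands in for ValidationCode; real inherents would carry more variants.

```rust
/// What appears in the full Cumulus block, outside the PVF.
enum FullInherent {
    ApplyNewCode(Vec<u8>),
}

/// What appears inside the PoV: the parameter is implicit.
enum StubInherent {
    ApplyNewCode,
}

/// Early in PVF execution: if the stub inherent is present, the PVF
/// must have received `Some(code)` in its validation parameters, and
/// the stub is expanded back to the full inherent; otherwise the
/// inputs are invalid.
fn expand_stub(
    stub: StubInherent,
    param_code: Option<Vec<u8>>,
) -> Result<FullInherent, &'static str> {
    match (stub, param_code) {
        (StubInherent::ApplyNewCode, Some(code)) => Ok(FullInherent::ApplyNewCode(code)),
        (StubInherent::ApplyNewCode, None) => Err("stub inherent present but no code parameter"),
    }
}
```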


burdges commented Jun 19, 2021

I've lost our recent issue discussing this now, but.. We could handle :code as a distinguished recent parachain block, not state data. We'd need a mechanism for fetching a recent block from availability, which sounds heavy, but parachains could bypass this cost because nodes already cached their build.

We could reuse this technique for MEV protections in parachains too:

We split the parachain block into "run now" and "run 10 slots in the future", perhaps by pushing a bunch of transactions into the state, but preferably by splitting the availability encoding. In other words, we ideally make block n actually process the block of transactions placed into availability by block n-10. We then permute the transaction order by the relay chain randomness, so transactions could now fail, but block n's backing checker marks the bad ones. This provides MEV protection.
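The permutation step described above might look something like this. Purely a hypothetical sketch: the relay-chain randomness is modelled as a plain u64 seed driving a xorshift64 generator, so that every validator derives the same order from the same randomness.

```rust
/// Deterministically shuffle the delayed transactions using the
/// relay-chain randomness (here: a u64 seed).
fn permute_by_randomness<T>(txs: &mut [T], mut seed: u64) {
    // Fisher-Yates shuffle driven by a tiny xorshift64 generator.
    for i in (1..txs.len()).rev() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        let j = (seed % (i as u64 + 1)) as usize;
        txs.swap(i, j);
    }
}
```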

We'll include two ephemeral decryption keys associated to sassafras slot assignment proofs, for which the upcoming block producer knows the secret key. When the block producer makes their sassafras block, they delete the first secret key, as they've already decrypted any transactions, and then publish the second in the header. We turn this into even stronger MEV protection by decrypting in block n the transactions placed into availability in block n-10, using the keys published by the intervening blocks. In other words, we'd prevent MEV by running something vaguely like mixnet-style decryption on-chain.

In both cases, we need a whole block to either hang out in state for 10 slots or else provide some means by which the block 10 slots later fetches it from availability.

@rphmeier
Contributor Author

@burdges

I'm generally not a fan of the "treat code upgrades as a special block" because it's unclear how Cumulus should handle that block. As mentioned in the issue, we have the goal that the produced Cumulus chain can be synchronized entirely on its own. I don't think we could do the 'special block' thing unless we altered Substrate itself to support those types of special blocks. That sounds really difficult so it's a class of solution I would prefer to avoid.


burdges commented Jun 19, 2021

Yes, it'd ask Substrate to treat special blocks like detached state data and alter pruning rules, so yes, it touches several things and I'm unsure of the complexity. It's roughly your 2 though, no?

Also, where was the other recent issue you opened on this? I'd started to come around to your perspective, but now lost my train of thought..

I should reread my own thoughts in paritytech/polkadot#3211 too. ;)


cheme commented Jun 21, 2021

the PVF must have accepted Some(validation_code) as its argument or the inputs are invalid.

Not sure if there is a way to avoid putting 'validation_code' in memory when running the PVF?

Does not seem easy without specific validation of block data in Polkadot (or some mechanism involving a specific host function that would build some specific hashing with the external validation_code, and thus a validation function a bit different than the runtime, or an overload of a host function for it, as currently done for diverging code).


rphmeier commented Jun 21, 2021

@cheme We don't care (that much) about memory usage. This is about PoV size. I am not sure you have understood the issue well enough.

The only change this needs on the trie side is to make sure that when overwriting, but not reading, :code, the old value of :code does not appear in the PoV. Everything else will be handled by the parachain protocol changes described in this issue. But this trie optimization is out of scope for the issue and should be discussed elsewhere.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. and removed I10-optimisation labels Aug 25, 2023
@eskimor eskimor changed the title Methods to avoid ever including parachain code in critical-path data runtime upgrade: Methods to avoid ever including parachain code in critical-path data Jan 30, 2024