
Bump VALIDATION_CODE_BOMB_LIMIT #7710

Closed
wants to merge 3 commits

Conversation

@sandreim (Contributor) commented Feb 25, 2025

It used to be 16 MiB some time ago. Bumping to 32 MiB sounds sensible to me, as we've seen some recent runtimes expanding close to the previous 16 MiB limit.

This is just a short-term fix for #7664 and needs to be backported and deployed ASAP.

While this unblocks AssetHub, I will create another PR to promote the limit into a HostConfiguration parameter and an associated runtime API.
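
For illustration, the change boils down to bumping a single node-side constant; a minimal sketch (the exact type and definition in the node primitives may differ):

```rust
// Illustration only: the real constant lives in the Polkadot node primitives
// and its exact definition may differ; the point is that the decompressed
// ("bomb") size limit for validation code becomes 32 MiB.
pub const VALIDATION_CODE_BOMB_LIMIT: usize = 32 * 1024 * 1024;
```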

@sandreim sandreim added T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network. A4-needs-backport Pull request must be backported to all maintained releases. labels Feb 25, 2025
@sandreim (Contributor, Author) commented Feb 25, 2025

/cmd prdoc --bump patch --audience node_dev

Signed-off-by: Andrei Sandu <[email protected]>
@bkchr (Member) left a comment

This can lead to disputes? When 3/4 + 1 have upgraded and someone uploads such a big runtime, other nodes that do not support this will start disputes?

@sandreim (Contributor, Author) commented Feb 26, 2025

This can lead to disputes? When 3/4 + 1 have upgraded and someone uploads such a big runtime, other nodes that do not support this will start disputes?

The same concern has kept me from merging it so far.

However, it should not. I have looked at the code and this error is handled as a deterministic internal error. At worst, the PVF precheck would fail if not enough validators have upgraded. Otherwise, if it passes precheck, un-upgraded validators will not issue a dispute and will just log an error.

@s0me0ne-unkn0wn can you confirm the above?

@alexggh (Contributor) commented Feb 26, 2025

Otherwise, if it passes precheck, un-upgraded validators will not issue a dispute and will just log an error.

But they would still be no-shows and increase the finality lag for everyone. With 1/4 of validators being no-shows and 500 validators, my back-of-envelope math tells me that the finality lag will be around 15 blocks on average.

Rough explanation (a small sketch of this recursion follows the list):

  • Round 1: You need 30 assignments randomly; assuming 1/4 will be no-shows, you have 7.5 no-shows.
  • Round 2: You need 8 tranches to cover the no-shows: 2.33 assignments per tranche * 8 = 18 assignments, 1/4 no-shows, 4.5 no-shows.
  • Round 3: You need 5 tranches to cover the no-shows: 2.33 * 5 = 11.65 assignments, 2.91 no-shows.
  • Round 4: You need 3 tranches: 2.33 * 3 = 6.99 assignments, 1.74 no-shows.
  • Round 5: You need 2 tranches: 2.33 * 2 = 4.66 assignments, 1.16 no-shows; depending on how lucky we are, we should be able to finalize here or in the next round.
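
A minimal sketch of this recursion, assuming the figures used above (30 tranche-zero assignments, ~2.33 assignments per tranche, a 25% no-show rate, roughly one extra tranche per no-show); illustrative only, not protocol code:

```rust
fn main() {
    // Assumptions taken from the rough explanation above, not protocol constants.
    let assignments_per_tranche = 2.33_f64;
    let no_show_rate = 0.25_f64;

    let mut assignments = 30.0_f64; // tranche-zero assignments
    for round in 1..=5 {
        let no_shows = assignments * no_show_rate;
        println!("round {round}: {assignments:.2} assignments, {no_shows:.2} no-shows");
        // Cover the no-shows with roughly one extra tranche per no-show.
        assignments = no_shows.ceil() * assignments_per_tranche;
    }
}
```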

@eskimor (Member) commented Feb 26, 2025

Apart from the fact that it should be 2/3rds + 1, this concern sounds valid. I think the best we can do here is roll this out in two phases to be safe:

  1. Bump the limit only if not pre-checking.
  2. Bump the limit in pre-checking too once enough nodes upgraded.

Assuming validators are at least following minor version bumps, the time between (1) and (2) can hopefully be rather small, if we backport to all running releases. I guess validators at least following point releases is just wishful thinking though?

Alternative hack: Temporarily increase the pre-checking threshold to something that makes the worst-case finality lag acceptable. This requires a runtime upgrade, so it will also take a while, and the problem stays roughly the same: the bigger limit won't be usable until enough validators have upgraded. But it is a bit safer, as the "enough validators upgraded" condition is enforced by the protocol.

@s0me0ne-unkn0wn (Contributor)

@s0me0ne-unkn0wn can you confirm the above?

We never dispute on the basis of pre-checking results. The reasoning is that the pre-checking results are the basis for future disputes by themselves. If the pre-checking has passed successfully, and later a preparation (with lenient timeout) fails, we dispute. If the pre-checking fails, we never enact the PVF in the first place, so there's no reason to dispute here. So, the pre-checking output is strictly binary.

Also, FWIW, the note on rejecting PVFs from the docs:

It is possible that the candidate validation was not able to check the PVF, e.g. if it timed out. In that case, the PVF pre-checker will vote against it. This is considered safe, as there is no slashing for being on the wrong side of a pre-check vote.

Rejecting instead of abstaining is better in several ways:

  • Conclusion is reached faster - we have actual votes, instead of relying on a timeout.
  • Being strict in pre-checking makes it safer to be more lenient in preparation errors afterwards. Hence we have more leeway in avoiding raising dubious disputes, without making things less secure.

Also, if we only abstain, an attacker can specially craft a PVF wasm blob so that it will fail on e.g. 50% of the validators. In that case a supermajority will never be reached and the vote will repeat multiple times, most likely with the same result (since all votes are cleared on a session change). This is avoided by rejecting failed PVFs, and by only requiring 1/3 of validators to reject a PVF to reach a decision.
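
For illustration, the decision rule described above could be sketched roughly like this (not the actual implementation; it assumes a 2/3 supermajority to accept, combined with the 1/3 rejection threshold from the quoted docs):

```rust
/// Illustrative only: accept on a 2/3 supermajority of approvals, reject once
/// 1/3 of the validators vote against, otherwise keep the vote open.
fn prechecking_outcome(approvals: u32, rejections: u32, n_validators: u32) -> Option<bool> {
    if approvals * 3 >= n_validators * 2 {
        Some(true) // accepted: the PVF may be enacted
    } else if rejections * 3 >= n_validators {
        Some(false) // rejected: the PVF is never enacted, no dispute needed
    } else {
        None // still undecided, keep collecting votes
    }
}
```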

@bkchr (Member) commented Feb 26, 2025

I think the best we can do here is roll this out in two phases to be safe:

1. Bump the limit only if not pre-checking.

2. Bump the limit in pre-checking too once enough nodes upgraded.

Or we do it directly via the HostConfiguration. Then we only need to wait once until enough people have upgraded.

@sandreim (Contributor, Author)

Apart from the fact that it should be 2/3rds + 1, this concern sounds valid.

If the concern is finality, then yes, I agree: the concern is the increased finality lag, which we can mitigate by pinging validators to upgrade.

  1. Bump the limit only if not pre-checking.
  2. Bump the limit in pre-checking too once enough nodes upgraded.

This approach is faster than the HostConfiguration approach since we don't need to wait for runtime upgrades.
Also, after step 1 there wouldn't be any improvement over the status quo: parachains will still not be able to upgrade their runtimes until we get to phase 2, which also requires waiting for nodes to be upgraded.

So, the options, ordered by how fast they fix the problem, are:

  • one-shot: this PR
  • two-step node-side approach: change the validation limit, then the pre-checking limit
  • HostConfiguration approach: requires an SDK major release, a fellowship release and enactment on the relay chain

What we choose depends on how soon we need this fixed in production, which depends on the impact. Are teams already affected, or do we know that they soon will be?

@bkchr (Member) commented Feb 26, 2025

This approach is faster than the HostConfiguration approach since we don't need to wait for runtime upgrades.

The HostConfiguration changes can be prepared and rolled out to validators quite fast. While we need to wait for them to upgrade anyway, the runtime upgrade can happen in parallel.

@sandreim (Contributor, Author)

This approach is faster than the HostConfiguration approach since we don't need to wait for runtime upgrades.

The HostConfiguration changes can be prepared and rolled out to validators quite fast. While we need to wait for them to upgrade anyway, the runtime upgrade can happen in parallel.

I don't think this will happen fast because it requires a stable release, as it is a major bump due to the HostConfiguration struct change.

@bkchr (Member) commented Feb 26, 2025

I don't think this will happen fast because it requires a stable release, as it is a major bump due to the HostConfiguration struct change.

Then just implement this in a backwards compatible way.

@sandreim (Contributor, Author) commented Feb 27, 2025

I don't think this will happen fast because it requires a stable release, as it is a major bump due to the HostConfiguration struct change.

Then just implement this in a backwards compatible way.

I cannot find any way to change that struct without adding more code bloat to handle a separate storage value plus active/pending per-session tracking of the parameter change.

So I'm proposing the 4th option:

  • Obsolete the constant VALIDATION_CODE_BOMB_LIMIT in the node primitives;
  • introduce a new constant in the runtime itself, MAX_VALIDATION_CODE_COMPRESSION_RATIO, that is used to compute the bomb limit from max_code_size;
  • add a runtime API to retrieve the code bomb limit.

Doing this will result in a limit of 30 MB based on the current configuration, and it is also future-proof: if we increase the code size, the bomb limit increases as well. This will be a minor semver bump, so we won't break anything and we can do a quick release. On the node side, however, it could be trickier semver-wise, but we can manage that since only our node uses the affected crates.
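
A rough sketch of how this could look (the constant name comes from the list above; the ratio value, the 3 MiB max_code_size and the helper function are illustrative assumptions, not the final implementation):

```rust
/// Illustrative ratio: with a max_code_size of 3 MiB, a ratio of 10 yields the
/// ~30 MB limit mentioned above.
const MAX_VALIDATION_CODE_COMPRESSION_RATIO: u32 = 10;

/// The bomb limit is derived from the configured max code size instead of
/// being a hard-coded node-side constant, so it scales automatically if the
/// max code size is ever raised. The node would retrieve it via a runtime API.
fn validation_code_bomb_limit(max_code_size: u32) -> u32 {
    max_code_size.saturating_mul(MAX_VALIDATION_CODE_COMPRESSION_RATIO)
}
```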

@Overkillus (Contributor) commented Feb 28, 2025

Options

I see 4 solutions being posted:

  1. (Andrei's original) Just bump VALIDATION_CODE_BOMB_LIMIT to a higher value.
     • requires a node-side upgrade
     • asymptomatic until 66% of nodes upgrade
     • can cause consistent finality lags of 20+ blocks when above a 66% upgrade rate
     • the higher the upgrade rate above 66%, the lower the finality lag
     • simple and effective quickly
  2. (Robert's suggestion) Split the limit into two, one for pre-checking and one for other contexts, then update them in two stages.
     • requires 2 node upgrades
     • the first node upgrade is completely asymptomatic
     • the second node upgrade is asymptomatic until it reaches 66%
     • when the second node upgrade reaches 66%, the finality lag depends only on the upgrade rate of the first node upgrade (same as solution 1, where the lag maxes out at 66%)
     • the benefit is that we can delay the second node upgrade until we are confident in the upgrade rate of the first one (we essentially skip the awkward moment where 1/3 of the nodes no-show)
  3. (Basti's solution) Just use the HostConfiguration.
     • requires a node upgrade (we need to stop using the local constant)
     • requires a runtime upgrade (to update the HostConfiguration, potentially with a major release)
     • the node change can be made asymptomatic by defaulting to the local constant if the HostConfiguration is missing this field
     • the runtime upgrade to add this is difficult to make in a backwards-compatible way (just quoting Andrei here)
     • a nice long-term solution allowing for runtime parametrisation
  4. (Andrei's new proposal) Runtime constant.
     • requires a node upgrade (obsoleting VALIDATION_CODE_BOMB_LIMIT)
     • requires a runtime upgrade
     • the node change can be made asymptomatic with a sensible default
     • the runtime upgrade will trigger the switch, just like in solution 2

Solution 4 has the nice property of letting us control when we actually make the switch, so we skip the awkward 66% support phase (just like solution 2), but it is cleaner than solution 2 and allows for future governance runtime parametrisation like solution 3. Ideally this should be in the host config as explained in solution 3, but if adding it there leads to complications it can exist as a separate constant. IMO 4 is the cleanest and carries the least risk.


Upgrade risks

Additionally, I did a more thorough check and verified that this will NOT trigger disputes even in approval checking; it will just result in no-shows. It will be treated as a deterministic preparation error and the node will silently wait out the no-show duration even though it is certain it will no-show.

The question is how the finality lag behaves for different upgrade rates. Alex did some initial calculations and I think they are mostly on point (apart from the assumption of a 1/4 no-show rate).

I did a quick script here that graphs how many approval checkers (for the overweight runtime candidates) we will need:
(Assuming validator_set = 500, tranche_size = 2.33, tranche_zero_size = 30)

[graph: number of approval checkers needed vs. node upgrade rate]

And then based on how many rounds it took we can estimate the finality lag:

[graph: estimated finality lag vs. node upgrade rate]

Delay tranches are 0.5s long, but each time we add more tranches it effectively takes 3 slots (18s), as this is how long it takes for the no-show to be registered (@alexggh thanks for reminding me!).

So, as an example, with a 75% upgrade rate we will have a 25% no-show chance, which leads to an average of 70 or so approval checkers (compared to the default of 30-ish). The no-shows are spread over around 5 rounds, which results in roughly 5 * 3 = 15 blocks of finality lag.
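
A minimal sketch of that arithmetic, assuming the 3-slots-per-round figure above (illustrative helper, not project code):

```rust
// Each extra round of no-show coverage costs roughly 3 slots before the
// no-shows are registered, so rounds * 3 approximates the finality lag.
fn estimated_finality_lag_blocks(rounds_with_no_shows: u32) -> u32 {
    const SLOTS_PER_NO_SHOW_ROUND: u32 = 3;
    rounds_with_no_shows * SLOTS_PER_NO_SHOW_ROUND
}

fn main() {
    // ~5 rounds of no-shows at a 75% upgrade rate, per the estimate above.
    assert_eq!(estimated_finality_lag_blocks(5), 15);
}
```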


Real Network

At a 2/3 upgrade rate we will average a finality lag of 30-ish blocks. All of this assumes we don't have no-shows for any other reason, so in the real network it will only be worse.

Additionally, it will get slightly worse in the real network because of random deviations. Assume we have an average finality lag of 15:
Relay block X gets delayed by 18 blocks due to no-shows (it got a bit unlucky with no-shows).
Relay block X+1 gets lucky and will be delayed by 13 blocks.

X+1 depends on X, so we effectively have to wait for X's 18 blocks anyway, which means the unluckiest no-shows will always slow us down.

This effect can add a few extra blocks here and there and I did not account for it in the simulation.


Explicit No-Shows

(@burdges)

I will take this opportunity to yet again pitch the idea of Explicit No-Shows. This is another case where having them would mitigate most of the accidental risks introduced by this change. Explicit no-shows are not just for back-pressuring.

Honest nodes that don't upgrade in this example are quickly CERTAIN that they will no-show, and yet the whole network waits 18s to figure out that they will no-show. They could just tell us that they will no-show once they encounter the internal error, and this whole finality lag would be over within the span of a single slot.

@sandreim (Contributor, Author)

Closing because #7760 is the better choice

@sandreim sandreim closed this Feb 28, 2025
github-merge-queue bot pushed a commit that referenced this pull request Mar 6, 2025
Fixes #7664

The discussion in #7710 should provide the context for why this solution was picked.

---------

Signed-off-by: Andrei Sandu <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
github-actions bot pushed a commit that referenced this pull request Mar 6, 2025
Fixes #7664

The discussion in #7710 should provide the context for why this solution was picked.

---------

Signed-off-by: Andrei Sandu <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
(cherry picked from commit f02134c)