Even smarter full availability recovery from backers #3216

Open
sandreim opened this issue Feb 5, 2024 · 6 comments
Labels
I5-enhancement An additional feature request. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments


sandreim commented Feb 5, 2024

Currently we have a static threshold for deciding when we fetch full availability from backers as a first strategy: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/availability-recovery/src/lib.rs#L81

We can improve this choice by taking into account multiple parameters:

  • PoV size
  • backing group size
  • recovery context (approval or dispute)

My suggestion would be to always fetch from backers first if the backing group size is 4 or 5, no matter the size of the PoV, when the context is approval. In a healthy network this should be fine, as we'd expect the validators not to be hammered by this (around only 48 Mbps, which is only 10% of the recommended specs).

For disputes we might want to be more conservative, while for smaller backing groups we should determine a higher threshold.
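
A minimal sketch of what such a multi-parameter heuristic could look like; the function, the `RecoveryContext` enum and all threshold values below are hypothetical placeholders for illustration, not code from the availability-recovery subsystem:

```rust
/// Hypothetical recovery context; not an actual type from the subsystem.
enum RecoveryContext {
    Approval,
    Dispute,
}

/// Illustrative heuristic: should we try fetching the full PoV from backers
/// first? All thresholds are placeholders that would need benchmarking.
fn should_fetch_full_from_backers(
    pov_size: usize,
    backing_group_size: usize,
    context: RecoveryContext,
) -> bool {
    // Hypothetical higher threshold for smaller backing groups.
    const SMALL_GROUP_THRESHOLD: usize = 512 * 1024; // 512 KiB
    // Hypothetical, more conservative threshold for disputes.
    const DISPUTE_THRESHOLD: usize = 128 * 1024; // 128 KiB

    match context {
        // Approval context with a backing group of 4 or 5: always try the
        // full PoV first, regardless of its size.
        RecoveryContext::Approval if backing_group_size >= 4 => true,
        // Approval with a smaller group: fall back to a (higher, tunable)
        // size threshold.
        RecoveryContext::Approval => pov_size <= SMALL_GROUP_THRESHOLD,
        // Disputes: stay more conservative regardless of group size.
        RecoveryContext::Dispute => pov_size <= DISPUTE_THRESHOLD,
    }
}
```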

@sandreim sandreim added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. I5-enhancement An additional feature request. labels Feb 5, 2024

burdges commented Feb 6, 2024

Why? We'd expect fewer first-line sources to be a pessimization.

As a thought experiment, imagine bittorrent being our transport:

We'd artificially have a torrent with one file for each availability chunk and a local tracker populated by our knowledge of backing votes, availability votes, and approval votes.

We'd employ partial torrent mode, initially select only the systematic chunks for download, and then flip to full torrent mode later once we have trouble. We only care here about the systematic chunks download phase for this conversation though.

How does bittorrent determine from whom it downloads which bittorrent chunks? It'll maybe bias towards more useful peers, but overall it cares little so long as they arrive and the hashes match. It definitely does not exclude peers just because they only hold one chunk.

We already have open TLS+Noise connections between all validator pairs. Availability providers should hold systematic chunks in memory, exactly like backers do.

It's possible this optimizes over weaknesses elsewhere in our protocol stack, but if so then maybe those places should simply be fixed?


sandreim commented Feb 6, 2024

Why? We'd expect fewer first-line sources to be a pessimization.

What I describe here is a cheap incremental optimization and tuning of parameters which is quite easy to implement.

As a thought experiment, imagine bittorrent being our transport:

We'd artificially have a torrent with one file for each availability chunk and a local tracker populated by our knowledge of backing votes, availability votes, and approval votes.

We'd employ partial torrent mode, initially select only the systematic chunks for download, and then flip to full torrent mode later once we have trouble. We only care here about the systematic chunks download phase for this conversation though.

This is somewhat similar to what we have now, in the sense that we try systematic chunk holders first and go "full torrent" later. A similar discussion is also in #575

It's possible this optimizes over weaknesses elsewhere in our protocol stack, but if so then maybe those places should simply be fixed?

If you meant the network stack, that is being taken care of with litep2p. Otherwise, there seem to be other implementations of Reed-Solomon which outperform our current one.


burdges commented Feb 6, 2024

What I describe here is a cheap incremental optimization and tuning of parameters which is quite easy to implement.

Why is it an optimization? Is it just because some choice is being made? Aka, right now we're doing multiple things unnecessarily or something?


alindima commented Feb 7, 2024

What Andrei is suggesting is to request the full unencoded PoV from the backing group (trying them one at a time), even for larger PoVs. We currently only try this first for small PoVs (under 128 KiB).
It has the advantage that we don't need Reed-Solomon decoding (although systematic chunks share that advantage), and that, assuming the backing group is not overloaded and has large bandwidth, it'd be faster to make one large request instead of N/3 smaller ones to different validators (even if they're made in parallel).

Now, I'm not convinced it's a good idea to do this for all PoV sizes. We should test this first.
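
As a back-of-envelope illustration of the trade-off (one large request vs many parallel chunk requests), here is a crude latency estimate; the PoV size, round-trip times, the "slowest peer is 2x the median RTT" assumption and the reassembly cost are all illustrative assumptions, not measurements:

```rust
// Crude latency comparison; every value here is an illustrative assumption.
fn main() {
    let pov_bytes = 5.0 * 1024.0 * 1024.0; // assumed 5 MiB PoV
    let rtt_s = 0.15;                      // assumed median round-trip time
    let our_download_bps = 500e6 / 8.0;    // 500 Mbit/s spec, in bytes/s

    // Strategy 1: a single request for the full PoV from one backer.
    let full_pov = rtt_s + pov_bytes / our_download_bps;

    // Strategy 2: n/3 parallel systematic-chunk requests. The total data is
    // the same, but we wait for the slowest of many peers (assumed 2x the
    // median RTT) and pay a small cost to stitch the chunks together.
    let slowest_rtt_s = 2.0 * rtt_s;
    let reassembly_s = 0.02;
    let systematic = slowest_rtt_s + pov_bytes / our_download_bps + reassembly_s;

    println!("full PoV from a backer: {:.2}s", full_pov);
    println!("systematic chunks:      {:.2}s", systematic);
}
```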


sandreim commented Feb 7, 2024

What Andrei is suggesting is to request the full unencoded PoV from the backing group (trying them one at a time), even for larger PoVs. We currently only try this first for small PoVs (under 128 KiB). It has the advantage that we don't need Reed-Solomon decoding (although systematic chunks share that advantage), and that, assuming the backing group is not overloaded and has large bandwidth, it'd be faster to make one large request instead of N/3 smaller ones to different validators (even if they're made in parallel).

Thanks @alindima, this is exactly what I mean.

Now, I'm not convinced it's a good idea to do this for all PoV sizes. We should test this first.
Yes, this is the fine-tuning part, but I expect it to be fine as described in the issue. Keep in mind that validator specs do require 500 Mbit bandwidth, so based on some simple math you can see that only 10% of that is needed on average per block for each backer when the backing group is of size 5.
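
For concreteness, a rough version of that simple math; the PoV size, approval-checker count and block time used below are illustrative assumptions, not figures taken from the issue:

```rust
// Back-of-envelope arithmetic behind the "~10% of the 500 Mbit/s spec" claim;
// every parameter value here is an assumption chosen for illustration.
fn main() {
    let pov_mib = 5.0_f64;        // assumed PoV size in MiB
    let approval_checkers = 30.0; // assumed approval checkers per candidate
    let backing_group = 5.0;      // backing group size from the issue
    let block_time_s = 6.0;       // relay-chain block time

    // Full-PoV fetches each backer serves per relay-chain block, on average.
    let fetches_per_backer = approval_checkers / backing_group;
    let mbit_per_s = fetches_per_backer * pov_mib * 8.0 * 1.048_576 / block_time_s;
    println!("per-backer upload ≈ {:.0} Mbit/s", mbit_per_s);
    // ≈ 42 Mbit/s with these numbers, i.e. under 10% of a 500 Mbit/s link.
}
```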

Using subsystem-bench has shown that we can do it, and I'd also expect it to be cheaper than systematic chunks, since we're much more efficient with memory, not to mention network overhead. We have one big piece vs stitching all systematic chunks together in memory. But that is not the point. We must test this well.

@burdges I don't really believe the current systematic chunks approach is tractable in practice if we assume validators haven't got enough bandwidth. Your suggested approach makes the assumption that we'd be faster fetching partial torrent-style vs fetching full availability from a single guy. I agree that would be the case for huge PoVs, 100 MB maybe, but for just 5 MB I doubt this makes any sense with a 500 Mbit/s network.

Systematic chunk recovery would work so much better if we had validators hold more than one chunk and also fetched chunks from approval checkers, assuming we are in a later tranche or just slow and have seen others in the same tranche approve it.


burdges commented Feb 7, 2024

We currently only try this first for small PoVs (under 128 KiB).

I see, thanks. Yes good idea. :)

Your suggested approach makes the assumption that we'd be faster fetching partial torrent-style vs fetching full availability from a single guy. I agree that would be the case for huge PoVs, 100 MB maybe, but for just 5 MB I doubt this makes any sense with a 500 Mbit/s network.

Yeah, we maybe do not require the complexity here.

Systematic chunk recovery would work so much better if we had validators hold more than one chunk and also fetched chunks from approval checkers, assuming we are in a later tranche or just slow and have seen others in the same tranche approve it.

We tune later tranches to be rare-ish, like maybe 15% (?) of total downloads. If we did torrent-like systematic downloads then yes, approval checkers could collaboratively download like bittorrent swarms do, even during tranche zero.

We could have validators hold multiple chunks of course. Ignoring overhead, our bandwidth looks like:

  • 2-5 x for backers downloading full PoVs
  • 3 x for distributing chunks, and
  • 30 x for approval checkers downloading chunks or full

It'd increase total availability bandwidth by 5-8% if we sent one systematic chunk to the two-thirds of validators not holding systematic chunks. You could also do a second, faster erasure code for those other chunks, like sending these validators the XOR of some random systematic chunks.
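
To spell out where a figure in that range can come from, here is the same back-of-envelope arithmetic with the multipliers listed above plugged in; this is just a sanity check of the estimate, not its source:

```rust
// Rough check of the "5-8%" figure; the multipliers come from the list above,
// everything else is back-of-envelope, not a measurement.
fn main() {
    // Existing availability bandwidth, in multiples of the PoV size.
    let backers_low = 2.0_f64;    // backers downloading full PoVs (low end)
    let backers_high = 5.0_f64;   // backers downloading full PoVs (high end)
    let chunk_distribution = 3.0; // distributing chunks
    let approval_checkers = 30.0; // approval checkers downloading chunks/full

    // One systematic chunk is ~3/n of a PoV, so sending one to each of the
    // 2n/3 validators not already holding a systematic chunk adds
    // (2n/3) * (3/n) = 2 PoV-equivalents, independent of n.
    let extra = 2.0;

    let low = 100.0 * extra / (backers_high + chunk_distribution + approval_checkers);
    let high = 100.0 * extra / (backers_low + chunk_distribution + approval_checkers);
    println!("increase ≈ {:.1}%..{:.1}%", low, high); // roughly 5-6%
}
```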

As you say, our PoV blocks remain small enough that maybe all this looks like overkill.
