
erasure coding design overview #85

Open · wants to merge 7 commits into base: master
Conversation

@dryajov (Contributor) commented May 28, 2022

Just a WIP for now; some sections are still missing.

@dryajov marked this pull request as ready for review on May 31, 2022 00:54
@cskiraly (Contributor) commented:

I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.

@dryajov (Contributor, Author) commented May 31, 2022

> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.

What do you mean by the other way around? Not sure if I follow :)

@cskiraly (Contributor) commented Jun 2, 2022

>> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.
>
> What do you mean by the other way around? Not sure if I follow :)

What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, forming protection and also acceleration over adjacent blocks. This assumes that the original block order actually matters to those accessing the data, i.e. that sequential access is a typical access pattern.

@dryajov (Contributor, Author) commented Jun 2, 2022

>>> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.
>>
>> What do you mean by the other way around? Not sure if I follow :)
>
> What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, forming protection and also acceleration over adjacent blocks. This assumes that the original block order actually matters to those accessing the data, i.e. that sequential access is a typical access pattern.

I see; yes, this can make sense, but it heavily depends on the placement we choose. I think this needs to be clarified further in the text and worded in such a way that it is clear that placement considerations would affect interleaving and vice versa.

@dryajov (Contributor, Author) commented Jun 2, 2022

I think we can leave it as it is for now (with the appropriate clarification), since this is how it is implemented currently, but we can change it if it becomes a bottleneck. This is one part that might also be per-dataset configurable...

@leobago (Contributor) commented Jun 2, 2022

If we consider that a slot is a row (row: blocks belonging to different encoding columns, to disperse erasures), then the ordering proposed by @dryajov will be beneficial for those reading sequential sections of the dataset. The other configuration (having sequential blocks 1, 2, 3, ... in a column) would only provide acceleration benefits during the encoding. Assuming we encode once and read several times, the ordering proposed by @dryajov seems more beneficial to me.

@Bulat-Ziganshin commented:

We may have several different layouts. For example, let's specify one of them, a simplified layout with fair distribution of download bandwidth:

The user provides S*K blocks and requests K+M encoding, saving the data over K+M nodes (thus each node stores S blocks).

For the first K nodes, block number Y on node X = original block number Y*K+X (i.e. blocks 0, 1 .. K-1, K... go into nodes 0, 1 .. K-1, 0...). In other words, original block number I goes into block number I/K on node I%K.

For the last M nodes, block number Y on node X is part of the recovery group consisting of the blocks numbered Y on all K+M nodes, i.e. original blocks Y*K+i, i=0..K-1, plus M recovery blocks.

Note however that this scheme places all original data on the first K nodes and all recovery data on the last M nodes. If we want to spread recovery data fairly over all nodes, we should reshuffle data between servers in each recovery group (note that a single recovery group still occupies block Y on all K+M servers).

The simplest way to do that is to shift every next group by one position relative to the previous one:

  • blocks 0 .. K-1 are mapped to servers 0 .. K-1, correspondingly
  • blocks K .. 2*K-1 are mapped to servers 1 .. K, correspondingly
  • blocks 2*K .. 3*K-1 are mapped to servers 2 .. K+1, correspondingly
    and so on.

Recovery blocks are shifted accordingly, filling the remaining servers in each recovery group:

  • recovery blocks 0 .. M-1 (i.e. recovering data from original blocks 0..K-1) are mapped to servers K .. K+M-1, correspondingly
  • recovery blocks M .. 2*M-1 are mapped to servers K+1 .. K+M-1, 0, correspondingly
  • recovery blocks 2*M .. 3*M-1 are mapped to servers K+2 .. K+M-1, 0, 1, correspondingly
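
For concreteness, a small worked example (hypothetical parameters K=3, M=2, so N=5 servers, and S=3 groups):

  • group 0: original blocks 0, 1, 2 go to servers 0, 1, 2; recovery blocks 0, 1 go to servers 3, 4
  • group 1: original blocks 3, 4, 5 go to servers 1, 2, 3; recovery blocks 2, 3 go to servers 4, 0
  • group 2: original blocks 6, 7, 8 go to servers 2, 3, 4; recovery blocks 4, 5 go to servers 0, 1

Each server ends up with S=3 blocks, and every recovery group spans all 5 servers.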

@Bulat-Ziganshin commented Jun 9, 2022

This is a writeup of the arguments for using "layout B", i.e. the layout described in my previous message, put in a form suitable for inclusion in the document. The layout is improved compared to the previous version: each next group in the balanced dispersal is now shifted by K blocks instead of 1.

Preconditions

We assume that:

  • The main source of large data losses is situations in which an entire host drops out (temporarily or permanently) or a host loses an entire HDD/SSD
    • Thus, each recovery group should include as many hosts as possible without degrading other characteristics
  • The dataset provided by the user has some internal structure, and sequential data blocks are often more closely related than more distant blocks; e.g. they may be parts of the same file or of closely related files if the dataset is an archive or a disk image
    • Thus, reading R sequential blocks is more likely than reading other, arbitrarily chosen sets of R blocks
    • Also, if we are going to lose a subset of the data, it's probably better to lose R sequential blocks (i.e. one file entirely) than R arbitrary blocks (i.e. one block in each of R files)
  • Download requests should be fairly distributed among all hosts

Simple layout (reasoning and textual description)

The setting:

  • The user selects recovery groups consisting of N=K+M blocks, where K blocks are the original data and M blocks are the computed ECC data
  • The entire dataset provided by the user consists of S such groups
  • The user selects the number of storage hosts H and the amount of data stored on each host

We propose here a layout only for the simple case of H=N hosts, each storing S blocks. More complex layouts are TBD.

Based on the assumptions above, we choose the following layout:

  • In order to maximize recovery chances, we distribute each recovery group over all N hosts. This allows recovering all data even if up to M hosts are offline for any reason
  • In order to maximize sequential read throughput, we form each group from K sequential original blocks
  • In order to evenly distribute download requests between all hosts, we evenly spread ECC blocks over hosts
  • In order to max out sequential read throughput, we ensure that if original block R is placed on host Q, then block R+1 is always placed on host (Q+1) mod N (round-robin placement)

Simple layout (strict math definition)

The algorithm describes how the user-provided data are placed on hosts, and how the ECC data are generated from them and placed as well:

  • On input, we have S*K user-provided data blocks INPUT[I], I=0..S*K-1
  • On output, we have N nodes each storing S blocks OUTPUT[I][J], I=0..N-1, J=0..S-1

First, let's organize data into recovery groups and generate ECC:

  • GROUPED[I][J] = INPUT[I+J*K], I=0..K-1, J=0..S-1 are the original data
  • GROUPED[K .. K+M-1][J] = ECC(GROUPED[0..K-1][J]), J=0..S-1 are the computed ECC data - M recovery blocks computed from K original blocks of the same recovery group

Now, the naive block dispersal (the first K nodes contain all data blocks):

  • OUTPUT[I][J] = GROUPED[I][J], I=0..N-1, J=0..S-1

And balanced, round-robin block dispersal (data and ECC blocks are evenly distributed between nodes):

  • OUTPUT[(I+K*J) mod N][J] = GROUPED[I][J], I=0..N-1, J=0..S-1
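
A minimal Python sketch of these two dispersals (illustrative only, not the Codex implementation; `dummy_ecc` is a hypothetical stand-in for a real Reed-Solomon encoder and block contents are just labels):

```python
def disperse(input_blocks, K, M, encode, balanced=True):
    """Group S*K input blocks into S recovery groups of K sequential blocks,
    append M ECC blocks per group, and map them onto N = K + M nodes."""
    N = K + M
    assert len(input_blocks) % K == 0
    S = len(input_blocks) // K

    # GROUPED[J] = K original blocks followed by M ECC blocks of group J
    grouped = [input_blocks[J * K:(J + 1) * K] for J in range(S)]
    grouped = [g + encode(g) for g in grouped]

    # OUTPUT[node][J]: naive keeps all data on the first K nodes;
    # balanced shifts group J by K*J positions (round-robin dispersal).
    output = [[None] * S for _ in range(N)]
    for J in range(S):
        for I in range(N):
            node = (I + K * J) % N if balanced else I
            output[node][J] = grouped[J][I]
    return output

# Hypothetical example: K=5, M=2, S=3, with a dummy "encoder" that only labels parity.
blocks = [f"D{i}" for i in range(15)]
dummy_ecc = lambda g: [f"P{j}<{g[0]}..{g[-1]}>" for j in range(2)]
for node, stored in enumerate(disperse(blocks, K=5, M=2, encode=dummy_ecc)):
    print(f"node {node}: {stored}")
```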

Block dispersal

[figures illustrating the block dispersal; images courtesy of @leobago]

@dryajov (Contributor, Author) commented Jun 9, 2022

@Bulat-Ziganshin looks good!

> Also, if we are going to lose a subset of the data, it's probably better to lose R sequential blocks (i.e. one file entirely) than R arbitrary blocks (i.e. one block in each of R files)

Just to clarify, we're always treating a dataset as a single unit; this means that there isn't any "preference" as to which blocks are lost, except for preventing entire columns from being lost.

Also, to clarify the terminology, what precisely do you mean by "recovery group"? As far as I can infer, it would be what we refer to as the "column" in the original writeup, i.e. an entire codeword?

@dryajov (Contributor, Author) commented Jun 9, 2022

@leobago the graphic seems to be inverted: K would be the column and S would be the row. Consequently, M is also stacked in the column, and the additional parity columns are simply appended to the original dataset, ending up in the S dimension. This doesn't change with the reshuffling proposed by @Bulat-Ziganshin.

@Bulat-Ziganshin commented Jun 16, 2022

As the third alternative (layout "C"), @dryajov proposed a "diagonal layout", which allows appending ECC blocks to the datastore, simplifies indexing, and simplifies adding recovery on top of pure data or replacing the recovery scheme with a different one.

I will show it by example first. Say we have 3 groups of 5 blocks each and want to add 2 parity blocks to each group. So:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14  => Data blocks
15 16 17 18 19 20  => Parity blocks

Their layout on 5+2=7 nodes will be:

0  7 14  => Node 0
1  8 15  => Node 1
2  9 16  => Node 2
3 10 17  => Node 3
4 11 18  => Node 4
5 12 19  => Node 5
6 13 20  => Node 6

Dmitry's idea is to place recovery groups on the "diagonals" of this matrix, i.e.:

0   8   16  3   11  19  6   => Recovery group 0
7   15  2   10  18  5   13  => Recovery group 1
14  1   9   17  4   12  20  => Recovery group 2

i.e. each group includes exactly one block on each line (i.e. each node), and shifts one position to the right on each next line.

Mathematically, this means that group #I includes exactly one block from each server #J: the block placed at column (I+J) mod S. This block's absolute number is ((I+J) mod S) * N + J.

So recovery group #I includes the blocks with absolute numbers ((I+J) mod S) * N + J for each J in 0..N-1 (N=K+M=5+2=7 and S=3 in this example).
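
A small Python sketch (illustrative only) that computes these diagonal groups and reproduces the example above:

```python
def diagonal_groups(K, M, S):
    """Return, for each recovery group I, the absolute block numbers it
    contains under the diagonal (layout C) placement: one block per node,
    taken from column (I + J) mod S of node J."""
    N = K + M
    return [[((I + J) % S) * N + J for J in range(N)] for I in range(S)]

# Reproduces the example: K=5, M=2 (N=7 nodes), S=3 groups.
for I, group in enumerate(diagonal_groups(K=5, M=2, S=3)):
    print(f"Recovery group {I}: {group}")
# Recovery group 0: [0, 8, 16, 3, 11, 19, 6]
# Recovery group 1: [7, 15, 2, 10, 18, 5, 13]
# Recovery group 2: [14, 1, 9, 17, 4, 12, 20]
```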

@Bulat-Ziganshin commented:

The "layout B" has advantage for sequential reading when some nodes lost: if we have M+K encoding, then we can receive one block per node from any M nodes and perfrom a single decoding operation to recover M sequential data blocks (because they belongs to the single recovery group). In layout A and layout C, sequential M blocks are distributed among different recovery groups.

For example, in the picture below we can read any 4 blocks of [1, 2, 3, 4, P1, P2] and recover the first 4 blocks of the file using a single decode operation. With layout A or C, this requires reading more data and performing multiple decoding operations.

[image]

@dryajov (Contributor, Author) commented Jul 21, 2022

The "layout B"

I'm not sure I follow which layout is which at this point :). It would be helpful if we could write a short summary of each for reference so we can all follow; maybe we can even do this in a separate writeup (in Discussions) and reference it here.

@leobago (Contributor) left a comment:

I was working on our Erasure Coding write-up and thought about checking this document, so I left some comments for improvement. I think the overall explanation is very good, except for a few parts that can be improved. I think I can take a big chunk of this and complement it with other work we have done, in order to have a document ready for posting soon.


## Overview

Erasure coding has several complementary uses in the Codex system. It is superior to replication because it provides greater redundancy at lower storage overheads; it strengthens our remote storage auditing scheme by allowing us to check only a percentage of the blocks rather than checking the entire dataset; and it helps with downloads, as it allows retrieving $K$ arbitrary pieces from any nodes, thus avoiding the so-called "stragglers problem".
Reviewer comment (Contributor):
It would be good to explain why it provides greater redundancy at lower storage overhead (even if it is the 101 of erasure codes).
Checking only a percentage of the blocks is probabilistically correct. However, to have a 100% guarantee you still need to check at least K blocks, which comes back to the same size as the file.
It would also be good to define the stragglers problem.

Reply (Contributor, Author):

> Checking only a percentage of the blocks is probabilistically correct. However, to have a 100% guarantee you still need to check at least K blocks, which comes back to the same size as the file.

The probabilistic assumption here is negligible, which makes this practically equivalent - i.e. 99.99999999....% ~= 100%?
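
As a rough illustration of this point (a sketch with hypothetical numbers, not from the original discussion): if $d$ blocks out of $N$ are missing and $n$ blocks are audited uniformly at random, the probability of catching at least one missing block is $1 - \binom{N-d}{n}/\binom{N}{n}$, which approaches 1 very quickly.

```python
from math import comb

def detection_probability(N, d, n):
    """Probability that auditing n randomly sampled blocks out of N
    detects at least one of d missing blocks."""
    if n > N - d:
        return 1.0  # the sample is larger than the surviving set
    return 1 - comb(N - d, n) / comb(N, n)

# Hypothetical numbers: 100,000 blocks, 1% missing, audit 1,000 random blocks.
print(detection_probability(N=100_000, d=1_000, n=1_000))  # ~0.99996
```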



We employ a Reed-Solomon code with configurable $K$ and $M$ parameters per dataset and an interleaving structure that allows us to overcome the limitation of a small Galois field (GF) size.
Reviewer comment (Contributor):
It would be good to explain this GF limitation.


The size of the codeword determines the maximum number of symbols that an error correcting code can encode and decode as a unit. For example, a Reed-Solomon code that uses a Galois field of $2^8$ can only encode and decode 256 symbols together. In other words, the size of the field imposes a natural limit on the size of the data that can be coded together.

Why not use a larger field? The limitations are mostly practical, either memory or CPU bound, or both. For example, most implementations rely on one or more arithmetic tables, which have a storage overhead linear in the size of the field. In other words, a field of $2^{32}$ would require generating and storing several 4GB tables, and it would still only allow coding 4GB of data at a time. Clearly this isn't enough when the average high-definition video file is several times larger than that, and especially when the expectation is to handle very large, potentially terabyte-size datasets routinely employed in science and big data.
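
A back-of-the-envelope sketch (illustrative only, with hypothetical parameters) of how many separate codewords a large dataset needs once a single codeword is capped by the field size; this splitting is what the interleaving discussed below formalizes:

```python
def codewords_needed(dataset_bytes, symbol_bytes, K, M, field_bits=8):
    """How many separate codewords a dataset needs when a single codeword
    can hold at most 2**field_bits symbols (K data + M parity)."""
    assert K + M <= 2 ** field_bits, "codeword cannot exceed the field size"
    data_symbols = -(-dataset_bytes // symbol_bytes)  # ceil division
    return -(-data_symbols // K)                      # ceil(data_symbols / K)

# Hypothetical numbers: a 1 TB dataset, 64 KiB symbols, K=200, M=55 over GF(2^8).
print(codewords_needed(10**12, 64 * 1024, K=200, M=55))  # 76294 codewords
```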
Reviewer comment (Contributor):
While the explanation is great, I don't think the argument is completely correct. One can encode multi-terabyte files by simply dividing them into smaller segments.

Reply (Contributor, Author):

This is in the context of small vs. large codeword codes, which allow encoding large chunks of data without (as you suggest) having to split them into smaller chunks. This is a prelude/setup for the following sections, which lay out the interleaving, which is the "dividing them into smaller segments".

Reply (Contributor, Author):

Maybe this can be further extended to clarify this context?



## Interleaving
Reviewer comment (Contributor):
I think this whole section needs to be better explained; it is not clear to me what you are trying to say here.

Reply (Contributor, Author):

Yes, this isn't a simple topic to grasp, and it might require a deeper explanation. I'm trying to elucidate the difference between large and small codeword sizes and why they are a limitation in the context of some GF and Reed-Solomon implementations, which effectively limit the size of the field and, as a consequence, the amount of data that can be encoded as a whole.

The problem here is that most of the commonly used algorithms so far have been O(n^2), which practically limits the code to a small codeword. However, this isn't the case for all implementations anymore, and more modern algorithms allow working with much larger fields, for example FastECC and Leopard.

Understanding the meaning of codeword size is important because this is why we need to do interleaving in the first place. I'm obviously doing a bad job explaining this.


A secondary but important requirement is to allow appending data without having to re-encode the entire dataset.

### Codex Interleaving
Reviewer comment (Contributor):
This section is nicely explained


Moreover, the resulting dataset is still systematic, which means that the original blocks are unchanged and can be used without prior decoding.

## Data placement and erasures
Reviewer comment (Contributor):
Nicely explained as well


However, the code is $K$ rows strong only if we operate under the assumption that failures are independent of each other. Thus, it is a requirement that each row is placed independently. Moreover, the overall strength of the code decreases with the number of dependent failures. If we place the $N$ rows on only two independent locations, we can only tolerate $M=N/2$ failures; three locations allow tolerating $M=N/3$ failures, and so on. Hence, the code is only as strong as the number of independent locations each element is stored on. Thus, each row, and better yet each element (block) of the matrix, should be stored independently and in a pattern mitigating dependence, meaning that placing elements of the same column together should be avoided.

### Load balancing retrieval
Reviewer comment (Contributor):
Very clear section. Maybe we can extend it by exploring some placing strategies we discussed with Bulat.

@Bulat-Ziganshin commented Aug 2, 2022:

I volunteered to write a text comparing the 3 placing strategies, with the ultimate goal of making a post for our blog.


Some placing strategies will be explored in a further document.

### Adversarial vs random erasures
Reviewer comment (Contributor):
Good. Perhaps add a link to our PoR docs?
