
erasure coding design overview #85

Open · wants to merge 7 commits into base: master
Conversation

@dryajov (Contributor) commented May 28, 2022

Just a WIP for now; some sections are still missing.

@dryajov marked this pull request as ready for review on May 31, 2022 00:54
@cskiraly (Contributor) commented:

I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.

@dryajov (Contributor, Author) commented May 31, 2022

> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.

What do you mean by the other way around? Not sure if I follow :)

@cskiraly (Contributor) commented Jun 2, 2022

>> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.
>
> What do you mean by the other way around? Not sure if I follow :)

What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, forming protection and also acceleration over adjacent blocks. This assumes that the original block order actually matters to those accessing the data, i.e. that sequential access is a typical access pattern.

@dryajov (Contributor, Author) commented Jun 2, 2022

>>> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.
>>
>> What do you mean by the other way around? Not sure if I follow :)
>
> What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, forming protection and also acceleration over adjacent blocks. This assumes that the original block order actually matters to those accessing the data, i.e. that sequential access is a typical access pattern.

I see; yes, this can make sense, but it heavily depends on the placement we choose. I think this needs to be clarified further in the text and worded in such a way that it is clear that placement considerations would affect interleaving and vice versa.

@dryajov (Contributor, Author) commented Jun 2, 2022

I think we can leave it as it is for now (with the appropriate clarification), since this is how it is implemented currently, but we can change it if it becomes a bottleneck. This is one part that might also be per-dataset configurable...

@leobago (Contributor) commented Jun 2, 2022

If we consider that a slot is a row (row: blocks belonging to different encoding columns, to disperse erasures), then the ordering proposed by @dryajov will be beneficial for those reading sequential sections of the dataset. The other configuration (having sequential blocks 1, 2, 3, ... in a column) would only provide acceleration benefits during the encoding. Assuming we encode once and read several times, the ordering proposed by @dryajov seems more beneficial to me.

@Bulat-Ziganshin commented:

We may have several different layouts. For example, let's specify one of them, a simplified layout with fair distribution of download bandwidth:

The user provides S*K blocks and requests K+M encoding, saving the data over K+M nodes (thus each node stores S blocks).

For the first K nodes, block number Y on node X = original block number Y*K+X (i.e. blocks 0, 1 .. K-1, K... go into nodes 0, 1 .. K-1, 0...). In other words, original block number I goes into block number I/K on node I%K.

For the last M nodes, block number Y on node X is part of the recovery group consisting of the blocks numbered Y on all K+M nodes, i.e. original blocks Y*K+i, i=0..K-1, plus M recovery blocks.

Note however that this scheme places all original data on the first K nodes and all recovery data on the last M nodes. If we want to spread recovery data fairly over all nodes, we should reshuffle data between servers in each recovery group (note that a single recovery group still occupies block Y on all K+M servers).

The simplest way to do that is to shift every next group by one position relative to the previous one:

  • blocks 0 .. K-1 are mapped to servers 0 .. K-1, correspondingly
  • blocks K .. 2*K-1 are mapped to servers 1 .. K, correspondingly
  • blocks 2*K .. 3*K-1 are mapped to servers 2 .. K+1, correspondingly
    and so on.

Recovery blocks are shifted accordingly, filling the remaining servers in each recovery group:

  • recovery blocks 0 .. M-1 (i.e. recovering data from original blocks 0..K-1) are mapped to servers K .. K+M-1, correspondingly
  • recovery blocks M .. 2*M-1 are mapped to servers K+1 .. K+M-1, 0, correspondingly
  • recovery blocks 2*M .. 3*M-1 are mapped to servers K+2 .. K+M-1, 0, 1, correspondingly
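
For concreteness, a small worked example (hypothetical parameters K=3, M=2, so N=5 servers, and S=3 groups):

  • group 0: original blocks 0, 1, 2 go to servers 0, 1, 2; recovery blocks 0, 1 go to servers 3, 4
  • group 1: original blocks 3, 4, 5 go to servers 1, 2, 3; recovery blocks 2, 3 go to servers 4, 0
  • group 2: original blocks 6, 7, 8 go to servers 2, 3, 4; recovery blocks 4, 5 go to servers 0, 1

Each server ends up with S=3 blocks, and every recovery group spans all 5 servers.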

@Bulat-Ziganshin commented Jun 9, 2022

This is a writeup of the arguments for using "layout B", i.e. the layout described in my previous message, put in a form suitable for inclusion in the document. The layout is improved compared to the previous version: each next group in the balanced dispersal is now shifted by K blocks instead of 1.

Preconditions

We assume that:

  • The main source of large data losses is situations in which an entire host drops out (temporarily or permanently) or a host loses an entire HDD/SSD
    • Thus, each recovery group should include as many hosts as possible without degrading other characteristics
  • The dataset provided by the user has some internal structure, and sequential data blocks are often more closely related than more distant blocks; e.g. they may be parts of the same file or of closely related files if the dataset is an archive or a disk image
    • Thus, reading R sequential blocks is more likely than reading other, arbitrarily chosen sets of R blocks
    • Also, if we are going to lose a subset of the data, it's probably better to lose R sequential blocks (i.e. one file entirely) than R arbitrary blocks (i.e. one block in each of R files)
  • Download requests should be fairly distributed among all hosts

Simple layout (reasoning and textual description)

The setting:

  • The user selects recovery groups consisting of N=K+M blocks, where K blocks are the original data and M blocks are the computed ECC data
  • The entire dataset provided by the user consists of S such groups
  • The user selects the number of storage hosts H and the amount of data stored on each host

We propose here a layout only for the simple case of H=N hosts, each storing S blocks. More complex layouts are TBD.

Based on the assumptions above, we choose the following layout:

  • In order to maximize recovery chances, we distribute each recovery group over all N hosts. This allows recovering all data even if up to M hosts are offline for any reason
  • In order to maximize sequential read throughput, we form each group from K sequential original blocks
  • In order to evenly distribute download requests between all hosts, we evenly spread ECC blocks over hosts
  • In order to max out sequential read throughput, we ensure that if original block R is placed on host Q, then block R+1 is always placed on host (Q+1) mod N (round-robin placement)

Simple layout (strict math definition)

The algorithm describes how the user-provided data are placed on hosts, and how the ECC data are generated from them and placed as well:

  • On input, we have S*K user-provided data blocks INPUT[I], I=0..S*K-1
  • On output, we have N nodes each storing S blocks OUTPUT[I][J], I=0..N-1, J=0..S-1

First, let's organize data into recovery groups and generate ECC:

  • GROUPED[I][J] = INPUT[I+J*K], I=0..K-1, J=0..S-1 are the original data
  • GROUPED[K .. K+M-1][J] = ECC(GROUPED[0..K-1][J]), J=0..S-1 are the computed ECC data - M recovery blocks computed from K original blocks of the same recovery group

Now, the naive block dispersal (the first K nodes contain all data blocks):

  • OUTPUT[I][J] = GROUPED[I][J], I=0..N-1, J=0..S-1

And balanced, round-robin block dispersal (data and ECC blocks are evenly distributed between nodes):

  • OUTPUT[(I+K*J) mod N][J] = GROUPED[I][J], I=0..N-1, J=0..S-1
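
A minimal Python sketch of these two dispersals (illustrative only, not the Codex implementation; `dummy_ecc` is a hypothetical stand-in for a real Reed-Solomon encoder and block contents are just labels):

```python
def disperse(input_blocks, K, M, encode, balanced=True):
    """Group S*K input blocks into S recovery groups of K sequential blocks,
    append M ECC blocks per group, and map them onto N = K + M nodes."""
    N = K + M
    assert len(input_blocks) % K == 0
    S = len(input_blocks) // K

    # GROUPED[J] = K original blocks followed by M ECC blocks of group J
    grouped = [input_blocks[J * K:(J + 1) * K] for J in range(S)]
    grouped = [g + encode(g) for g in grouped]

    # OUTPUT[node][J]: naive keeps all data on the first K nodes;
    # balanced shifts group J by K*J positions (round-robin dispersal).
    output = [[None] * S for _ in range(N)]
    for J in range(S):
        for I in range(N):
            node = (I + K * J) % N if balanced else I
            output[node][J] = grouped[J][I]
    return output

# Hypothetical example: K=5, M=2, S=3, with a dummy "encoder" that only labels parity.
blocks = [f"D{i}" for i in range(15)]
dummy_ecc = lambda g: [f"P{j}<{g[0]}..{g[-1]}>" for j in range(2)]
for node, stored in enumerate(disperse(blocks, K=5, M=2, encode=dummy_ecc)):
    print(f"node {node}: {stored}")
```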

Block dispersal

[figures illustrating the block dispersal; images courtesy of @leobago]

@dryajov (Contributor, Author) commented Jun 9, 2022

@Bulat-Ziganshin looks good!

> Also, if we are going to lose a subset of the data, it's probably better to lose R sequential blocks (i.e. one file entirely) than R arbitrary blocks (i.e. one block in each of R files)

Just to clarify, we're always treating a dataset as a single unit; this means that there isn't any "preference" as to which blocks are lost, except for preventing entire columns from being lost.

Also, to clarify the terminology, what precisely do you mean by "recovery group"? As far as I can infer, it would be what we refer to as the "column" in the original writeup, i.e. an entire codeword?

@dryajov (Contributor, Author) commented Jun 9, 2022

@leobago the graphic seems to be inverted: K would be the column and S would be the row. Consequently, M is also stacked in the column, and the additional parity columns are simply appended to the original dataset, ending up in the S dimension. This doesn't change with the reshuffling proposed by @Bulat-Ziganshin.

@Bulat-Ziganshin commented Jun 16, 2022

As the third alternative (layout "C"), @dryajov proposed a "diagonal layout", which allows appending ECC blocks to the datastore, simplifies indexing, and simplifies adding recovery on top of pure data or replacing the recovery scheme with a different one.

I will show it by example first. Say we have 3 groups of 5 blocks each and want to add 2 parity blocks to each group. So:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14  => Data blocks
15 16 17 18 19 20  => Parity blocks

Their layout on 5+2=7 nodes will be:

0  7 14  => Node 0
1  8 15  => Node 1
2  9 16  => Node 2
3 10 17  => Node 3
4 11 18  => Node 4
5 12 19  => Node 5
6 13 20  => Node 6

Dmitry's idea is to place recovery groups on the "diagonals" of this matrix, i.e.:

0   8   16  3   11  19  6   => Recovery group 0
7   15  2   10  18  5   13  => Recovery group 1
14  1   9   17  4   12  20  => Recovery group 2

i.e. each group includes exactly one block on each line (i.e. each node), and shifts one position to the right on each next line.

Mathematically, this means that group #I includes exactly one block from each server #J: the block placed at column (I+J) mod S. This block's absolute number is ((I+J) mod S) * N + J.

So recovery group #I includes the blocks with absolute numbers ((I+J) mod S) * N + J for each J in 0..N-1 (N=K+M=5+2=7 and S=3 in this example).
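
A small Python sketch (illustrative only) that computes these diagonal groups and reproduces the example above:

```python
def diagonal_groups(K, M, S):
    """Return, for each recovery group I, the absolute block numbers it
    contains under the diagonal (layout C) placement: one block per node,
    taken from column (I + J) mod S of node J."""
    N = K + M
    return [[((I + J) % S) * N + J for J in range(N)] for I in range(S)]

# Reproduces the example: K=5, M=2 (N=7 nodes), S=3 groups.
for I, group in enumerate(diagonal_groups(K=5, M=2, S=3)):
    print(f"Recovery group {I}: {group}")
# Recovery group 0: [0, 8, 16, 3, 11, 19, 6]
# Recovery group 1: [7, 15, 2, 10, 18, 5, 13]
# Recovery group 2: [14, 1, 9, 17, 4, 12, 20]
```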

@Bulat-Ziganshin commented:

The "layout B" has advantage for sequential reading when some nodes lost: if we have M+K encoding, then we can receive one block per node from any M nodes and perfrom a single decoding operation to recover M sequential data blocks (because they belongs to the single recovery group). In layout A and layout C, sequential M blocks are distributed among different recovery groups.

For example, in the picture below we can read any 4 blocks of [1, 2, 3, 4, P1, P2] and recover the first 4 blocks of the file using a single decode operation. With layout A or C, this requires reading more data and performing multiple decoding operations.

[image]

@dryajov (Contributor, Author) commented Jul 21, 2022

The "layout B"

I'm not sure I follow which layout is which at this point :). It would be helpful if we could write a short summary of each for reference so we can all follow; maybe we can even do this in a separate writeup (in Discussions) and reference it here.

@leobago (Contributor) left a comment:

I was working on our Erasure Coding write-up and thought about checking this document, so I left some comments for improvement. I think the overall explanation is very good, except for a few parts that can be improved. I think I can take a big chunk of this and complement it with other work we have done, in order to have a document ready for posting soon.


## Overview

Erasure coding has several complementary uses in the Codex system. It is superior to replication because it provides greater redundancy at lower storage overheads; it strengthens our remote storage auditing scheme by allowing us to check only a percentage of the blocks rather than checking the entire dataset; and it helps with downloads, as it allows retrieving $K$ arbitrary pieces from any nodes, thus avoiding the so-called "stragglers problem".
Reviewer comment (Contributor):
It would be good to explain why it provides greater redundancy at lower storage overhead (even if it is the 101 of erasure codes).
Checking only a percentage of the blocks is probabilistically correct. However, to have a 100% guarantee you still need to check at least K blocks, which comes back to the same size as the file.
It would also be good to define the stragglers problem.

Reply (Contributor, Author):

> Checking only a percentage of the blocks is probabilistically correct. However, to have a 100% guarantee you still need to check at least K blocks, which comes back to the same size as the file.

The probabilistic assumption here is negligible, which makes this practically equivalent - i.e. 99.99999999....% ~= 100%?
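
As a rough illustration of this point (a sketch with hypothetical numbers, not from the original discussion): if $d$ blocks out of $N$ are missing and $n$ blocks are audited uniformly at random, the probability of catching at least one missing block is $1 - \binom{N-d}{n}/\binom{N}{n}$, which approaches 1 very quickly.

```python
from math import comb

def detection_probability(N, d, n):
    """Probability that auditing n randomly sampled blocks out of N
    detects at least one of d missing blocks."""
    if n > N - d:
        return 1.0  # the sample is larger than the surviving set
    return 1 - comb(N - d, n) / comb(N, n)

# Hypothetical numbers: 100,000 blocks, 1% missing, audit 1,000 random blocks.
print(detection_probability(N=100_000, d=1_000, n=1_000))  # ~0.99996
```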



We employ a Reed-Solomon code with configurable $K$ and $M$ parameters per dataset and an interleaving structure that allows us to overcome the limitation of a small Galois field (GF) size.
Reviewer comment (Contributor):
It would be good to explain this GF limitation.


The size of the codeword determines the maximum number of symbols that an error correcting code can encode and decode as a unit. For example, a Reed-Solomon code that uses a Galois field of $2^8$ can only encode and decode 256 symbols together. In other words, the size of the field imposes a natural limit on the size of the data that can be coded together.

Why not use a larger field? The limitations are mostly practical, either memory or CPU bound, or both. For example, most implementations rely on one or more arithmetic tables, which have a storage overhead linear in the size of the field. In other words, a field of $2^{32}$ would require generating and storing several 4GB tables, and it would still only allow coding 4GB of data at a time. Clearly this isn't enough when the average high-definition video file is several times larger than that, and especially when the expectation is to handle very large, potentially terabyte-size datasets routinely employed in science and big data.
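
A back-of-the-envelope sketch (illustrative only, with hypothetical parameters) of how many separate codewords a large dataset needs once a single codeword is capped by the field size; this splitting is what the interleaving discussed below formalizes:

```python
def codewords_needed(dataset_bytes, symbol_bytes, K, M, field_bits=8):
    """How many separate codewords a dataset needs when a single codeword
    can hold at most 2**field_bits symbols (K data + M parity)."""
    assert K + M <= 2 ** field_bits, "codeword cannot exceed the field size"
    data_symbols = -(-dataset_bytes // symbol_bytes)  # ceil division
    return -(-data_symbols // K)                      # ceil(data_symbols / K)

# Hypothetical numbers: a 1 TB dataset, 64 KiB symbols, K=200, M=55 over GF(2^8).
print(codewords_needed(10**12, 64 * 1024, K=200, M=55))  # 76294 codewords
```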
Reviewer comment (Contributor):
While the explanation is great, I don't think the argument is completely correct. One can encode multi-terabyte files by simply dividing them into smaller segments.

Reply (Contributor, Author):

This is in the context of small vs. large codeword codes, which allow encoding large chunks of data without (as you suggest) having to split them into smaller chunks. This is a prelude/setup for the following sections, which lay out the interleaving, which is the "dividing them into smaller segments".

Reply (Contributor, Author):

Maybe this can be further extended to clarify this context?



## Interleaving
Reviewer comment (Contributor):
I think this whole section needs to be better explained; it is not clear to me what you are trying to say here.

Reply (Contributor, Author):

Yes, this isn't a simple topic to grasp, and it might require a deeper explanation. I'm trying to elucidate the difference between large and small codeword sizes and why they are a limitation in the context of some GF and Reed-Solomon implementations, which effectively limit the size of the field and, as a consequence, the amount of data that can be encoded as a whole.

The problem here is that most of the commonly used algorithms so far have been O(n^2), which practically limits the code to a small codeword. However, this isn't the case for all implementations anymore, and more modern algorithms allow working with much larger fields, for example FastECC and Leopard.

Understanding the meaning of codeword size is important because this is why we need to do interleaving in the first place. I'm obviously doing a bad job explaining this.


A secondary but important requirement is to allow appending data without having to re-encode the entire dataset.

### Codex Interleaving
Reviewer comment (Contributor):
This section is nicely explained


Moreover, the resulting dataset is still systematic, which means that the original blocks are unchanged and can be used without prior decoding.

## Data placement and erasures
Reviewer comment (Contributor):
Nicely explained as well


However, the code is $K$ rows strong only if we operate under the assumption that failures are independent of each other. Thus, it is a requirement that each row is placed independently. Moreover, the overall strength of the code decreases with the number of dependent failures. If we place the $N$ rows on only two independent locations, we can only tolerate $M=N/2$ failures; three locations allow tolerating $M=N/3$ failures, and so on. Hence, the code is only as strong as the number of independent locations each element is stored on. Thus, each row, and better yet each element (block) of the matrix, should be stored independently and in a pattern mitigating dependence, meaning that placing elements of the same column together should be avoided.

### Load balancing retrieval
Reviewer comment (Contributor):
Very clear section. Maybe we can extend it by exploring some placing strategies we discussed with Bulat.

@Bulat-Ziganshin commented Aug 2, 2022:

I volunteered to write a text comparing the 3 placing strategies, with the ultimate goal of making a post for our blog.


Some placing strategies will be explored in a further document.

### Adversarial vs random erasures
Reviewer comment (Contributor):
Good. Perhaps add a link to our PoR docs?
