erasure coding design overview #85
base: master
Conversation
I still think the ordering, at the block level (not at the symbol level), should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case the reversed ordering allows for parallel acceleration of sequential reads that you would not have with this ordering.
What do you mean by the other way around? Not sure if I follow :)
What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, forming protection and also acceleration over adjacent blocks. This is under the assumption that the original block order actually matters for those accessing the data, i.e. that it is a typical access pattern.
I see, yes this can make sense but it heavily depends on the placement we choose. I think this needs to be further clarified in the text and worded in such a way that it is clear that placement considerations would affect interleaving and vice-versa.
I think we can leave it as it is for now (with the appropriate clarification), since this is how it is implemented currently, but we can change it if it becomes a bottleneck. This is one part that might also be per-dataset configurable...
If we consider that a slot is a row (row: blocks belonging to different encoding columns to disperse erasures) then the ordering proposed by @dryajov will be beneficial for those reading sequential sections of the dataset. The other configuration (having sequential blocks 1, 2, 3.. in a column) would only provide acceleration benefits during the encoding. Assuming we encode once and read several times, the ordering proposed by @dryajov seems more beneficial to me.
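For concreteness, here is a minimal Python sketch of the two orderings being discussed (the grid dimensions and function names are illustrative, not taken from the implementation); it assumes each column of K blocks is encoded together as one codeword:

```python
# Two ways to lay original blocks 0..K*S-1 onto a K x S interleaving grid,
# assuming each column of K blocks forms one codeword / recovery group.

def column_major(K: int, S: int):
    """Sequential original blocks run down a column: column c holds blocks
    c*K .. c*K+K-1, so one encode/decode touches K consecutive blocks."""
    return [[c * K + r for c in range(S)] for r in range(K)]

def row_major(K: int, S: int):
    """Sequential original blocks run along a row: column c holds blocks
    c, c+S, c+2S, ..., so consecutive blocks land in different codewords
    and a sequential read can be accelerated by decoding them in parallel."""
    return [[r * S + c for c in range(S)] for r in range(K)]

if __name__ == "__main__":
    K, S = 4, 6
    for name, grid in (("column-major", column_major(K, S)),
                       ("row-major", row_major(K, S))):
        print(name)
        for row in grid:
            print(row)
```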
We may have several different layouts. For example, let's specify one of them, a simplified one with a fair distribution of download bandwidth:

- User provides …
- For the first K nodes, block number Y on node X = original block number …
- For the last M nodes, block number Y on node X is part of the recovery group consisting of blocks number Y on all K+M nodes, i.e. original blocks …

Note however that this scheme places all original data on the first K nodes and all recovery data on the last M nodes. If we want to spread recovery data fairly over all nodes, we should reshuffle data between servers in each recovery group (note that a single recovery group still occupies block Y on all K+M servers). The simplest way to do that is to shift every next group by one position relative to the previous one:
Recovery blocks are shifted accordingly, filling remaining servers in each recovery group:
That's a writeup of arguments for using "layout B", i.e. the layout described in my previous message, trying to put it in a form suitable for inclusion in the document. We have improved the layout compared to the previous version, now shifting each next group in the balanced dispersal by K blocks instead of 1.

Preconditions

We assume that:
Simple layout (reasoning and textual description)

The setting:
We propose here a layout only for the simple case of H=N hosts, each storing S blocks. More complex layouts are TBD. Based on the assumptions above, we choose the following layout:
Simple layout (strict math definition)

The algorithm describes how user-provided data are placed on hosts, and how ECC data are generated from them and placed as well:
First, let's organize data into recovery groups and generate ECC:
Now, the naive block dispersal (the first K nodes contain all data blocks):
And the balanced, round-robin block dispersal (data and ECC blocks are evenly distributed between nodes):
Images are courtesy of @leobago
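A hedged Python sketch of the two dispersals described above (the index formulas are my reconstruction of the textual description, and the `shift` parameter covers both the shift-by-1 and the shift-by-K variants mentioned in the thread):

```python
# Reconstruction sketch: S recovery groups, each with K data + M parity
# blocks, placed on N = K + M nodes; group y occupies position y on every
# node. "Balanced" rotates each group so parity is spread over all nodes.

def naive_dispersal(K: int, M: int, S: int):
    """nodes[x][y]: node x stores, at position y, data block y*K + x when
    x < K, or parity block (group y, parity index x - K) when x >= K."""
    N = K + M
    return [[("data", y * K + x) if x < K else ("parity", y, x - K)
             for y in range(S)]
            for x in range(N)]

def balanced_dispersal(K: int, M: int, S: int, shift: int = 1):
    """Same recovery groups, but group y is rotated by (y * shift) mod N
    node positions; shift = 1 and shift = K are the variants discussed."""
    N = K + M
    naive = naive_dispersal(K, M, S)
    nodes = [[None] * S for _ in range(N)]
    for y in range(S):
        for x in range(N):
            nodes[(x + y * shift) % N][y] = naive[x][y]
    return nodes

if __name__ == "__main__":
    for node in balanced_dispersal(K=3, M=2, S=4, shift=3):
        print(node)
```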
@Bulat-Ziganshin looks good!
Just to clarify, we're always treating a dataset as a single unit; this means that there isn't any "preference" in which blocks are lost, except for preventing entire columns from being lost. Also, to clarify the terminology, what precisely do you mean by "recovery group"? As far as I can infer, it would be what we refer to as the "column" in the original writeup, i.e. an entire codeword?
@leobago the graphic seems to be inverted,
As a third alternative (layout "C"), @dryajov proposed a "diagonal layout", which allows appending ECC blocks to the datastore, simplifies indexing, and simplifies adding recovery on top of pure data or replacing the recovery scheme with a different one. I will show it by example first. Say we have 3 groups of 5 blocks each and want to add 2 parity blocks to each group. So:
Their layout on 5+2=7 nodes will be:
Dmitry's idea is to place recovery groups on the "diagonals" of this matrix, i.e.:
i.e. each group includes exactly one block on each line (i.e. each node), and shifts one position right on each next line. Mathematically, this means that group #I includes exactly one block from each server #J - the block placed at column …. So, the recovery group #I includes blocks with absolute numbers …
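A hedged sketch of the diagonal mapping (the column formula and the absolute block numbering are my reconstruction of the elided expressions, assuming blocks are numbered consecutively within each node):

```python
# Layout C sketch: num_groups recovery groups spread over the "diagonals" of
# a (num_nodes x num_groups) placement matrix; each group holds exactly one
# block per node and steps one column to the right on each next node.

def diagonal_layout(num_nodes: int, num_groups: int):
    """Return {group I: [(node J, column, absolute block number), ...]}."""
    layout = {}
    for I in range(num_groups):
        cells = []
        for J in range(num_nodes):
            col = (I + J) % num_groups        # one position right per node
            cells.append((J, col, J * num_groups + col))
        layout[I] = cells
    return layout

if __name__ == "__main__":
    # The example from the thread: 3 groups of 5 + 2 blocks on 7 nodes.
    for group, cells in diagonal_layout(num_nodes=7, num_groups=3).items():
        print(f"group {group}: {cells}")
```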
The "layout B" has advantage for sequential reading when some nodes lost: if we have M+K encoding, then we can receive one block per node from any M nodes and perfrom a single decoding operation to recover M sequential data blocks (because they belongs to the single recovery group). In layout A and layout C, sequential M blocks are distributed among different recovery groups. F.e. in the picture below we can ready any 4 blocks of [1, 2, 3, 4, P1, P2] and recover first 4 blocks of the file using a single decode operation. With layout A or C, it requires to read more data and perform multiple decoding operations. |
I'm not sure I follow which layout is which at this point :). It would be helpful if we could write a short summary of each for reference so we can all follow; maybe we can even do this in a separate writeup (in discussions) and reference it here.
I was working on our Erasure Coding write-up, and I thought about checking this document, so I left some comments for improvement. I think the overall explanation is very good, except for a few parts that can be improved. I think I can take a big chunk of this and complement it with other work we have done in order to have a document ready for posting soon.
## Overview
Erasure coding has several complementary uses in the Codex system. It is superior to replication because it provides greater redundancy at lower storage overhead; it strengthens our remote storage auditing scheme by allowing us to check only a percentage of the blocks rather than the entire dataset; and it helps with downloads, as it allows retrieving $K$ arbitrary pieces from any nodes, thus avoiding the so-called "stragglers problem".
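As a rough numeric illustration of the redundancy-versus-overhead claim (the figures below are illustrative, not taken from the document):

```python
# Storage overhead (stored bytes / original bytes) and tolerated losses per
# codeword for plain replication vs a systematic K+M erasure code.

def replication(copies: int):
    return copies, copies - 1      # overhead, full copies that may be lost

def erasure_code(K: int, M: int):
    return (K + M) / K, M          # overhead, arbitrary blocks that may be lost

print(replication(3))              # (3, 2): 3x storage, survives 2 losses
print(erasure_code(10, 5))         # (1.5, 5): 1.5x storage, survives any 5 losses
```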
It would be good to explain why it provides greater redundancy at lower storage overhead. (Even if it is the 101 of erasure codes).
Checking only a percentage of the blocks is probabilistically correct. However, to have a 100% guarantee you still need to check at least K blocks, which comes back to the same size as the file.
It would be good to define the stragglers problem.
> Checking only a percentage of the blocks is probabilistically correct. However, to have a 100% guarantee you still need to check at least K blocks, which comes back to the same size as the file.
The probability of the check missing something here is negligible, which makes this practically equivalent - i.e. 99.99999999...% ~= 100%?
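To put a rough number on this, a small sketch of the sampling argument (parameters are illustrative): if d blocks out of N are missing, the probability that a uniform sample of s blocks misses all of them is hypergeometric, and since an erasure-coded dataset only becomes unrecoverable after many blocks are lost, the miss probability in that regime is negligible:

```python
from math import comb

def miss_probability(N: int, d: int, s: int) -> float:
    """P(a uniform random sample of s distinct blocks out of N contains
    none of the d missing blocks)."""
    if d + s > N:
        return 0.0
    return comb(N - d, s) / comb(N, s)

# 10_000 blocks with 1% missing:
print(miss_probability(10_000, 100, 100))    # ~0.37 when auditing 1% of blocks
print(miss_probability(10_000, 100, 1_000))  # ~2.6e-5 when auditing 10%
```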
We employ a Reed-Solomon code with configurable $K$ and $M$ parameters per dataset, and an interleaving structure that allows us to overcome the limitation of a small Galois field (GF) size.
It would be good to explain this GF limitation.
The size of the codeword determines the maximum number of symbols that an error-correcting code can encode and decode as a unit. For example, a Reed-Solomon code that uses a Galois field of $2^8$ can only encode and decode 256 symbols together. In other words, the size of the field imposes a natural limit on the size of the data that can be coded together.
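A back-of-the-envelope sketch of what this limit means in practice (the dataset size is illustrative):

```python
# With GF(2^8) and one-byte symbols, a single Reed-Solomon codeword covers at
# most 256 symbols (data plus parity), so a large dataset must be split into
# many independent codewords - which is what the interleaving below is for.

FIELD_BITS = 8
MAX_SYMBOLS = 2 ** FIELD_BITS            # symbols per codeword, data + parity
SYMBOL_BYTES = FIELD_BITS // 8           # 1 byte per GF(2^8) symbol

dataset_bytes = 1 << 30                  # a 1 GiB dataset
codeword_bytes = MAX_SYMBOLS * SYMBOL_BYTES
print(dataset_bytes // codeword_bytes)   # ~4.2 million codewords needed
```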
Why not use a larger field? The limitations are mostly practical, either memory or CPU bound, or both. For example, most implementations rely on one or more arithmetic tables, which have a storage overhead linear in the size of the field. In other words, a field of $2^{32}$ would require generating and storing several 4GB tables, and it would still only allow coding 4GB of data at a time. Clearly this isn't enough when the average high-definition video file is several times larger than that, and especially when the expectation is to handle very large, potentially terabyte-size datasets routinely employed in science and big data.
While the explanation is great, I don't think the argument is completely correct. One can encode multi-terabyte files by simply dividing them into smaller segments.
This is in the context of small vs large codeword codes, which allow encoding large chunks of data without (as you suggest) having to split them into smaller chunks. This is a prelude/setup for the following sections, which lay out the interleaving, i.e. the "dividing them into smaller segments".
Maybe this can be further extended to clarify this context?
## Interleaving
I think this whole section needs to be better explained, it is not clear to me what you are trying to say here.
Yes, this isn't a simple topic to grasp and it might require a deeper explanation. I'm trying to elucidate the difference between large and small codeword sizes and why they are a limitation in the context of some GF and Reed-Solomon implementations, which effectively limit the size of the field and, as a consequence, the amount of data that is encoded as a whole.
The problem here is that most of the commonly used algorithms so far have been O(n^2), which practically limits the code to a small codeword. However, this isn't the case for all implementations anymore, and more modern algorithms allow working with much larger fields, for example FastECC and Leopard.
Understanding the meaning of codeword size is important because this is why we need to do interleaving in the first place. I'm obviously doing a bad job explaining this.
A secondary but important requirement is to allow appending data without having to re-encode the entire dataset.

### Codex Interleaving
This section is nicely explained
Moreover, the resulting dataset is still systematic, which means that the original blocks are unchanged and can be used without prior decoding.

## Data placement and erasures
Nicely explained as well
However, the code is $K$ rows strong only if we operate under the assumption that failures are independent of each other. Thus, it is a requirement that each row is placed independently. Moreover, the overall strength of the code decreases with the number of dependent failures. If we place $N$ rows in only two independent locations, we can only tolerate $M=N/2$ failures; three locations allow tolerating $M=N/3$ failures, and so on. Hence, the code is only as strong as the number of independent locations each element is stored in. Thus, each row, and better yet each element (block) of the matrix, should be stored independently and in a pattern mitigating dependence, meaning that placing elements of the same column together should be avoided.
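One way to make the argument concrete, as a hedged sketch (the failure model and the formula are my interpretation, not taken from the document): if one codeword's $N = K + M$ blocks are spread over only $L$ independent locations, a single location failure erases roughly $N/L$ of them at once, so the number of simultaneous location failures the code survives shrinks accordingly:

```python
from math import ceil

def tolerated_location_failures(K: int, M: int, L: int) -> int:
    """Location failures survivable when one codeword's K+M blocks are
    spread evenly over L independent locations (interpretation sketch)."""
    blocks_per_location = ceil((K + M) / L)
    return M // blocks_per_location

print(tolerated_location_failures(10, 5, 15))  # 5: one block per location
print(tolerated_location_failures(10, 5, 3))   # 1: five blocks per location
print(tolerated_location_failures(10, 5, 2))   # 0: two locations is not enough
```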
### Load balancing retrieval
Very clear section. Maybe we can extend it by exploring some placing strategies we discussed with Bulat.
I volunteered to write a text comparing the 3 placing strategies, with the ultimate goal of making a post for our blog.
Some placing strategies will be explored in a further document.

### Adversarial vs random erasures
Good. Perhaps add a link to our PoR docs?
Just a WIP for now; some sections are still missing.