Download files without padding #218
To do:
d128fd4 to 6d3f6d0
Looks good for the most part, but as I noted elsewhere, we should not overspecialize StoreStream/ManifestStream. We don't have special cases for ECC anywhere in the application besides the actual erasure coding piece itself, and we should keep it like this.
This change will break our current PoR, as it requires generating authenticators for all the blocks, including the ECC blocks. Even if that changes, we lose symmetry in the primitive and it becomes less useful overall.
I would suggest we do the following:
- StoreStream should allow retrieving both padded and unpadded blocks
- StoreStream should allow reading all the data in the manifest, regardless of whether it's protected or not
- If we incorporate the change I suggested above with `size(pad=true/false)` instead of `originalBytesPadded`, we'll make the API more consistent
- Finally, we should not specialize StoreStream: if we want to read only up to the original dataset size, we should construct an unprotected manifest from the protected one and use that as the seed for StoreStream. In fact, this allows the user to download either the ECC'ed dataset or the original one without doing anything special; we can easily add a flag to the REST API which will control that.
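The last suggestion can be illustrated with a minimal sketch. Everything here is hypothetical: the `Manifest` type is a simplified stand-in (block identifiers are plain strings rather than CIDs), the `unprotected` helper does not exist in the PR, and the sketch assumes the original data blocks come first in the block list, followed by the parity blocks.

```nim
type
  Manifest = object
    blocks: seq[string]   # block identifiers (stand-ins for real CIDs)
    blockSize: int        # size of every block in bytes
    originalBytes: int    # size of the original, unpadded dataset
    protected: bool       # true when ECC (parity) blocks are included
    originalLen: int      # number of data blocks before ECC was applied

# Hypothetical helper: derive an unprotected manifest from a protected one.
# Assumption: data blocks are stored first, parity blocks are appended after.
func unprotected(self: Manifest): Manifest =
  if not self.protected:
    return self
  Manifest(
    blocks: self.blocks[0 ..< self.originalLen],
    blockSize: self.blockSize,
    originalBytes: self.originalBytes,
    protected: false,
    originalLen: self.originalLen)

let eccManifest = Manifest(
  blocks: @["d0", "d1", "d2", "p0", "p1"],
  blockSize: 4,
  originalBytes: 10,
  protected: true,
  originalLen: 3)

# The derived manifest can seed the stream without any ECC special case.
let plain = eccManifest.unprotected()
```

With this shape the stream itself stays generic: the caller decides which manifest to hand over, and a REST flag would only need to select between `eccManifest` and `plain`.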
codex/manifest/manifest.nim
Outdated
@@ -25,8 +25,12 @@ import ./coders
func len*(self: Manifest): int =
  self.blocks.len

func size*(self: Manifest): int =
  self.blocks.len * self.blockSize
func originalBytesPadded*(self: Manifest): int =
Why not just add this as `unpaddedSize` in addition to `size`? It's overall cleaner and easier to follow. An alternative would be to add a `padded` flag to `size`, which would make it more consistent across the other primitives.
Let's add a `padded` flag to `size`; it should make it more consistent across the other primitives.
In old code, Manifest.len is a number of blocks, and Manifest.size is a number of bytes - these names don't look descriptive. So we named the new field originalBytes, and originalBytesPadded name follows the same approach.
I also thought about renaming `len` to blockNum, but since Manifest partially emulates the `seq` API, `len` still makes sense.
Now about padding. There are 3 lengths:
1. original filesize
2. original filesize padded to the block size
3. (available only in protected Manifests) size including ECC data, which is always padded
In order to make the API somewhat consistent, I proposed that StoreStream can return only file 1 or file 2 of those 3. The size of file 1 is returned by originalBytes, so with originalBytesPadded returning the size of file 2, I have everything needed to implement this concept.
If we sometimes need file 3 (original+parity data), we can implement the following API:
- StoreStream(pad=false) always returns file 1
- StoreStream(pad=true) returns file 2 for non-protected Manifest, and file 3 (+ECC) - for protected Manifest
And implement the following helper function:

    func bytes*(self: Manifest, pad = true): int =
      if pad:
        self.len * self.blockSize
      else:
        self.originalBytes

which computes how many bytes the corresponding StoreStream(Manifest, pad) will return.
> In old code, Manifest.len is a number of blocks, and Manifest.size is a number of bytes - these names don't look descriptive. So we named the new field originalBytes, and originalBytesPadded name follows the same approach.
`originalBytes` can be considered an internal (private/protected) field for all intents and purposes. To me `len` and `size` make sense because, as you noted, the manifest is akin to a `seq` in some sense.
> And implement the following helper function:
Yep, looks correct to me, and `bytes` makes as much sense as `size` - I'm fine either way 👍
> Now about padding. There are 3 lengths:
> - original filesize
> - original filesize padded to the block size
> - (available only in protected Manifests) size including ECC data which is always padded
>
> In order to make API somewhat consistent, I proposed that StoreStream can return only file 1 or file 2 of those 3. Size of file 1 is returned by originalBytes, so with originalBytesPadded returning size of file 2, I have everything to implement this concept.
I'm proposing that we make no distinction between protected and unprotected manifests and treat them the same:
- unprotected manifests will be either padded or unpadded
- protected manifests will always be padded - i.e. the `pad` flag has no effect
- if you only have the protected manifest and you want to read the unprotected data, it's always easy to construct an unprotected manifest
The advantage of this is that it reduces the need to handle corner cases and simplifies the problem considerably.
ok, so I will change the helper function, that computes how many bytes corresponding StoreStream(Manifest, pad) will return, to:

    func bytes*(self: Manifest, pad = true): int =
      if pad or self.protected:
        self.len * self.blockSize
      else:
        self.originalBytes

and it will cover all 3 cases:
- actual file size for unprotected manifest, pad=false
- padded file size for unprotected manifest, pad=true
- size of data+parity blocks for protected manifest, pad ignored
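As a quick check of the three cases above, here is a self-contained sketch. The `Manifest` type is a simplified stand-in with only the fields the helper needs (in the real code `len` is derived from the block list), and the numbers are made up for illustration: a 10-byte file with 4-byte blocks, plus two assumed parity blocks in the protected variant.

```nim
type
  Manifest = object
    len: int            # number of blocks (a plain field in this sketch)
    blockSize: int      # size of every block in bytes
    originalBytes: int  # size of the original, unpadded file
    protected: bool     # true when parity blocks are included

# `pad` is the parameter of the helper, not a field of the manifest.
func bytes(self: Manifest, pad = true): int =
  if pad or self.protected:
    self.len * self.blockSize
  else:
    self.originalBytes

# 10 bytes of data occupy 3 blocks of 4 bytes.
let unprotectedManifest = Manifest(
  len: 3, blockSize: 4, originalBytes: 10, protected: false)
# Same data plus 2 hypothetical parity blocks.
let protectedManifest = Manifest(
  len: 5, blockSize: 4, originalBytes: 10, protected: true)

doAssert unprotectedManifest.bytes(pad = false) == 10  # actual file size
doAssert unprotectedManifest.bytes(pad = true) == 12   # padded file size
doAssert protectedManifest.bytes(pad = false) == 20    # pad is ignored
```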
I made this change in 83fb9a4 and I will rename StoreStream to ManifestStream - afair, we agreed on that?
Commit 8ade55c failed first test attempt on "windows-amd64 (Nim version-1-2)" in this place:
but subsequent attempts went fine. It may be due to 'Manifest.verify', which now tests all manifests; some of them may be created incorrectly (not preserving originalBytes or having an incorrect originalLen).
436b14c to da0b8de
codex/manifest/manifest.nim
Outdated
let originalLen = (if self.protected: self.originalLen else: self.len)

if divUp(self.originalBytes, self.blockSize) != originalLen:
  raise newException(Defect, "Broken manifest: wrong originalBytes")
We shouldn't raise `Defect` here or anywhere where the data might come from the network; in this case, if a malformed manifest is sent to us, it will crash the application. I would use our standard `?!void` here with an appropriate error message.
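A rough sketch of the idea, reporting a malformed manifest as a value instead of raising `Defect`. The project's `?!void` type is not reproduced here; `VerifyResult` is a hypothetical hand-rolled stand-in, `divUp` is assumed to be ceiling division, the `Manifest` type is simplified, and the `blockSize` guard is an extra assumption added to avoid dividing by zero on untrusted input.

```nim
type
  Manifest = object
    len: int            # number of blocks (a plain field in this sketch)
    blockSize: int
    originalBytes: int
    protected: bool
    originalLen: int    # only meaningful for protected manifests

  # Hypothetical stand-in for the project's result type.
  VerifyResult = object
    ok: bool
    error: string

# Assumed semantics of divUp: integer division rounded up.
func divUp(a, b: int): int =
  (a + b - 1) div b

func verify(self: Manifest): VerifyResult =
  # Extra guard (not in the PR): untrusted input must not cause a division by zero.
  if self.blockSize <= 0:
    return VerifyResult(ok: false, error: "Broken manifest: non-positive blockSize")
  let originalLen = (if self.protected: self.originalLen else: self.len)
  if divUp(self.originalBytes, self.blockSize) != originalLen:
    return VerifyResult(ok: false, error: "Broken manifest: wrong originalBytes")
  VerifyResult(ok: true)

let good = Manifest(len: 3, blockSize: 4, originalBytes: 10)
let bad = Manifest(len: 3, blockSize: 4, originalBytes: 13)

doAssert good.verify().ok
doAssert not bad.verify().ok
```

The caller can then log or propagate `error` and drop the manifest, rather than letting a peer-supplied value terminate the process.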
Looks much better now 👍 I would take care of
fa654fb to 700a6f8
Note that the wire format of Manifest was changed, so we need to recreate all BlockStores
…fest, pad) will return
Also set originalBytes in each Manifest creation/update scenario
1. Instead of copy-pasting code from node.nim, the new test calls node.store() and node.retrieve() in order to check that they can correctly store and then retrieve data
2. The new test compares only file contents; manifest contents are considered an implementation detail
3. The new test chunks at an odd chunkSize=BlockSize/1.618 in order to ensure that data is retrieved correctly even when buffer sizes mismatch
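The mismatched-buffer idea from the test description can be sketched in isolation. This is not the PR's test: `rechunk` is a hypothetical helper that stands in for the store path, `BlockSize` is an assumed value, and the data is synthetic. It only shows that feeding data in chunks of `BlockSize/1.618` bytes and regrouping into full blocks preserves the contents.

```nim
const BlockSize = 4096  # assumed value, for illustration only

# Hypothetical helper: consume `data` in `chunkSize` pieces and emit
# blocks of `blockSize` bytes; the last block may be shorter.
proc rechunk(data: seq[byte], chunkSize, blockSize: int): seq[seq[byte]] =
  var buffer: seq[byte]
  var pos = 0
  while pos < data.len:
    let stop = min(pos + chunkSize, data.len)
    buffer.add data[pos ..< stop]
    pos = stop
    # Flush every complete block accumulated so far.
    while buffer.len >= blockSize:
      result.add buffer[0 ..< blockSize]
      buffer = buffer[blockSize ..< buffer.len]
  if buffer.len > 0:
    result.add buffer

# An odd chunk size that never lines up with block boundaries.
let chunkSize = int(float(BlockSize) / 1.618)

# Synthetic payload: three full blocks plus a partial one.
var data = newSeq[byte](3 * BlockSize + 123)
for i in 0 ..< data.len:
  data[i] = byte(i mod 251)

# Concatenating the blocks must give back the original bytes.
var restored: seq[byte]
for blk in rechunk(data, chunkSize, BlockSize):
  restored.add blk

doAssert restored == data
```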
700a6f8 to d95cdb8
Note that the wire format of Manifest was changed, so we need to recreate all BlockStores.
Closes #98.