Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download files without padding #218

Merged
merged 15 commits into from
Aug 24, 2022
Merged

Conversation

Bulat-Ziganshin
Copy link
Contributor

Note that the wire format of Manifest was changed, so we need to recreate all BlockStores.

Closes #98.

@Bulat-Ziganshin
Copy link
Contributor Author

Bulat-Ziganshin commented Aug 18, 2022

To do:

  • tests
  • remove RabinChunker
  • use [padded] original filesize for StoreStream.size
  • u64 first in readBytes = min([nbytes - read, self.manifest.blockSize - blockOffset, self.size - self.offset])
  • replace emptyBlock with zeroMem
  • manifest encode/decode should check originalBytes~=blockSize*originalLen and originalLen ~= K/(K+M)*len(blocks)
  • test that node.store+retrieve indeed returns an original file [+padding], even for protected Manifest

@Bulat-Ziganshin Bulat-Ziganshin force-pushed the 98-no-more-padding-on-download branch 2 times, most recently from d128fd4 to 6d3f6d0 Compare August 19, 2022 10:46
Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for the most part, but as I noted elsewhere, we should not overspecialize StoreStream/ManifestStream. We don't have special cases for ECC anywhere in the application besides the actual erasure coding pice itself and we should keep it like this.

This change will break our current PoR as it requires generating authenticators for all the blocks, including the ECC blocks. Even if that changes we loose symmetry in the primitive and it becomes less useful overall.

I would suggest we do the following:

  • StoreStream should allow retrieving both padded and unpadded blocks
  • StoreStream should allow reading all the data in the manifest, regardless of wether it's protected or not
  • If we incorporate the change I suggested above with size(pad=true/false) instead of originalBytesPadded, we'll make the api more consistent
  • Finally, we should not specialize StoreStream, if we want to read only up to the original dataset size, we should construct an unprotected manifest from the protected one and use that as the seed for StoreStream. In fact, this allows the user to download either the ECC'ed dataset or the original one without doing anything special, we can easily add a flag to the rest api which will control that.

@@ -25,8 +25,12 @@ import ./coders
func len*(self: Manifest): int =
self.blocks.len

func size*(self: Manifest): int =
self.blocks.len * self.blockSize
func originalBytesPadded*(self: Manifest): int =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just add this as unpaddedSize in addition to size? It's overall cleaner and easier to follow. An alternative would be to add a padded flag to size which whould make it more consistent across the other primitives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add padded flag to size it should make it more consistent across the other primitives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In old code, Manifest.len is a number of blocks, and Manifest.size is a number of bytes - these names don't look descriptive. So we named the new field originalBytes, and originalBytesPadded name follows the same approach.

I also thought about renaming len to blockNum, but since Manifest partially emulates seq API, len still makes sense.

Now about padding. There are 3 lengths:

  1. original filesize
  2. original filesize padded to the block size
  3. (available only in protected Manifests) size including ECC data which is always padded

In order to make API somewhat consistent, I proposed that StoreStream can return only file 1 or file 2 of those 3. Size of file 1 is returned by originalBytes, so with originalBytesPadded returning size of file 2, I have everything to implement this concept.

If we sometimes need file 3 (original+parity data), we can implement the following API:

  1. StoreStream(pad=false) always returns file 1
  2. StoreStream(pad=true) returns file 2 for non-protected Manifest, and file 3 (+ECC) - for protected Manifest

And implement the following helper function:

func bytes*(self: Manifest, pad = true): int =
  if self.pad:
    self.len * self.blockSize
  else:
    self.originalBytes

that computes how much bytes corresponding StoreStream(Manifest, pad) will return

Copy link
Contributor

@dryajov dryajov Aug 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In old code, Manifest.len is a number of blocks, and Manifest.size is a number of bytes - these names don't look descriptive. So we named the new field originalBytes, and originalBytesPadded name follows the same approach.

originalBytes can be considered an internal (private/protected) field for all intents and purposes. To me len and size make sense because as you noted the manifest is akin to a seq in some sense.

And implement the following helper function:

Yep, looks correct to me and bytes makes as much sense as size - I'm either way 👍

Now about padding. There are 3 lengths:

  1. original filesize
  2. original filesize padded to the block size
  3. (available only in protected Manifests) size including ECC data which is always padded

In order to make API somewhat consistent, I proposed that StoreStream can return only file 1 or file 2 of those 3. Size of file 1 is returned by originalBytes, so with originalBytesPadded returning size of file 2, I have everything to implement this concept.

I'm proposing that we make no distinction between protected and unprotected manifests and treat them the same:

  • unprotected manifests will be either padded or unpadded
  • protected manifests will be always padded - i.e. the pad flag has no effect
  • if you only have the protected manifest and you want to read the unprotected data it's always easy to construct an unprotected manifest

The advantage of this is that it reduces the need to handle corner cases and simplifies the problem considerably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so I will change the helper function, that computes how many bytes corresponding StoreStream(Manifest, pad) will return, to:

func bytes*(self: Manifest, pad = true): int =
  if self.pad or self.protected:
    self.len * self.blockSize
  else:
    self.originalBytes

and it will cover all 3 cases:

  1. actual file size for unprotected manifest, pad=false
  2. padded file size for unprotected manifest, pad=true
  3. size of data+parity blocks for protected manifest, pad ignored

Copy link
Contributor Author

@Bulat-Ziganshin Bulat-Ziganshin Aug 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change in 83fb9a4

@Bulat-Ziganshin
Copy link
Contributor Author

and I will rename StoreStream to ManifestStream - afair, we agreed on that?

@Bulat-Ziganshin
Copy link
Contributor Author

Commit 8ade55c failed first test attempt on "windows-amd64 (Nim version-1-2)" in this place:

[Suite] On-Chain Market
  [OK] fails to instantiate when contract does not have a signer
  [OK] knows signer address
  [OK] supports storage requests
  [OK] sets client address when submitting storage request
  [OK] can retrieve previously submitted requests
  [OK] supports request subscriptions
Traceback (most recent call last, using override)
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

but subsequent attempts went fine. It may be due to 'Manifest.verify' which now tests all manifests, some of them may be created incorrectly (not preserving originalBytes or having incorrect originalLen).

@Bulat-Ziganshin Bulat-Ziganshin force-pushed the 98-no-more-padding-on-download branch 2 times, most recently from 436b14c to da0b8de Compare August 21, 2022 22:16
let originalLen = (if self.protected: self.originalLen else: self.len)

if divUp(self.originalBytes, self.blockSize) != originalLen:
raise newException(Defect, "Broken manifest: wrong originalBytes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't raise Defect here or anywhere where the data might come from the network, in this case, if a malformed manifest is sent to us it will crash the application. I would use our standard ?!void here with an appropriate error message.

@dryajov
Copy link
Contributor

dryajov commented Aug 22, 2022

Looks much better now 👍

I would take care of fix a bug (blockSize and other fields aren't preserved) in in a different PR.

Note that the wire format of Manifest was changed, so we need to recreate all BlockStores
Also set originalBytes in each Manifest creation/update scenario
1. Instead of copy-pasting code from node.nim, new test calls node.store() and node.retrieve() in order to check that they can correctly store and then retrieve data
2. New test compares only file contents, manifest contents considered an implementation detail
3. New test chunks at odd chunkSize=BlockSize/1.618 in order to ensure that data retrieved correctly even when buffer sizes mismatch
@Bulat-Ziganshin Bulat-Ziganshin marked this pull request as ready for review August 24, 2022 11:58
@Bulat-Ziganshin Bulat-Ziganshin changed the title [WIP] Download files without padding Download files without padding Aug 24, 2022
@Bulat-Ziganshin Bulat-Ziganshin merged commit f24ded0 into main Aug 24, 2022
@Bulat-Ziganshin Bulat-Ziganshin deleted the 98-no-more-padding-on-download branch August 24, 2022 12:16
@Bulat-Ziganshin Bulat-Ziganshin self-assigned this Aug 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove padding during retrieval via REST API
2 participants