Remove padding during retrieval via REST API #98
Hello 👋 The only problem I found is that the manifest constructor takes a sequence of blocks, and this sort of change would make a mess of the constructor (since the sequence would have to turn into a tuple). Personally, my recommendation would be to drop the cid sequence from the constructor in favor of a more explicit API. Is the cid part of the constructor strictly needed?
Hey @acud! Welcome! So, there is already a

First, if you look at the original manifest definition:

```nim
type
  Manifest* = ref object of RootObj
    rootHash*: ?Cid        # root (tree) hash of the contained data set
    blockSize*: int        # size of each contained block (might not be needed if blocks are len-prefixed)
    blocks*: seq[Cid]      # block Cids
    version*: CidVersion   # Cid version
    hcodec*: MultiCodec    # Multihash codec
    codec*: MultiCodec     # Data set codec
    case protected*: bool  # Protected datasets have erasure coded info
    of true:
      K*: int              # Number of blocks to encode
      M*: int              # Number of resulting parity blocks
      originalCid*: Cid    # The original Cid of the dataset being erasure coded
      originalLen*: int    # The length of the original manifest
    else:
      discard
```

You'll see that if the manifest is

Second, on upload the stream is fed to the chunker, which splits the incoming stream into equal-sized blocks. The chunker operates in two modes, with and without padding. The default is to pad, but since we want to count all the bytes as they come, we need to make sure that the last chunk/block is unpadded and pad it ourselves. A better alternative would be to allow the chunker to report the unpadded size it has read up until some point (or EOF) and set that on the manifest. So, to summarize:
@dryajov personally I would try to leave the chunker out of it, since ideally you could capture the value directly from the data stream of the HTTP request. The chunker may generate more data due to erasure coding, but the underlying data-layer subtree node spans remain the same (read: leaking stuff from the chunker would actually not give you any added value). Also, it would result in some ugly tuple return values from the chunker, which is not so nice to interact with. When looking at
The problem is that it is handled too high up in the stack, and the context does not have access to the actual manifest (which is created in
The HTTP api/interface is just one of many ways of invoking the underlying api exposed by the node; it is a very thin layer that should be isolated from the remainder of the stack, because it can be switched out for
The chunker is the only thing that sits between the incoming stream and the padded blocks being produced by it. The only other way to do this, without making significant changes to the architecture, is to do the padding manually and disable it on the chunker (it's just a flag). But there is really no reason to do that, and architecturally there is zero cost to adding an accessor to the chunker that reports the amount of bytes consumed so far. This should be a simple method/func on the chunker called something like
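To make the idea concrete, here is a minimal sketch (in Python, not the actual Nim codebase; all names are hypothetical) of a chunker that zero-pads the short final block while keeping a counter of the real, unpadded bytes it has consumed — the value the accessor proposed above would expose:

```python
import io

class Chunker:
    """Toy chunker: fixed-size blocks, zero-padded last block, byte counter."""

    def __init__(self, stream, block_size):
        self.stream = stream
        self.block_size = block_size
        self.bytes_read = 0  # unpadded bytes consumed so far (the proposed accessor)

    def __iter__(self):
        while True:
            chunk = self.stream.read(self.block_size)
            if not chunk:
                return
            self.bytes_read += len(chunk)  # count before padding
            if len(chunk) < self.block_size:
                # pad the short final chunk up to a full block
                chunk += b"\x00" * (self.block_size - len(chunk))
            yield chunk

chunker = Chunker(io.BytesIO(b"hello world"), block_size=8)
blocks = list(chunker)
# two full 8-byte blocks come out, but only 11 real bytes went in
```

The counter costs nothing per chunk, and the manifest can read it once at EOF instead of the chunker returning tuples.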
There is a very good reason for not doing this: as I mentioned, it makes the entire architecture less modular and more leaky.
Not sure I follow; from what you're describing, this is already more or less how it works. The REST layer hands a stream to the node's
@acud I've gone ahead and added PR #110 to alleviate some of the glue code required to pump data between the http layer and the node, and especially to avoid unnecessary copying (one less buffer allocation). Also, as per our discussion, it might make sense to create a new type of stream that further augments the existing

As another option, I still kinda like exposing this information on the chunker as well. Considering everything else, it is the least costly and invasive option, and it doesn't feel out of place; after all, the chunker reads data, and there is no reason not to make some stats around that available.
I looked into the code. My proposal:
I think that's all the changes we need. Note that BlockStore will still store the last block padded to blockSize, and all operations on blocks will use those padded blocks (but they will be filled up with zeros rather than garbage!)
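The retrieval side of this proposal is tiny: blocks stay zero-padded in the store, and the original byte count recorded at upload time tells the API where to cut. A hypothetical sketch (Python, illustrative names only):

```python
def retrieve(blocks, original_len):
    """Join the padded stored blocks and cut at the recorded original length."""
    data = b"".join(blocks)       # padded dataset, exactly as the store holds it
    return data[:original_len]    # drop the zero padding of the last block

# an 11-byte file stored as two 8-byte blocks, last one zero-padded
padded = [b"hello wo", b"rld" + b"\x00" * 5]
# retrieve(padded, 11) returns the original bytes without padding
```

Because the padding is guaranteed to be zeros, nothing else in the pipeline needs to distinguish padded from unpadded blocks.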
I think this introduces too many changes for something that can be accomplished with a very small change to the chunker itself (add a

For one, StoreStream should work on the manifest dataset, not the original file; otherwise we'd have to account for padding manually, which is going to be a pain. The only place where the original file size matters is on download: if we store the original size in the manifest, it should be really easy to simply read up to the

Btw, we don't have to return the

I agree that we can remove rabin chunking (in fact, it's just a placeholder for now), but that doesn't merit getting rid of the chunker altogether; it's also useful in tests, for example.
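The "read up to the stored size" idea can be sketched as a stream whose reads are capped at the original length from the manifest (a Python toy, not the real StoreStream; names are hypothetical):

```python
class CappedStream:
    """Toy read-only stream over padded blocks that stops at `size` bytes."""

    def __init__(self, blocks, size):
        self.data = b"".join(blocks)  # padded dataset, as stored
        self.size = size              # original (unpadded) length from the manifest
        self.pos = 0

    def read(self, n):
        n = min(n, self.size - self.pos)  # never read into the padding
        out = self.data[self.pos:self.pos + n]
        self.pos += n
        return out

stream = CappedStream([b"hello wo", b"rld" + b"\x00" * 5], size=11)
out = b""
while chunk := stream.read(4):
    out += chunk
```

The download path never sees padding, while every other consumer keeps working on whole padded blocks.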
Yeah, I didn't realize that Chunker is used in other tests, not just tested by itself. So let's keep it. StoreStream is currently used for
I vote for adding a StoreStream.new parameter,
Yeah, the existing

Funny that it may be the actual reason for the slow downloading in your current test!
Agree.
Yeah, this is definitely a band-aid solution to prefetch some blocks, but I'm not sure if it actually makes any difference... Something to try out.
Yeah, I think this is probably the best approach right now; let's do this. Let's just have a default parameter called
So, we have two independent tasks. I think we agreed upon the implementation of unpadding:
And for prefetching, we should modify fetchBlocksJob to push data into a BufferStream with a fixed-size queue. Also,

Time to make a new issue.
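The fixed-size-queue prefetch can be sketched like this (Python's asyncio.Queue standing in for BufferStream; all names here are hypothetical, not the nim-codex API):

```python
import asyncio

async def prefetch(block_ids, fetch, depth=2):
    """Fetch blocks ahead of the consumer, at most `depth` blocks in flight."""
    queue = asyncio.Queue(maxsize=depth)  # fixed-size queue bounds readahead

    async def producer():
        for bid in block_ids:
            await queue.put(await fetch(bid))  # backpressure: blocks when full
        await queue.put(None)                  # sentinel: end of stream

    task = asyncio.create_task(producer())
    out = []
    while (blk := await queue.get()) is not None:
        out.append(blk)
    await task
    return out

async def demo():
    async def fetch(i):              # stand-in for a network block fetch
        return f"block-{i}".encode()
    return await prefetch(range(3), fetch)

result = asyncio.run(demo())
```

The bounded queue is what keeps memory flat: the producer stalls whenever the consumer falls `depth` blocks behind.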
Yep, sounds good, and I like
I don't think this is what we want, at least not in this context;
For erasure decoding, we probably want a more intelligent downloader that fetches in recovery groups and kicks in recovery as soon as it has got any
I should add that I'm not entirely sure if the prefetch is required, but it seemed like a good thing to try out originally; we should definitely experiment with this.
But the current NetworkStore.getBlock has only this code:

```nim
let blk = await self.engine.requestBlock(cid)
# TODO: add block to the local store
return blk.some.success
```

so implementing the TODO may be the shortest path toward real prefetching.
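That TODO amounts to a write-through cache: check the local store first, and keep a copy of anything fetched from the network. A hypothetical Python sketch of the shape (not the real NetworkStore API):

```python
class NetworkStore:
    """Write-through sketch: fetch from the network, then keep a local copy."""

    def __init__(self, local, engine):
        self.local = local    # dict standing in for the local block store
        self.engine = engine  # callable standing in for engine.requestBlock

    def get_block(self, cid):
        if cid in self.local:   # already cached locally
            return self.local[cid]
        blk = self.engine(cid)  # fetch from the network
        self.local[cid] = blk   # the TODO: store the block locally
        return blk

calls = []
def engine(cid):                # records each network fetch
    calls.append(cid)
    return f"data-{cid}"

store = NetworkStore({}, engine)
first = store.get_block("a")
second = store.get_block("a")   # served from the local store, no network hit
```

With this in place, a prefetcher only has to call get_block ahead of time; subsequent reads find the blocks locally.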
There is a small detail, however. The current layout of recovery groups is that the first group covers blocks 0, X, X*2, ..., and the second group covers blocks 1, X+1, X*2+1, ... Your idea assumes the "layout B" that we discussed in codex-storage/codex-research#85 (comment).
Yes, but the engine has a reference to the blockstore and when the block eventually arrives, it will be stored here - https://github.com/status-im/nim-codex/blob/main/codex/blockexchange/engine/engine.nim#L256
Maybe. I haven't thought about this in some time, but it is definitely another layer that we might want to consider when making the choice for the layout.
You won :) I made a new issue for the discussion of prefetching.

As for this issue, we agreed on the implementation, but I will wait until you and Michael push your PRs.
The chunker pads chunks that are smaller than BlockSize. Currently, when a CID is retrieved via the REST API, any padding is included. This can be seen when, e.g., uploading a small text file via a node's REST API and then retrieving that file.
The original size of the upload should be tracked as metadata (in the manifest?) so that any padding can be removed during retrieval.