Fix DAG size / usage calculation #56

Open
walkah opened this issue May 30, 2022 · 8 comments


@walkah
Member

walkah commented May 30, 2022

Currently, the directory usage calculation is incorrect. This means that we cannot, for example, implement file usage quotas or even get an accurate accounting of the disk space required for go-ipfs nodes.

See https://talk.fission.codes/t/file-metrics-and-quotas-part-2/1303 for context

@justindotpub

@walkah, some thoughts on this issue, perhaps for our 1:1 tomorrow...

What is the goal of this issue? Given a CID representing a UnixFS directory, return the number of bytes consumed by all files nested anywhere underneath the directory, after deduping? If so, are we referring to block level deduping, not file level? Are we talking about space consumed after encoding/compression, or raw data before encoding? Is UnixFS the wrong layer and we need something more general?

As it relates to WNFS, is the idea that we only want to know about the disk consumption for the contents of the latest version of the WNFS drive, for a metric that is easy for users to understand, not past content that is still being stored and might need to be recovered?

Has anyone tried ipfs dag stat CID? It uses go-merkledag/traverse.Traverse to sum up the size of block.RawData(), skipping duplicate (blocks?). Initial support for ipfs dag stat was added to go-ipfs in v0.7.0, released on 2020-09-22, and I wasn't sure what version Fission was running when the Discourse discussions took place at the end of 2020.
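
To make the "sum block sizes, skip duplicates" idea concrete, here is a minimal sketch in Python. It is not the go-merkledag code: the in-memory `store`, its CIDs, and the `dag_stat` helper are hypothetical stand-ins that only illustrate counting each distinct block once.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy block store: CID -> (raw block bytes, CIDs it links to).
Block = Tuple[bytes, List[str]]


def dag_stat(store: Dict[str, Block], root: str) -> Tuple[int, int]:
    """Return (size, num_blocks) for the DAG under root, visiting each CID once."""
    seen: Set[str] = set()
    stack = [root]
    size = 0
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue  # duplicate block: already counted
        seen.add(cid)
        raw, links = store[cid]
        size += len(raw)
        stack.extend(links)
    return size, len(seen)


# Two directories that both link to the same leaf block.
store: Dict[str, Block] = {
    "root": (b"r" * 10, ["dirA", "dirB"]),
    "dirA": (b"a" * 20, ["leaf"]),
    "dirB": (b"b" * 20, ["leaf"]),
    "leaf": (b"x" * 100, []),
}

print(dag_stat(store, "root"))  # (150, 4): the shared leaf is counted once
```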

Is there a simple test that demonstrates the problem with the current DAG size calculation? My hope was that ipfs dag stat CID could be used, but I'm still deciphering what the value returned there means exactly and how to compare it to contrived examples on my local filesystem.

$ mkdir test1
$ echo hello > test1/file.txt
$ du -s test1 
8       test1

$ CID=$(ipfs add -r -Q test1)
$ ipfs dag stat $CID                                          
Size: 68, NumBlocks: 2
# What does this size represent?  Bytes, but of what?

# My own modified dag stat that uses nodestats.DataSize.
$ ../go-ipfs-dag-size/cmd/ipfs/ipfs dag stat $CID                                          
Size: 14, NumBlocks: 2
# 2x what I get from the shell.  What is DataSize?
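
One possible reading of the two numbers above, offered as a sketch rather than a confirmed answer: assuming `ipfs add` ran with its defaults at the time (CIDv0, dag-pb nodes, UnixFS-wrapped leaves; the flags are not shown, so this is an assumption), the two blocks can be re-encoded by hand. The helper names below are illustrative only. Under those assumptions the raw block lengths add up to 68 and the UnixFS Data payloads alone add up to 14.

```python
import hashlib


def varint(n: int) -> bytes:
    """Protobuf base-128 varint encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def bytes_field(field_no: int, payload: bytes) -> bytes:
    """Length-delimited protobuf field (wire type 2)."""
    return varint((field_no << 3) | 2) + varint(len(payload)) + payload


def varint_field(field_no: int, value: int) -> bytes:
    """Varint protobuf field (wire type 0)."""
    return varint(field_no << 3) + varint(value)


content = b"hello\n"  # what `echo hello` writes

# UnixFS Data message for the file: Type=File(2), Data, filesize.
file_data = varint_field(1, 2) + bytes_field(2, content) + varint_field(3, len(content))
file_block = bytes_field(1, file_data)  # dag-pb PBNode with only a Data field

# CIDv0 is a bare sha2-256 multihash: 0x12 (sha2-256), 0x20 (32 bytes), digest.
multihash = b"\x12\x20" + hashlib.sha256(file_block).digest()

# PBLink: Hash, Name, Tsize (cumulative size of the linked sub-DAG).
link = bytes_field(1, multihash) + bytes_field(2, b"file.txt") + varint_field(3, len(file_block))

dir_data = varint_field(1, 1)  # UnixFS Data message: Type=Directory(1)
dir_block = bytes_field(2, link) + bytes_field(1, dir_data)

print(len(file_block), len(dir_block), len(file_block) + len(dir_block))  # 14 54 68
print(len(file_data), len(dir_data), len(file_data) + len(dir_data))      # 12 2 14
```

If the assumptions hold, `Size: 68` is the sum of the encoded block bytes (file content plus protobuf framing and the directory's link), while the `DataSize`-based figure of 14 counts only each node's Data field.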

As it relates to general issues related to size calculation in go-ipfs, I did see ipfs#5690, which is still broken on master as of today.

I'll continue exploring but I would welcome any background knowledge you might have that I'm missing.

@bmann
Member

bmann commented Jun 6, 2022

@justincjohnson here’s my example file system, which shows up as a 1.4PB file system in the public tree: https://boris.files.fission.name/

So this is specifically about WNFS-formatted IPFS blocks: return how much is stored. We do store versions, but this is deduped at the block level.

@justindotpub

Thanks for the feedback @bmann. According to ipfs dag stat your drive is currently taking up about 4.3G. I'll look into the web ui code to see where it's getting 1.4PB from.

$ ipfs dag stat bafybeidv23lhcox2mcys63nhhd5g7fyitotfb4wikcpfsuszpaeczayjfm
Size: 4601236243, NumBlocks: 38203
$ numfmt --to iec-i --format "%8.4f" 4601236243
4.2853Gi

@bmann
Member

bmann commented Jun 7, 2022

That sounds closer to correct @justincjohnson!

That’ll be a bug in the gateway code, so it would be good to file.

The other half of this is running this command regularly / in some way to store in our DB for the user. That’ll need some work to scope; James linked to the Talk post with background.

@justindotpub

Sounds good. This specific issue is about making sure go-ipfs gives us the correct numbers, so any other work can be scoped in a separate issue.

@walkah
Member Author

walkah commented Jun 7, 2022

First off: this is super valuable digging. 🙏 thank you Justin! It sounds like dag stat is actually what we're looking for (or certainly close enough).

I would say let's treat this issue as:

  • Get a PR upstream that fixes the go-ipfs http browser (what do they call that UI?) to report the dag stat sizing.

As for this:

"The other half of this is running this command regularly / in some way to store in our DB for the user. That’ll need some work to scope; James linked to the Talk post with background."

I did a little bit of digging - we don't need to run a command at all. We already do this in the fission-server code; I think we're just not doing it quite right. When we receive updates (new CIDs) we call IPFS stat (for example, on an app publish here: https://github.com/fission-suite/fission/blob/a0b46415e1e8b858aa6c6e503a7f2943a14e8218/fission-web-server/library/Fission/Web/Server/Types.hs#L753 ) and then we store the resulting size value in the DB for the app.

(Metabase, e.g., is just pulling values directly from a read only follower of the production database).

If I'm reading the code right (always a big "if"), that code calls this HTTP endpoint: https://docs.ipfs.io/reference/http/api/#api-v0-object-stat
(The docs explicitly mark it as deprecated. Oops!)

The docs recommend files/stat instead - https://docs.ipfs.io/reference/http/api/#api-v0-files-stat

@justincjohnson it might be worth a little investigation into the difference between files stat vs dag stat?

As soon as I hit "comment" here, I'll open an issue against the fission-server to update which endpoint we're pulling stat data from.
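
For comparing the three endpoints side by side, a rough Python sketch follows. It assumes a local go-ipfs daemon on the default API address, assumes `files/stat` is given an `/ipfs/<cid>` path while the other two take the bare CID, and assumes `dag/stat` may stream several JSON lines with the last one holding the totals. The helper names are made up for illustration, and the response fields may differ between go-ipfs versions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "http://127.0.0.1:5001/api/v0"  # assumed default local API address


def stat_url(endpoint: str, cid: str, api: str = API) -> str:
    """Build the request URL for object/stat, files/stat or dag/stat."""
    # Assumption: files/stat is passed an /ipfs/<cid> path, the others a CID.
    arg = f"/ipfs/{cid}" if endpoint == "files/stat" else cid
    return f"{api}/{endpoint}?{urlencode({'arg': arg})}"


def last_json_line(body: str) -> dict:
    """Parse the final non-empty line of a (possibly streamed) JSON response."""
    lines = [line for line in body.splitlines() if line.strip()]
    return json.loads(lines[-1])


def fetch_stat(endpoint: str, cid: str) -> dict:
    """POST to the endpoint and return the decoded result (needs a running daemon)."""
    request = Request(stat_url(endpoint, cid), method="POST")
    with urlopen(request) as response:
        return last_json_line(response.read().decode())


def compare(cid: str) -> None:
    """Print what each endpoint reports for the same CID."""
    for endpoint in ("object/stat", "files/stat", "dag/stat"):
        print(endpoint, fetch_stat(endpoint, cid))
```

Calling `compare(cid)` against a running node with the test1 CID from earlier in the thread would show whether the three endpoints agree on size for the same content.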

@justindotpub

An update on the status of this issue...

The work to update the Fission server to use the /dag/stat API is being tracked in #603.

This current issue is staying open to track the following.

  • The DAG stat code uses go-merkledag/traverse.Traverse with SkipDuplicates set to true to traverse the DAG. This can be quite slow, even after all blocks have been downloaded. There may be opportunities to improve the runtime traversal performance. Explore this.
  • What feels like a better solution, if possible, is to pre-compute a deduplicated size value such that just the CID in question can be downloaded and the deduped size can be accessed directly from that, thus avoiding block downloads and DAG traversals. I need to continue experimenting to understand more about IPLD metadata and the possibility / impact of adding more fields. This may be a dead end but it feels worth exploring.
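
One thing a pre-computed size field would have to account for, sketched here in Python on a hypothetical toy DAG: a per-node cumulative size (own bytes plus the children's cumulative sizes) overcounts as soon as sub-DAGs share blocks, so a deduplicated total cannot be obtained by simply adding child values together. All names and data below are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy block store: CID -> (raw block bytes, CIDs it links to).
Block = Tuple[bytes, List[str]]

store: Dict[str, Block] = {
    "root": (b"r" * 10, ["v1", "v2"]),
    "v1": (b"a" * 20, ["shared"]),
    "v2": (b"b" * 20, ["shared"]),  # e.g. a second version reusing a block
    "shared": (b"x" * 100, []),
}


def cumulative_size(cid: str) -> int:
    """Own bytes plus the cumulative size of every link; shared blocks count repeatedly."""
    raw, links = store[cid]
    return len(raw) + sum(cumulative_size(child) for child in links)


def deduped_size(root: str) -> int:
    """Total bytes of the distinct blocks reachable from root."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid not in seen:
            seen.add(cid)
            stack.extend(store[cid][1])
    return sum(len(store[cid][0]) for cid in seen)


print(cumulative_size("root"))  # 250: the shared block is counted twice
print(deduped_size("root"))     # 150: each block is counted once
```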

@justindotpub

I have no current plans to come back to this. Unassigning for now.

@justindotpub justindotpub removed their assignment Oct 18, 2022
3 participants