Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable apparent-size in du command of get_size_on_disk #6702

Merged
merged 6 commits into from
Jan 13, 2025

Conversation

GeigerJ2
Copy link
Contributor

In #6584 the get_size_on_disk method was added to RemoteData, which calls the du utility as the default way to obtain the size of an associated file/directory, using the -s (summary) and --bytes option. As discovered in #6696, however, the relevant tests started failing when the ubuntu-latest environment of GHA got updated to point to Ubuntu 24.04 rather than 22.04.

As it turns out, --bytes does not only give the output in bytes, but is actually equivalent to --apparent-size --block-size=1. I'm not sure what exactly has changed between the different Ubuntu versions, if it's the behavior of the --apparent-size flag of du, or the way the apparent size is determined by the file system. Nonetheless, using apparent size makes the output fragile, as seen from the failing tests. Thus, I now changed the command from du -s --bytes to du -s --block-size=1. This also leads to the obtained file sizes for the different test cases to be larger (as expected), so these are also updated.

@GeigerJ2 GeigerJ2 requested a review from unkcpz January 13, 2025 15:49
Copy link

codecov bot commented Jan 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.99%. Comparing base (15b5caf) to head (cf9b884).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6702      +/-   ##
==========================================
+ Coverage   77.99%   77.99%   +0.01%     
==========================================
  Files         563      563              
  Lines       41761    41762       +1     
==========================================
+ Hits        32567    32570       +3     
+ Misses       9194     9192       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@unkcpz
Copy link
Member

unkcpz commented Jan 13, 2025

Thanks, I assume the different behavior can be tracked down to following change in coreutils, see https://fossies.org/linux/coreutils/ChangeLog

 3377 2023-03-13  Pádraig Brady  <[email protected]>
 3378 
 3379 	tests: adjust du test for recent apparent size change
 3380 	* tests/du/threshold.sh: Directories are assumed to be
 3381 	of size 0 with --apparent since commit v9.1-187-g110bcd283
 3382 	so remove --apparent cases from this test.

It seems the --apparent-size give a more realistic size of file you created rather than the block size that can depend on the FS. But not sure how to use the combination of options to make the output consistent between different du version.

Copy link
Collaborator

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for figuring this out!
Just one ask for an explanatory comment in the code (i.e. basically what you write in a PR description).

.github/workflows/ci-code.yml Outdated Show resolved Hide resolved
src/aiida/orm/nodes/data/remote/base.py Show resolved Hide resolved
@GeigerJ2
Copy link
Contributor Author

Interesting, thanks for digging this up, @unkcpz!

It seems the --apparent-size give a more realistic size of file you created rather than the block size that can depend on the FS. But not sure how to use the combination of options to make the output consistent between different du version.

Yeah, the apparent size is more realistic in that sense, as it only captures the actual data. Nonetheless, on your file system, even a file with less content will occupy the entire block, so I feel like it's not really that useful in the end (the other option, stat does actually give the apparent size). Now, the issue is still that block sizes might differ, though, 4096 bytes should be the default for Linux (ext4), and also macOS, so possibly it's fine to just go with that?

@danielhollas
Copy link
Collaborator

Nonetheless, on your file system, even a file with less content will occupy the entire block, so I feel like it's not really that useful in the end.
I agree, ultimately we want to use options that are useful for the users, and not based in on the test suite. The users of --get_size_on_disk are interested in the actual space that things take right? And probably they are interested the largest files anyway, where the difference should be smaller.

If the tests can't be stabilized, perhaps then we need to modify the tests, not the implementation...

@unkcpz
Copy link
Member

unkcpz commented Jan 13, 2025

Nonetheless, on your file system, even a file with less content will occupy the entire block, so I feel like it's not really that useful in the end (the other option, stat does actually give the apparent size). Now, the issue is still that block sizes might differ, though, 4096 bytes should be the default for Linux (ext4), and also macOS, so possibly it's fine to just go with that?

From users' point of view, I think using just the block size as current implementation can be better. The actual size may not matter too much but the size transfer over internet (by buffer size I think usually the default is 4kb) or the size occupy on the disk which mostly are computed by block. Thus I len to the changes of this PR.

Copy link
Member

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor request on docstring, otherwise all good for me.

src/aiida/orm/nodes/data/remote/base.py Outdated Show resolved Hide resolved
@GeigerJ2
Copy link
Contributor Author

If the tests can't be stabilized, perhaps then we need to modify the tests, not the implementation...

@danielhollas I'd say the actual problem was the implementation :D But this whole file size business is pretty annoying. We spent quite some time discussing this in the office already.

Anything else to add from your side?

Copy link
Collaborator

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you for the extensive comment! Just one quip about the double meaning of "block size", but not a blocker.

src/aiida/orm/nodes/data/remote/base.py Outdated Show resolved Hide resolved
@danielhollas
Copy link
Collaborator

Ah, I see that the tests are already failing for the last commit on main so this should just go in I think.

@GeigerJ2 GeigerJ2 merged commit 2da3f96 into aiidateam:main Jan 13, 2025
10 checks passed
@GeigerJ2 GeigerJ2 deleted the fix/du-tests branch January 13, 2025 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants