
Add resolution dependent chunk sizing to 'modis_l1b' reader #2052

Merged: 17 commits merged into pytroll:main on Oct 5, 2023

Conversation

@djhoese (Member) commented Mar 11, 2022

This pull request is my attempt to do something similar to #1980, using dask's auto chunk sizing in a reader. However, this PR takes a different approach: it chooses chunk sizes based on what auto-chunking would produce for the 250m version of a dataset. The idea is that a user is probably going to upsample/interpolate the data to a higher resolution at some point, especially for 5km data. I've been doing a lot of other optimizations with my polar2grid tool and python-geotiepoints, so I don't have hard numbers on whether this feature makes a difference over hardcoding the chunk size for all variables. I'll try taking a look at that in the future if needed.
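
Roughly, the idea is the following (a sketch only; the helper name and details are illustrative, not the actual reader code):

```python
import dask.array as da

def _chunks_scaled_from_250m(shape_1km, dtype, factor=4):
    # Ask dask what "auto" chunking would be for the (hypothetical)
    # 250m version of this 1km array, then scale the chunk size back
    # down so chunks across resolutions line up after interpolation.
    shape_250m = tuple(s * factor for s in shape_1km)
    chunks_250m = da.core.normalize_chunks("auto", shape=shape_250m, dtype=dtype)
    return tuple(c[0] // factor for c in chunks_250m)
```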

@mraspaud I wanted to open this to get your feedback on how much you hate it, like it, or want changes to what I'm doing.

  • Closes #xxxx
  • Tests added
  • Fully documented

@djhoese added the enhancement, component:readers, and optimization labels Mar 11, 2022
@djhoese djhoese requested a review from mraspaud as a code owner March 11, 2022 01:38
@djhoese djhoese self-assigned this Mar 11, 2022
@djhoese (Member, Author) commented Mar 11, 2022

The tests are failing because they depend on pytroll/python-geotiepoints#35.

@mraspaud (Member) left a comment

Thanks for this work! I'm a bit worried that CHUNK_SIZE disappears completely, as it is still the standard way to steer the chunk size in satpy.
Other than that, I have a bunch of questions inline.

Inline review threads on satpy/readers/hdfeos_base.py and satpy/readers/modis_l1b.py (resolved).
@djhoese djhoese requested a review from pnuu as a code owner September 27, 2023 19:09
codecov bot commented Sep 28, 2023

Codecov Report

Merging #2052 (2036251) into main (95eb01a) will increase coverage by 0.00%.
Report is 5 commits behind head on main.
The diff coverage is 86.17%.

@@           Coverage Diff            @@
##             main    #2052    +/-   ##
========================================
  Coverage   94.91%   94.92%            
========================================
  Files         351      350     -1     
  Lines       51215    51085   -130     
========================================
- Hits        48611    48491   -120     
+ Misses       2604     2594    -10     
| Flag | Coverage Δ |
| --- | --- |
| behaviourtests | 4.28% <6.38%> (-0.01%) ⬇️ |
| unittests | 95.54% <97.50%> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
| --- | --- |
| satpy/_compat.py | 57.14% <100.00%> (-8.17%) ⬇️ |
| satpy/_config.py | 96.59% <100.00%> (+5.01%) ⬆️ |
| satpy/readers/modis_l1b.py | 65.04% <100.00%> (ø) |
| satpy/resample.py | 79.80% <100.00%> (+0.49%) ⬆️ |
| .../tests/reader_tests/modis_tests/_modis_fixtures.py | 97.61% <100.00%> (ø) |
| satpy/tests/test_utils.py | 100.00% <100.00%> (ø) |
| satpy/readers/hdfeos_base.py | 92.63% <93.75%> (-0.02%) ⬇️ |
| ...y/tests/reader_tests/modis_tests/test_modis_l1b.py | 99.15% <94.73%> (-0.85%) ⬇️ |
| satpy/utils.py | 26.88% <21.42%> (-0.42%) ⬇️ |

@coveralls commented

Pull Request Test Coverage Report for Build 6343033343

  • 88 of 92 (95.65%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 95.492%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| satpy/readers/hdfeos_base.py | 14 | 15 | 93.33% |
| satpy/tests/reader_tests/modis_tests/test_modis_l1b.py | 18 | 19 | 94.74% |
| satpy/utils.py | 22 | 24 | 91.67% |
Totals:
  • Change from base Build 6339986467: 0.03%
  • Covered Lines: 48613
  • Relevant Lines: 50908

💛 - Coveralls

@djhoese (Member, Author) commented Sep 29, 2023

I'm not sure I can track down all the variations of my profiling of these changes, but the bottom line is that execution time goes from roughly 32s to 34s (so slightly worse, depending on the run) while memory usage drops by 2-4GB. It is hard to compare directly because EWA resampling only depends on dask's array.chunk-size, while this PR changes MODIS from needing PYTROLL_CHUNK_SIZE to using dask's config. So the chunking is more consistent on the reader side of things, but now changing the reader's chunk size (via dask's config option) also changes the chunk size of the output area. And as mentioned on Slack, the high memory usage here is not caused by the reader at all; it is the EWA resampling doing way too much fancy stuff with chunking. If we could reliably take a SwathDefinition and say "give me the chunk that overlaps this sub-area of the target area" (similar to gradient search), then we'd be able to do things in a much smarter way, I think.
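
For reference, with these changes the knob is dask's configuration rather than PYTROLL_CHUNK_SIZE; steering the chunk size looks something like this (the value here is just an example):

```python
import dask.config

# The reader now derives its chunk sizes from dask's "auto" chunking,
# which is controlled by this option (PYTROLL_CHUNK_SIZE is no longer used):
dask.config.set({"array.chunk-size": "32MiB"})
```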

@mraspaud (Member) left a comment

Thanks a lot for tackling the issue of multi-resolution chunk compatibility! Hopefully that will make satpy run better :)
Looks good code-wise, just a few comments and questions.

Inline review threads on satpy/utils.py, satpy/tests/test_utils.py, and satpy/readers/hdfeos_base.py (resolved).
@mraspaud (Member) left a comment

I have a suggestion for making the utility function more flexible/generic; otherwise this looks good.

Inline review thread on satpy/readers/hdfeos_base.py (resolved).

satpy/utils.py (outdated):
@@ -631,6 +633,112 @@ def _get_pytroll_chunk_size():
return None


def chunks_by_resolution(
@mraspaud (Member) commented:

Ok, I did quite some thinking about this, and I believe we can make this simpler and more flexible.
Basically, the idea is to follow the signature of dask's normalize_chunks more closely, so that the signature of this function could for example be:

def chunks_by_resolution(chunks, shape, low_res_multipliers, dtype, previous_chunks, high_res_shape):

where chunks would be e.g. (-1, "auto") for polar data (-1 meaning a full-width chunk), shape is the shape of the current array, low_res_multipliers would be e.g. (4, 4) for 1km data, previous_chunks would be optional and related to on-disk storage, and high_res_shape would be optional (good for the case where the high-res data shape is not the low-res data shape times the multiplier).
I believe this is all data the reader has when calling this utility function, and with this signature we allow n-dimensional arrays without having to guess anything.
With this, we can pass "auto" and -1 from the chunks spec straight to normalize_chunks, and I guess those would be the most common cases for the x and y dimensions, right?
If we do that, maybe this function should be renamed to e.g. normalize_low_res_chunks?
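
Roughly something like this (untested sketch, just to illustrate the signature; high_res_shape is omitted and computed from the multipliers instead):

```python
from dask.array.core import normalize_chunks

def normalize_low_res_chunks(chunks, shape, low_res_multipliers, dtype,
                             previous_chunks=None):
    # Compute chunks for the (possibly hypothetical) highest-resolution
    # array, then divide back down to this resolution so chunks stay
    # aligned across resolutions.
    high_res_shape = tuple(s * m for s, m in zip(shape, low_res_multipliers))
    high_res_chunks = normalize_chunks(chunks, shape=high_res_shape,
                                       dtype=dtype,
                                       previous_chunks=previous_chunks)
    return tuple(hr[0] // m
                 for hr, m in zip(high_res_chunks, low_res_multipliers))
```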

@djhoese (Member, Author) replied:

I'm OK with this, I think. I'll have to see how it looks when I implement it. Allowing parameters for each dimension is alone worth it. I'm not sure I'm sold on high_res_shape needing to be a parameter, as I'd think rounding any fractional results would produce the same thing.

@djhoese (Member, Author) replied:

Ok, major refactor done. Let me know what you think. I didn't add high_res_shape yet as I'm not sure I need it. Willing to be told otherwise.

@djhoese (Member, Author) replied:

Oh, I should have mentioned: I didn't pass previous chunks to dask because it gave me different results in my unit tests. I'll check on that in a bit and see whether that's expected.

@djhoese (Member, Author) replied:

Ok, added previous chunks now. This actually fixed something I hadn't noticed: chunks were being returned that weren't aligned to the on-disk chunking (previous chunks), at least not at the coarser resolutions.

I'm not sure I 100% like this logic, but I think it is good enough:

low_res_chunks.append(int(max(hr_chunks[0] / lr_mult, prev_chunks / lr_mult)))

My main concern is that prev_chunks / lr_mult could produce a smaller chunk size than the on-disk chunking, since prev_chunks could be coming from on-disk storage and would be for this particular resolution. So if we took an on-disk chunk of 100 and computed 100 / 4, we'd get 4 dask chunks per on-disk chunk, which has been shown to hurt performance pretty badly (ex. AGRI files). prev_chunks could just be used as a lower limit, but then we could end up negating the resolution stuff... actually, yeah, I'm going to change this.
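
That is, something like this instead (sketch of the lower-limit idea):

```python
# Divide the high-res chunk size down, but never go below the
# on-disk chunk size (prev_chunks) for this resolution.
low_res_chunks.append(int(max(hr_chunks[0] / lr_mult, prev_chunks)))
```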

@djhoese (Member, Author) commented Oct 5, 2023

@mraspaud I'm running into a logical issue with previous_chunks. If I pass it to dask's normalize_chunks to get the chunking for the highest resolution band, then I'm assuming that previous_chunks refers to the high-resolution data storage. For most cases this isn't true; I'm using the on-disk chunks for the current resolution of the data.

In the case of ABI at least, on-disk chunks are always the same between resolutions, which makes sense since the chunking is for compression and disk-bandwidth purposes, not resolution-relative processing. In the case of MODIS, though, I don't know the on-disk chunking, so I'm providing the minimum number of rows per high-resolution band (40 rows). For the 1km bands this means I take the chunk size returned by dask for the 250m band, divide by 4 (1km -> 250m), and get my new chunk size. If I do something similar for ABI and use on-disk chunk sizes, I'll end up with misaligned chunks. In one of my tests modeled after ABI I get:

  • chunks for 500m => 2034
  • chunks for 1km => 2034 / 2 => 1017
  • 1017 / 226 on-disk chunk size => 4.5 (not aligned)

Hmm, maybe I can pass dask previous_chunks * low_res_multiplier and it would behave sort of as I expect... maybe. This does assume that on-disk chunks do not differ widely between resolutions, though.
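
That is (sketch of the scaling idea):

```python
# Scale on-disk chunks up to the high-res grid before handing them to
# dask, so the later divide-by-multiplier keeps them aligned.
scaled_previous = tuple(p * m
                        for p, m in zip(previous_chunks, low_res_multipliers))
```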

@djhoese (Member, Author) commented Oct 5, 2023

Ugh, no. This helps with MODIS (as far as I can tell) but breaks the ABI cases, because dask produces a larger chunk size to accommodate the previous_chunks * low_res_multiplier previous chunk size. I'm not sure I can use previous_chunks in the dask call; I think that was one thing you were hoping for.

@mraspaud (Member) commented Oct 5, 2023

I was hoping on-disk chunks would be consistent across resolutions; if that's not the case, we are kind of screwed.

@djhoese (Member, Author) commented Oct 5, 2023

About the CodeScene failures: for one of them, it counts the docstring of the new utility function toward that function's line count. That's not fair.

The other complaints are valid, but I'm not really sure what to do about them. Too many arguments in the parametrized test function? Yeah, but any other form would make the tests harder to follow/configure.

I'll have to think about the number of arguments on the utility function.

And lastly, it is right about satpy/utils.py: it does too much and should be split into specific modules (e.g. satpy/_chunking.py).

@mraspaud (Member) commented Oct 5, 2023

Splitting utils sounds like a PR of its own...

@mraspaud (Member) left a comment

LGTM, thanks for starting this work on multi resolution chunking!

@mraspaud mraspaud merged commit 8d68425 into pytroll:main Oct 5, 2023
16 of 20 checks passed