
Add resolution dependent chunk sizing to 'modis_l1b' reader #2052

Merged: 17 commits merged into pytroll:main on Oct 5, 2023

Conversation

@djhoese (Member) commented Mar 11, 2022

This pull request is my attempt to do something similar to #1980, using dask's auto chunk sizing in a reader. However, this PR takes a different approach: it chooses chunk sizes based on what auto-chunking would produce for the 250m version of a dataset. The idea is that a user is probably going to upsample/interpolate the data to a higher resolution at some point, especially for 5km data. I've been doing a lot of other optimizations with my polar2grid tool and python-geotiepoints, so I don't have hard numbers on whether this feature makes a difference over hardcoding the chunk size for all variables. I'll try taking a look at that in the future if needed.
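
Roughly, the idea is the following (a sketch only; the helper name and details are illustrative, not the actual reader code):

```python
import dask.array as da

def _chunks_scaled_from_250m(shape_1km, dtype, factor=4):
    # Ask dask what "auto" chunking would be for the (hypothetical)
    # 250m version of this 1km array, then scale the chunk size back
    # down so chunks across resolutions line up after interpolation.
    shape_250m = tuple(s * factor for s in shape_1km)
    chunks_250m = da.core.normalize_chunks("auto", shape=shape_250m, dtype=dtype)
    return tuple(c[0] // factor for c in chunks_250m)
```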

@mraspaud I wanted to open this to get your feedback on how much you hate it, like it, or want changes to what I'm doing.

  • Closes #xxxx
  • Tests added
  • Fully documented

@djhoese added the enhancement, component:readers, and optimization labels Mar 11, 2022
@djhoese djhoese requested a review from mraspaud as a code owner March 11, 2022 01:38
@djhoese djhoese self-assigned this Mar 11, 2022
@djhoese (Member, Author) commented Mar 11, 2022

The tests are failing because they depend on pytroll/python-geotiepoints#35.

@mraspaud (Member) left a comment

Thanks for this work! I'm a bit worried that CHUNK_SIZE disappears completely, as it is still the standard way to steer the chunk size in satpy.
Other than that, I have a bunch of questions inline.

Inline review threads on satpy/readers/hdfeos_base.py and satpy/readers/modis_l1b.py (resolved).
@djhoese djhoese requested a review from pnuu as a code owner September 27, 2023 19:09
codecov bot commented Sep 28, 2023

Codecov Report

Merging #2052 (2036251) into main (95eb01a) will increase coverage by 0.00%.
Report is 5 commits behind head on main.
The diff coverage is 86.17%.

@@           Coverage Diff            @@
##             main    #2052    +/-   ##
========================================
  Coverage   94.91%   94.92%            
========================================
  Files         351      350     -1     
  Lines       51215    51085   -130     
========================================
- Hits        48611    48491   -120     
+ Misses       2604     2594    -10     
| Flag | Coverage Δ |
| --- | --- |
| behaviourtests | 4.28% <6.38%> (-0.01%) ⬇️ |
| unittests | 95.54% <97.50%> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
| --- | --- |
| satpy/_compat.py | 57.14% <100.00%> (-8.17%) ⬇️ |
| satpy/_config.py | 96.59% <100.00%> (+5.01%) ⬆️ |
| satpy/readers/modis_l1b.py | 65.04% <100.00%> (ø) |
| satpy/resample.py | 79.80% <100.00%> (+0.49%) ⬆️ |
| .../tests/reader_tests/modis_tests/_modis_fixtures.py | 97.61% <100.00%> (ø) |
| satpy/tests/test_utils.py | 100.00% <100.00%> (ø) |
| satpy/readers/hdfeos_base.py | 92.63% <93.75%> (-0.02%) ⬇️ |
| ...y/tests/reader_tests/modis_tests/test_modis_l1b.py | 99.15% <94.73%> (-0.85%) ⬇️ |
| satpy/utils.py | 26.88% <21.42%> (-0.42%) ⬇️ |

@coveralls commented

Pull Request Test Coverage Report for Build 6343033343

  • 88 of 92 (95.65%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 95.492%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| satpy/readers/hdfeos_base.py | 14 | 15 | 93.33% |
| satpy/tests/reader_tests/modis_tests/test_modis_l1b.py | 18 | 19 | 94.74% |
| satpy/utils.py | 22 | 24 | 91.67% |
Totals:
  • Change from base Build 6339986467: 0.03%
  • Covered Lines: 48613
  • Relevant Lines: 50908

💛 - Coveralls

@djhoese (Member, Author) commented Sep 29, 2023

I'm not sure I can track down all the variations of my profiling of these changes, but the bottom line is that execution time goes from roughly 32s to 34s (so slightly worse, depending on the run) while memory usage drops by 2-4GB. It is hard to compare directly because EWA resampling only depends on dask's array.chunk-size, while this PR changes MODIS from needing PYTROLL_CHUNK_SIZE to using dask's config. So the chunking is more consistent on the reader side of things, but now changing the reader's chunk size (via dask's config option) also changes the chunk size of the output area. And as mentioned on Slack, the high memory usage here is not caused by the reader at all; it is the EWA resampling doing way too much fancy stuff with chunking. If we could reliably take a SwathDefinition and say "give me the chunk that overlaps this sub-area of the target area" (similar to gradient search), then we'd be able to do things in a much smarter way, I think.
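
For reference, with these changes the knob is dask's configuration rather than PYTROLL_CHUNK_SIZE; steering the chunk size looks something like this (the value here is just an example):

```python
import dask.config

# The reader now derives its chunk sizes from dask's "auto" chunking,
# which is controlled by this option (PYTROLL_CHUNK_SIZE is no longer used):
dask.config.set({"array.chunk-size": "32MiB"})
```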

@mraspaud (Member) left a comment

Thanks a lot for tackling the issue of multi-resolution chunk compatibility! Hopefully that will make satpy run better :)
Looks good code-wise, just a few comments and questions.

Inline review threads on satpy/utils.py, satpy/tests/test_utils.py, and satpy/readers/hdfeos_base.py (resolved).
@mraspaud (Member) left a comment

I have a suggestion for making the utility function more flexible/generic; otherwise this looks good.

Inline review thread on satpy/readers/hdfeos_base.py (resolved).

satpy/utils.py (outdated):
@@ -631,6 +633,112 @@ def _get_pytroll_chunk_size():
return None


def chunks_by_resolution(
@mraspaud (Member) commented:

Ok, I did quite some thinking about this, and I believe we can make this simpler and more flexible.
Basically, the idea is to follow the signature of dask's normalize_chunks more closely, so that the signature of this function could for example be:

def chunks_by_resolution(chunks, shape, low_res_multipliers, dtype, previous_chunks, high_res_shape):

where chunks would be e.g. (-1, "auto") for polar data (-1 meaning a full-width chunk), shape is the shape of the current array, low_res_multipliers would be e.g. (4, 4) for 1km data, previous_chunks would be optional and related to on-disk storage, and high_res_shape would be optional (good for the case where the high-res data shape is not the low-res data shape times the multiplier).
I believe this is all data the reader has when calling this utility function, and with this signature we allow n-dimensional arrays without having to guess anything.
With this, we can pass "auto" and -1 from the chunks spec straight to normalize_chunks, and I guess those would be the most common cases for the x and y dimensions, right?
If we do that, maybe this function should be renamed to e.g. normalize_low_res_chunks?
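
Roughly something like this (untested sketch, just to illustrate the signature; high_res_shape is omitted and computed from the multipliers instead):

```python
from dask.array.core import normalize_chunks

def normalize_low_res_chunks(chunks, shape, low_res_multipliers, dtype,
                             previous_chunks=None):
    # Compute chunks for the (possibly hypothetical) highest-resolution
    # array, then divide back down to this resolution so chunks stay
    # aligned across resolutions.
    high_res_shape = tuple(s * m for s, m in zip(shape, low_res_multipliers))
    high_res_chunks = normalize_chunks(chunks, shape=high_res_shape,
                                       dtype=dtype,
                                       previous_chunks=previous_chunks)
    return tuple(hr[0] // m
                 for hr, m in zip(high_res_chunks, low_res_multipliers))
```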

@djhoese (Member, Author) replied:

I'm OK with this, I think. I'll have to see how it looks when I implement it. Allowing parameters for each dimension is alone worth it. I'm not sure I'm sold on high_res_shape needing to be a parameter, as I'd think rounding any fractional results would produce the same thing.

@djhoese (Member, Author) replied:

Ok, major refactor done. Let me know what you think. I didn't add high_res_shape yet as I'm not sure I need it. Willing to be told otherwise.

@djhoese (Member, Author) replied:

Oh, I should have mentioned: I didn't pass previous chunks to dask because it gave me different results in my unit tests. I'll check on that in a bit and see whether that's expected.

@djhoese (Member, Author) replied:

Ok, added previous chunks now. This actually fixed something I hadn't noticed: chunks were being returned that weren't aligned to the on-disk chunking (previous chunks), at least not at the coarser resolutions.

I'm not sure I 100% like this logic, but I think it is good enough:

low_res_chunks.append(int(max(hr_chunks[0] / lr_mult, prev_chunks / lr_mult)))

My main concern is that prev_chunks / lr_mult could produce a smaller chunk size than the on-disk chunking, since prev_chunks could be coming from on-disk storage and would be for this particular resolution. So if we took an on-disk chunk of 100 and computed 100 / 4, we'd get 4 dask chunks per on-disk chunk, which has been shown to hurt performance pretty badly (ex. AGRI files). prev_chunks could just be used as a lower limit, but then we could end up negating the resolution stuff... actually, yeah, I'm going to change this.
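
That is, something like this instead (sketch of the lower-limit idea):

```python
# Divide the high-res chunk size down, but never go below the
# on-disk chunk size (prev_chunks) for this resolution.
low_res_chunks.append(int(max(hr_chunks[0] / lr_mult, prev_chunks)))
```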

@djhoese (Member, Author) commented Oct 5, 2023

@mraspaud I'm running into a logical issue with previous_chunks. If I pass it to dask's normalize_chunks to get the chunking for the highest resolution band, then I'm assuming that previous_chunks refers to the high-resolution data storage. For most cases this isn't true; I'm using the on-disk chunks for the current resolution of the data.

In the case of ABI at least, on-disk chunks are always the same between resolutions, which makes sense since the chunking is for compression and disk-bandwidth purposes, not resolution-relative processing. In the case of MODIS, though, I don't know the on-disk chunking, so I'm providing the minimum number of rows per high-resolution band (40 rows). For the 1km bands this means I take the chunk size returned by dask for the 250m band, divide by 4 (1km -> 250m), and get my new chunk size. If I do something similar for ABI and use on-disk chunk sizes, I'll end up with misaligned chunks. In one of my tests modeled after ABI I get:

  • chunks for 500m => 2034
  • chunks for 1km => 2034 / 2 => 1017
  • 1017 / 226 on-disk chunk size => 4.5 (not aligned)

Hmm, maybe I can pass dask previous_chunks * low_res_multiplier and it would behave sort of as I expect... maybe. This does assume that on-disk chunks do not differ widely between resolutions, though.
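
That is (sketch of the scaling idea):

```python
# Scale on-disk chunks up to the high-res grid before handing them to
# dask, so the later divide-by-multiplier keeps them aligned.
scaled_previous = tuple(p * m
                        for p, m in zip(previous_chunks, low_res_multipliers))
```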

@djhoese (Member, Author) commented Oct 5, 2023

Ugh, no. This helps with MODIS (as far as I can tell) but breaks the ABI cases, because dask produces a larger chunk size to accommodate the previous_chunks * low_res_multiplier previous chunk size. I'm not sure I can use previous_chunks in the dask call; I think that was one thing you were hoping for.

@mraspaud (Member) commented Oct 5, 2023

I was hoping on-disk chunks would be consistent across resolutions; if that's not the case, we are kind of screwed.

@djhoese (Member, Author) commented Oct 5, 2023

About the CodeScene failures: for one of them, it counts the docstring of the new utility function toward that function's line count. That's not fair.

The other complaints are valid, but I'm not really sure what to do about them. Too many arguments in the parametrized test function? Yeah, but any other form would make the tests harder to follow/configure.

I'll have to think about the number of arguments on the utility function.

And lastly, it is right about satpy/utils.py: it does too much and should be split into specific modules (e.g. satpy/_chunking.py).

@mraspaud (Member) commented Oct 5, 2023

Splitting utils sounds like a PR of its own...

@mraspaud (Member) left a comment

LGTM, thanks for starting this work on multi resolution chunking!

@mraspaud mraspaud merged commit 8d68425 into pytroll:main Oct 5, 2023
16 of 20 checks passed