Add resolution dependent chunk sizing to 'modis_l1b' reader #2052
Conversation
The tests are failing because they depend on pytroll/python-geotiepoints#35
Thanks for this work! I'm a bit worried that CHUNK_SIZE disappears completely, as it still is the standard way to steer the chunk size in satpy.
Other than that, I have a bunch of questions inline
# Conflicts:
#	satpy/readers/modis_l1b.py
Codecov Report

```
@@            Coverage Diff            @@
##             main    #2052     +/-   ##
=========================================
  Coverage   94.91%   94.92%   +0.01%
=========================================
  Files         351      350       -1
  Lines       51215    51085     -130
=========================================
- Hits        48611    48491     -120
+ Misses       2604     2594      -10
```

Flags with carried forward coverage won't be shown.
Pull Request Test Coverage Report for Build 6343033343
💛 - Coveralls
I'm not sure I can track down all variations of my profiling of these changes, but the bottom line is that execution time maybe goes from 32s to 34s (so worse, but it depends) while memory usage drops by 2-4GB. It is hard to compare because EWA resampling only depends on dask's …
Thanks a lot for tackling the issue of multi-resolution chunk compatibility! Hopefully that will make satpy run better :)
Looks good code-wise, just a few comments and questions.
I have a suggestion for making the utility function more flexible/generic, otherwise looks good.
satpy/utils.py (outdated)

```diff
@@ -631,6 +633,112 @@ def _get_pytroll_chunk_size():
     return None
 
 
+def chunks_by_resolution(
```
Ok, I did quite some thinking about this, and I believe we can make this simpler and more flexible.

Basically the idea is to follow the signature of dask's `normalize_chunks` more closely, so that for example the signature of this function could be:

```python
def chunks_by_resolution(chunks, shape, low_res_multipliers, dtype, previous_chunks, high_res_shape):
```

where `chunks` would be e.g. `(-1, "auto")` for polar data (-1 meaning a full-width chunk), `shape` is the shape of the current array, `low_res_multipliers` would be e.g. `(4, 4)` for 1km data, `previous_chunks` would be optional and related to on-disk storage, and `high_res_shape` would be optional (good for the case where the high-res data shape is not the low-res data shape times the multiplier).

I believe this is all data that the reader has when calling this utility function, and with this signature we allow n-dimensional arrays without having to guess anything.

So with this, we can just pass `"auto"` and `-1` from the chunks spec straight to `normalize_chunks`. And I guess those would be the most used cases for the x and y dimensions, right?

If we do that, maybe this function should be renamed to e.g. `normalize_low_res_chunks`?
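A minimal sketch of what this proposal might look like when delegating to dask's `normalize_chunks` (this is my guess at the shape of the idea, not the merged satpy implementation; the function name and body are assumptions based on the discussion):

```python
# Hypothetical sketch, not the actual satpy code: scale the current
# (low-res) shape up to the highest resolution, let dask pick chunks
# there, then scale the result back down so chunk boundaries stay
# aligned across resolutions.
import numpy as np
from dask.array.core import normalize_chunks


def normalize_low_res_chunks(chunks, shape, low_res_multipliers, dtype, previous_chunks=None):
    high_res_shape = tuple(s * m for s, m in zip(shape, low_res_multipliers))
    high_res_prev = None
    if previous_chunks is not None:
        high_res_prev = tuple(p * m for p, m in zip(previous_chunks, low_res_multipliers))
    hr_chunks = normalize_chunks(
        chunks, shape=high_res_shape, dtype=dtype, previous_chunks=high_res_prev
    )
    # Use the first chunk of each dimension as the per-dimension chunk
    # size at the current resolution.
    return tuple(max(1, c[0] // m) for c, m in zip(hr_chunks, low_res_multipliers))
```

With this shape, `(-1, "auto")` specs pass straight through to dask as the comment suggests.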
I'm OK with this, I think. I'll have to see how it looks when I implement it. The main idea of allowing parameters for each dimension is worth it. I'm not sure I'm sold on `high_res_shape` needing to be a parameter, as I'd think that rounding any fractional results would produce the same thing.
Ok. Major refactor done. Let me know what you think. I didn't add high res shape yet as I'm not sure I need it. Willing to be told otherwise.
Oh I should have mentioned, I didn't pass previous chunks to dask because it gave me different results in my unit tests. I'll check on that in a bit and see if that's expected.
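For context on why the results differ: dask's `normalize_chunks` rounds `"auto"` chunk sizes to multiples of `previous_chunks` when it is supplied. A small standalone demonstration with made-up sizes (not taken from the PR):

```python
# Demonstration of dask's previous_chunks behavior with hypothetical
# sizes: a 1000x1000 float64 array under a 1 MiB chunk-size limit.
import numpy as np
from dask.array.core import normalize_chunks

shape = (1000, 1000)
no_prev = normalize_chunks(("auto", -1), shape, dtype=np.float64, limit="1MiB")
with_prev = normalize_chunks(
    ("auto", -1), shape, dtype=np.float64, limit="1MiB",
    previous_chunks=((40,) * 25, (1000,)),  # e.g. 40-row on-disk chunks
)
# With previous_chunks, the auto row chunks come out as multiples of 40,
# which is why adding the argument changes previously computed chunks.
```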
Ok, added previous chunks now. This actually fixed something I hadn't noticed: chunks were being returned that weren't actually aligned to the disk chunking (previous chunks), at least not at the coarser resolutions.

I'm not sure I 100% like this logic, but I think it is good enough:

```python
low_res_chunks.append(int(max(hr_chunks[0] / lr_mult, prev_chunks / lr_mult)))
```

My main concern is that `prev_chunks / lr_mult` could produce a smaller chunk size than the on-disk chunking, since `prev_chunks` could be coming from on-disk and would already be for this particular resolution. So if we took an on-disk chunk of 100 and got 100 / 4, then we'd be getting 4 dask chunks per on-disk chunk, which has been shown to hurt performance pretty badly (ex. AGRI files). It could just be set to `prev_chunks` as a lower limit, but then we could end up negating the resolution stuff... actually yeah, I'm going to change this.
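The lower-limit change described here could look something like the following (a hypothetical sketch of the clamping idea, not the merged code; the helper name is made up):

```python
# Hypothetical sketch: the resolution-scaled chunk size is never allowed
# to drop below the on-disk (previous) chunk size, so one dask chunk
# always covers at least one full on-disk chunk.
def low_res_chunk_size(hr_chunk, prev_chunk, lr_mult):
    scaled = hr_chunk // lr_mult         # high-res dask chunk scaled to this resolution
    return int(max(scaled, prev_chunk))  # clamp: at least one on-disk chunk


# With an on-disk chunk of 100 rows and a high-res chunk of 226 rows,
# naive scaling would give 226 // 4 = 56 rows (sub-on-disk-chunk reads);
# the clamp keeps it at 100.
```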
@mraspaud I'm running into a logical issue. In the case of ABI at least, on-disk chunks are always the same between resolutions, which makes sense since the chunking is for compression and disk bandwidth purposes, not resolution-relative processing. In the case of MODIS though, I don't know the on-disk chunking, so I'm providing the minimum number of rows per high resolution band (40 rows). For the 1km bands this means I take the chunk size returned by dask for the 250m band, divide by 4 (1km -> 250m), and get my new chunk size. If I do something similar for ABI and use on-disk chunk sizes, I'll end up with misaligned chunks. In one of my tests modeled after ABI I get:
Hhhmmm, maybe I can pass dask …
Ugh, no, this helps with MODIS (as far as I can tell), but breaks ABI cases because dask is producing a larger chunk size to accommodate the …
I was hoping on-disk chunks would be consistent across resolutions; if that's not the case, we are kind of screwed.
About the CodeScene failures: for one of them, it is counting the docstring on the new utility function toward the number of lines in that function. That's not fair. The other complaints are valid, but I'm not really sure what to do about them. Too many arguments in the parametrized test function? Yeah, but any other form would make the tests harder to follow/configure. I'll have to think about the number of arguments on the utility function. And lastly, it is right about …
Splitting utils sounds like a PR of its own...
LGTM, thanks for starting this work on multi-resolution chunking!
This pull request is my attempt to do something similar to #1980 by using dask's auto chunk sizing in a reader. However, this PR takes a different approach: it chooses chunk sizes based on what auto-chunking would produce for the 250m version of a dataset, the idea being that a user is probably going to upsample/interpolate the data to a higher resolution at some point, especially for 5km data. I've been doing a lot of other optimizations with my polar2grid tool and python-geotiepoints, so I don't have hard numbers on whether this feature makes a difference over hardcoding the chunk size for all variables. I'll try taking a look at that in the future if needed.
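To illustrate the mechanism this PR builds on: dask's auto chunk sizing derives chunk shapes from the array dtype and the configured chunk-size limit, so computing chunks for the 250m shape and dividing down keeps coarser bands aligned. A standalone illustration with hypothetical array sizes (the dimensions and the `// 4` scaling are examples, not values from the reader):

```python
# Standalone illustration of dask auto-chunking; array sizes are made up.
import dask
import dask.array as da
import numpy as np

with dask.config.set({"array.chunk-size": "32MiB"}):
    # A MODIS-like 250m band (hypothetical dimensions).
    band_250m = da.zeros((8120, 5416), dtype=np.float32, chunks="auto")
    row_chunk_250m = band_250m.chunks[0][0]
    # A 1km band would then use this chunk size divided by the
    # resolution multiplier, instead of a fixed CHUNK_SIZE.
    row_chunk_1km = max(1, row_chunk_250m // 4)
```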
@mraspaud I wanted to make this to get your feedback about how much you hate, like, or want changes on what I'm doing.