Preload (FCI) filehandlers for eager processing #2686

Open
wants to merge 83 commits into
base: main
065bc25
Start work on preloading filehandlers
gerritholl Dec 12, 2023
d0f4e47
Prepare yaml reader for preloading filehandlers
gerritholl Dec 13, 2023
36c71e9
Continue pre-loading implementation in yaml reader
gerritholl Dec 13, 2023
eeb05fa
Don't fail to predict files if unasked
gerritholl Dec 13, 2023
56575f5
Progress on Preloadable
gerritholl Dec 14, 2023
394be48
Add on-disk caching between repeat cycles
gerritholl Dec 15, 2023
437b096
More work on repeat-cycle caching
gerritholl Dec 15, 2023
c578a1d
Bump version number for "version added"
gerritholl Dec 18, 2023
87010f8
Unset automaskandscale when preloading
gerritholl Dec 18, 2023
75916ab
Split delayed loading in waiting and loading
gerritholl Dec 18, 2023
2d3fbb6
Add tests for waiting for file
gerritholl Dec 18, 2023
1e1cb21
Add test method for getting delayed from a file
gerritholl Dec 18, 2023
19c8b3c
Progress on tests for pre-loading
gerritholl Dec 18, 2023
875dcce
Update header in FCI source file
gerritholl Dec 18, 2023
1be2c79
Add tests and cleanup implementation
gerritholl Dec 19, 2023
c1a9cf4
Improve yaml reader tests
gerritholl Dec 19, 2023
c2f23cd
Improve tests for yaml reader and netcdf utils
gerritholl Dec 20, 2023
3093aff
Use max rather than min date for mocking
gerritholl Dec 20, 2023
4ffe742
Simplify method and repair windows support
gerritholl Dec 20, 2023
558ff3a
Wait longer in test
gerritholl Dec 20, 2023
690eedf
Control max tries with kw arguments
gerritholl Dec 20, 2023
31815dc
Added test for create_preloadable_cache
gerritholl Dec 20, 2023
fe2e992
Add documentation on pre-loading/eager processing
gerritholl Dec 20, 2023
d5365f1
Small doc fixes
gerritholl Dec 21, 2023
8de6a4d
Improve tests further
gerritholl Dec 21, 2023
fd1f7cf
Expanded preloading test
gerritholl Dec 21, 2023
1ef18ee
Guard against unmatching files
gerritholl Dec 21, 2023
5808c9f
Use different time sentinel
gerritholl Dec 21, 2023
e0a38cc
Add more logging
gerritholl Dec 21, 2023
68dc9d8
Use xarray rather than netCDF4 for opening
gerritholl Dec 21, 2023
c78d246
Use h5netcdf backend.
gerritholl Dec 21, 2023
9a60a25
Merge branch 'main' into preload-scene-dask-delayed
gerritholl Dec 21, 2023
c8d5ba4
Adapt docs and logging
gerritholl Dec 22, 2023
b99ea95
Merge branch 'main' into preload-scene-dask-delayed
gerritholl Feb 13, 2024
795d882
Refactor unit test test_preloaded_intsances
gerritholl Feb 13, 2024
a288799
Change fallback Preloadable
gerritholl Feb 14, 2024
da7d566
Refactor _wait_for_file
gerritholl Feb 14, 2024
9bafbbc
Refactor _can_get_from_other_rc
gerritholl Feb 14, 2024
fd4b225
Remove dead code
gerritholl Feb 16, 2024
445fc56
Merge branch 'main' into preload-scene-dask-delayed
gerritholl Jun 12, 2024
205121b
Doc fixes + remove accidentally removed lines
gerritholl Jun 12, 2024
d362ff9
Rename Preloadable class
gerritholl Jun 12, 2024
30e9755
Split test case in three
gerritholl Jun 12, 2024
38431d3
Use configuration parameters for preloading
gerritholl Jun 12, 2024
cda8581
secende/rejoin
gerritholl Jun 13, 2024
7f6a8d4
Add test to reproduce GH 2815
gerritholl Jun 14, 2024
6d31c20
make sure distributed client is local
gerritholl Jun 14, 2024
1e26d1a
Start utility function for distributed friendly
gerritholl Jun 14, 2024
be40c5b
Parameterise test and simplify implementation
gerritholl Jun 14, 2024
cbd00f0
Force shape and dtype. First working prototype.
gerritholl Jun 14, 2024
af4ee66
Add group support and speed up tests
gerritholl Jun 20, 2024
dad3b14
Add partial backward-compatibility fol file handle
gerritholl Jun 20, 2024
fc58ca4
Respect auto_maskandscale with new caching
gerritholl Jun 20, 2024
09c821a
Remove needless except block
gerritholl Jun 20, 2024
4f9c5ed
Test refactoring
gerritholl Jun 20, 2024
6aac13b
Merge branch 'bugfix-2815' into preload-scene-dask-delayed
gerritholl Jun 20, 2024
ec76fa6
Broaden test match string for test_filenotfound
gerritholl Jun 20, 2024
11e0d0f
Allow to use distributed or not
gerritholl Jun 20, 2024
06d8811
fix docstring example spelling
gerritholl Jul 24, 2024
aaf91b9
Prevent unexpected type promotion in unit test
gerritholl Jul 24, 2024
a2ad42f
Use block info getting a dd-friendly da
gerritholl Jul 24, 2024
9126bbe
Rename to serialisable and remove group argument
gerritholl Jul 25, 2024
5e576f9
Use wrapper class for auto_maskandscale
gerritholl Jul 25, 2024
63e7507
GB -> US spelling
gerritholl Jul 25, 2024
ea04595
Ensure meta dtype
gerritholl Jul 25, 2024
523671a
Merge branch 'main' into bugfix-2815
gerritholl Jul 25, 2024
fde3896
Fix spelling in test
gerritholl Jul 25, 2024
6a81e8c
Merge branch 'bugfix-2815' into preload-scene-dask-delayed
gerritholl Jul 25, 2024
5b137e8
Clarify docstring
gerritholl Jul 26, 2024
c2b1533
Use cache already in scene creation
gerritholl Jul 26, 2024
9fce5a7
Use helper function rather than subclass
gerritholl Jul 26, 2024
4993b65
restore non-cached group retrieval
gerritholl Jul 26, 2024
bc6af58
Merge branch 'bugfix-2815' into preload-scene-dask-delayed
gerritholl Jul 30, 2024
aca09ff
Guard against file handle caching when preloading
gerritholl Jul 30, 2024
ec49a6f
Preload netCDF4.Variable, not xarray.DataArray, from reference fileha…
gerritholl Jul 31, 2024
2b02228
Reorganise preloading documentation
gerritholl Aug 2, 2024
a1847b7
Fix typo in documentation
gerritholl Aug 2, 2024
32f2ff7
Rename preload configuration parameters
gerritholl Aug 2, 2024
aa36a75
Merge branch 'main' into preload-scene-dask-delayed
gerritholl Aug 2, 2024
269036e
Refactor test for getting the cache filename
gerritholl Aug 2, 2024
41637fa
Update versionadded to show 0.51, not 0.50
gerritholl Aug 5, 2024
bde842e
version added: 0.52
gerritholl Aug 15, 2024
3736fbf
Merge branch 'main' into preload-scene-dask-delayed
gerritholl Aug 23, 2024
94 changes: 94 additions & 0 deletions doc/source/advanced/index.rst
@@ -0,0 +1,94 @@
Advanced topics
===============

This section of the documentation contains advanced topics that are not relevant
for most users.

.. _preload:

Preloading segments for improved timeliness
-------------------------------------------

Normally, data need to exist before they can be read. This requirement
impacts data processing timeliness. For data arriving in segments,
Satpy can process each segment immediately as it comes in.
This feature currently only works with the :mod:`~satpy.readers.fci_l1c_nc` reader.
It is experimental, likely to be unstable, and may change.

Consider a near real time data reception situation where FCI segments are
delivered one by one. Classically, to produce full disc imagery, users
would wait for all needed segments to arrive, before they start processing
any data by passing all segments to the :class:`~satpy.scene.Scene`.
For more timely imagery production, users can create the Scene, load the
data, resample, and even call :meth:`~satpy.scene.Scene.save_datasets`
before the data are complete (:meth:`~satpy.scene.Scene.save_datasets` will wait until the data
are available, unless ``compute=False``).
Upon computation, much of the overhead in Satpy
internals has already been completed, and Satpy will process each segment
as it comes in.

To do so, Satpy caches a selection of data and metadata between segments
and between repeat cycles. Caching between segments happens in-memory
and needs no preparation from the user, but where data are cached
between repeat cycles, the user needs to create this cache first from
a repeat cycle that is available completely::

    >>> from satpy.readers import create_preloadable_cache
    >>> create_preloadable_cache("fci_l1c_nc", fci_files)

This needs to be done only once as long as data or metadata cached
between repeat cycles does not change (for example, the rows at which
each repeat cycle starts and ends). To make use of eager processing, set
the configuration variable ``readers.preload.enable``. When creating
the scene, pass only the path to the first segment::

    >>> import satpy
    >>> from satpy import Scene
    >>> satpy.config.set({"readers.preload.enable": True})
    >>> sc = Scene(
    ...     filenames=[path_to_first_segment],
    ...     reader="fci_l1c_nc")

Satpy will figure out the names of the remaining segments and find them as
they come in. If the data are already available, processing is similar to
the regular case. If the data are not yet available, Satpy will wait during
the computation of the dask graphs until data become available.
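
As a rough illustration of how the names of the remaining segments could be derived from the first one, the sketch below fills a file pattern with the information parsed from the first segment, substitutes the segment counter, and replaces the time fields by wildcards because their values cannot be known in advance. The simplified ``PATTERN``, the helper ``predict_segment_globs`` and the example values are assumptions made for this illustration and are not Satpy's implementation; only the tag names (``count_in_repeat_cycle``, ``start_time``, ``end_time``) are taken from the reader configuration.

```python
import fnmatch

# Simplified stand-in for the real FCI file pattern (assumption for illustration).
PATTERN = ("FCI_{start_time}_{end_time}_"
           "{repeat_cycle_in_day:04d}_{count_in_repeat_cycle:04d}.nc")


def predict_segment_globs(info, expected_segments, segment_tag, time_tags):
    """Return one glob pattern per expected segment.

    Fields listed in ``time_tags`` differ between segments in ways that
    cannot be predicted, so they are replaced by ``*``.
    """
    globs = []
    for segment in range(1, expected_segments + 1):
        fields = dict(info)
        fields[segment_tag] = segment
        for tag in time_tags:
            fields[tag] = "*"
        globs.append(PATTERN.format(**fields))
    return globs


# Hypothetical information parsed from the name of the first segment.
first_segment_info = {
    "start_time": "20240601120000",
    "end_time": "20240601120010",
    "repeat_cycle_in_day": 73,
    "count_in_repeat_cycle": 1,
}
globs = predict_segment_globs(
    first_segment_info,
    expected_segments=40,
    segment_tag="count_in_repeat_cycle",
    time_tags=["start_time", "end_time"],
)
# A later file can then be matched against the predicted pattern.
arrived = "FCI_20240601120010_20240601120020_0073_0002.nc"
matches_second = fnmatch.fnmatch(arrived, globs[1])
```

Matching with wildcards rather than exact names is one plausible way to cope with fields that vary between segments; how Satpy resolves such fields internally is not shown here.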

For additional configuration parameters, see the :ref:`configuration documentation <preload_settings>`.

Known limitations as of Satpy 0.52:

- Mixing different file types for the same reader is not yet supported.
  For FCI, that means it is not yet possible to mix FDHSI and HRFI data.
- When segments are missing, processing times out and no image will be produced.
  There is currently no way to produce an incomplete image with the missing
  segment left out.
- Dask may not order the processing of the chunks optimally. That means some
  dask workers may be waiting for chunks 33–40 as chunks 1–32 are coming in
  and are not being processed. Possible workarounds:

  - Use as many workers as there are chunks (for example, 40).
  - Use the dask distributed scheduler using
    ``from dask.distributed import Client; Client()``. This has only
    limited support in Satpy and is highly experimental. It should be possible
    to read FCI L1C data, resample it
    using the gradient search resampler, and write the resulting data using the
    ``simple_image`` writer. Neither the nearest neighbour resampler nor the GeoTIFF
    writer currently works (see https://github.com/pytroll/satpy/issues/1762).
    If you use this scheduler, set the configuration variable
    ``readers.preload.assume_distributed`` to True.
    This is not currently recommended in a production environment. Any feedback
    is highly welcome.

- Currently, Satpy merely checks the existence of a file and not whether it
  has been completely written. This may lead to incomplete files being read,
  which might lead to failures.
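
Because only the existence of a file is checked, one general way to avoid partially written files on the delivery side, independent of Satpy, is to write each file under a temporary name and rename it once it is complete. A minimal sketch, assuming the delivery software is under your control; ``deliver_atomically`` is a hypothetical helper and not part of Satpy:

```python
import os
import tempfile


def deliver_atomically(data: bytes, final_path: str) -> None:
    """Write ``data`` so that ``final_path`` only ever appears complete."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file in the target directory so that the
    # final rename does not cross filesystem boundaries.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename is atomic on POSIX systems.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The ``.part`` suffix is chosen so that the temporary name does not end in ``.nc`` and therefore would not match a file pattern that requires that extension.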

For more technical background reading, including hints
on how this could be extended to other readers, see the API documentation for
:class:`~satpy.readers.netcdf_utils.PreloadableSegments` and
:class:`~satpy.readers.yaml_reader.GEOSegmentYAMLReader`.
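
The FCI reader configuration annotates each required netCDF variable with the scopes in which it may be cached: ``segment`` (reuse between segments), ``rc`` (reuse between repeat cycles), or an empty list (neither). The sketch below shows one way such annotations could be interpreted. ``CACHE_RULES`` is an excerpt of the FCI configuration, whereas ``lookup_source``, its arguments and its return strings are hypothetical and do not mirror the actual ``PreloadableSegments`` code.

```python
# Excerpt of the per-variable caching annotations from fci_l1c_nc.yaml.
CACHE_RULES = {
    "data/{channel_name}/measured/start_position_row": ["rc"],
    "data/{channel_name}/measured/x": ["segment", "rc"],
    "data/{channel_name}/measured/effective_radiance": [],
    "state/platform/platform_altitude": ["segment"],
}


def lookup_source(variable, is_reference_segment, have_rc_cache):
    """Decide where a variable could come from before its file exists.

    Hypothetical decision order: prefer the in-memory copy from the
    reference segment, then the on-disk repeat-cycle cache, and fall
    back to waiting for the file itself.
    """
    scopes = CACHE_RULES[variable]
    if "segment" in scopes and not is_reference_segment:
        return "reference segment (in memory)"
    if "rc" in scopes and have_rc_cache:
        return "repeat-cycle cache (on disk)"
    return "file (wait until it arrives)"
```

Under these assumptions, only variables with an empty list, such as the radiance data itself, always require the file to be present; a reader that wants to support preloading would need a comparable annotation for each of its variables.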

.. versionadded:: 0.52

.. toctree::
:hidden:
:maxdepth: 1

preloaded_reading
41 changes: 41 additions & 0 deletions doc/source/config.rst
@@ -274,6 +274,47 @@ Clipping of negative radiances is currently implemented for the following reader

* ``abi_l1b``, ``ami_l1b``

.. _preload_settings:

Preloading segments
^^^^^^^^^^^^^^^^^^^

.. note::

Preloading is an advanced topic and experimental. See :ref:`preload`
for details.

* **Environment variable**: ``SATPY_READERS__PRELOAD__ENABLE``
* **YAML-Config Key**: ``readers.preload.enable``
* **Default**: False

Preload segments for those readers where it is supported.
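
The environment variable names in this section follow a visible convention: the ``SATPY_`` prefix is dropped, a double underscore separates nesting levels, and the remainder is lower-cased. A small sketch of that mapping follows; ``env_to_config_key`` is a hypothetical helper written for illustration, and the actual parsing is done by Satpy's configuration machinery.

```python
def env_to_config_key(name: str, prefix: str = "SATPY_") -> str:
    """Translate an environment variable name into a dotted config key."""
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    # Double underscores mark nesting levels; single underscores are kept.
    return name[len(prefix):].lower().replace("__", ".")
```

For example, ``SATPY_READERS__PRELOAD__ASSUME_DISTRIBUTED`` corresponds to ``readers.preload.assume_distributed``, with the single underscore in ``assume_distributed`` left untouched.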

* **Environment variable**: ``SATPY_READERS__PRELOAD__STEP``
* **YAML-Config Key**: ``readers.preload.step``
* **Default**: 2

When preloading, interval in seconds between checks for whether a file has
become available.

* **Environment variable**: ``SATPY_READERS__PRELOAD__ATTEMPTS``
* **YAML-Config Key**: ``readers.preload.attempts``
* **Default**: 300

When preloading, how many times to check if a file has become available
before giving up.
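
Taken together, ``step`` and ``attempts`` describe a simple polling loop. The sketch below illustrates that idea only; ``wait_for_file`` and its injectable ``sleep`` argument are hypothetical, and the exception Satpy raises on timeout is not shown in this documentation, so ``TimeoutError`` is an assumption.

```python
import os
import time


def wait_for_file(path, step=2.0, attempts=300, sleep=time.sleep):
    """Poll until ``path`` exists, checking at most ``attempts`` times."""
    for attempt in range(attempts):
        if os.path.exists(path):
            return path
        # No need to sleep after the final unsuccessful check.
        if attempt < attempts - 1:
            sleep(step)
    raise TimeoutError(f"{path} did not appear after {attempts} checks")
```

If checks are spaced by ``step`` as in this sketch, the defaults of 2 seconds and 300 attempts amount to waiting roughly ten minutes for a file before giving up.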

* **Environment variable**: ``SATPY_READERS__PRELOAD__ASSUME_DISTRIBUTED``
* **YAML-Config Key**: ``readers.preload.assume_distributed``
* **Default**: False

When preloading, assume we are working in a dask distributed environment.
When active, Satpy workers will secede and rejoin while waiting for files.
This might partially avoid the problem that tasks waiting for later files
are blocking workers, while tasks working on earlier files are needlessly
waiting in the queue. However, Satpy has limited compatibility with
dask distributed. Please refer to the note at :ref:`preload` before
considering this option.

Temporary Directory
^^^^^^^^^^^^^^^^^^^
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -70,6 +70,7 @@ Documentation
enhancements
writing
multiscene
advanced/index
dev_guide/index

.. toctree::
6 changes: 6 additions & 0 deletions satpy/_config.py
@@ -53,6 +53,12 @@
"sensor_angles_position_preference": "actual",
"readers": {
"clip_negative_radiances": False,
"preload": {
"enable": False,
"step": 2,
"attempts": 300,
"assume_distributed": False,
},
},
}

100 changes: 73 additions & 27 deletions satpy/etc/readers/fci_l1c_nc.yaml
@@ -21,33 +21,67 @@ file_types:
- "{pflag}_{location_indicator},{data_designator},MTI{spacecraft_id:1d}+{data_source}-1C-RRAD-FDHSI-{coverage}-{subsetting}-{component1}-BODY-{component3}-{purpose}-{format}_{oflag}_{originator}_{processing_time:%Y%m%d%H%M%S}_{facility_or_tool}_{environment}_{start_time:%Y%m%d%H%M%S}_{end_time:%Y%m%d%H%M%S}_{processing_mode}_{special_compression}_{disposition_mode}_{repeat_cycle_in_day:>04d}_{count_in_repeat_cycle:>04d}.nc"
expected_segments: 40
required_netcdf_variables: &required-variables
- attr/platform
- data/{channel_name}/measured/start_position_row
- data/{channel_name}/measured/end_position_row
- data/{channel_name}/measured/radiance_to_bt_conversion_coefficient_wavenumber
- data/{channel_name}/measured/radiance_to_bt_conversion_coefficient_a
- data/{channel_name}/measured/radiance_to_bt_conversion_coefficient_b
- data/{channel_name}/measured/radiance_to_bt_conversion_constant_c1
- data/{channel_name}/measured/radiance_to_bt_conversion_constant_c2
- data/{channel_name}/measured/radiance_unit_conversion_coefficient
- data/{channel_name}/measured/channel_effective_solar_irradiance
- data/{channel_name}/measured/effective_radiance
- data/{channel_name}/measured/x
- data/{channel_name}/measured/y
- data/{channel_name}/measured/pixel_quality
- data/{channel_name}/measured/index_map
- data/mtg_geos_projection
- data/swath_direction
- data/swath_number
- index
- state/celestial/earth_sun_distance
- state/celestial/subsolar_latitude
- state/celestial/subsolar_longitude
- state/celestial/sun_satellite_distance
- state/platform/platform_altitude
- state/platform/subsatellite_latitude
- state/platform/subsatellite_longitude
- time
# key/value: keys are variable names; each value is a list of strings stating
# whether the variable may be cached between segments (segment), between
# repeat cycles (rc), or neither (empty list)
attr/platform:
- segment
- rc
data/{channel_name}/measured/start_position_row:
- rc
data/{channel_name}/measured/end_position_row:
- rc
data/{channel_name}/measured/radiance_to_bt_conversion_coefficient_wavenumber:
- segment
- rc
data/{channel_name}/measured/radiance_to_bt_conversion_coefficient_a:
- segment
- rc
data/{channel_name}/measured/radiance_to_bt_conversion_coefficient_b:
- segment
- rc
data/{channel_name}/measured/radiance_to_bt_conversion_constant_c1:
- segment
- rc
data/{channel_name}/measured/radiance_to_bt_conversion_constant_c2:
- segment
- rc
data/{channel_name}/measured/radiance_unit_conversion_coefficient:
- segment
- rc
data/{channel_name}/measured/channel_effective_solar_irradiance:
- segment
- rc
data/{channel_name}/measured/effective_radiance: []
data/{channel_name}/measured/x:
- segment
- rc
data/{channel_name}/measured/y:
- rc
data/{channel_name}/measured/pixel_quality: []
data/{channel_name}/measured/index_map: []
data/mtg_geos_projection:
- segment
- rc
data/swath_direction:
- segment
data/swath_number:
- rc
index: []
state/celestial/earth_sun_distance:
- segment
state/celestial/subsolar_latitude:
- segment
state/celestial/subsolar_longitude:
- segment
state/celestial/sun_satellite_distance:
- segment
state/platform/platform_altitude:
- segment
state/platform/subsatellite_latitude:
- segment
state/platform/subsatellite_longitude:
- segment
time: []
variable_name_replacements:
channel_name:
- vis_04
@@ -66,12 +100,24 @@ file_types:
- ir_105
- ir_123
- ir_133
# information needed for preloading
segment_tag: count_in_repeat_cycle
time_tags: &tt
- processing_time
- start_time
- end_time
variable_tags: # other tags where RC caching doesn't care about variation
- repeat_cycle_in_day
fci_l1c_hrfi:
file_reader: !!python/name:satpy.readers.fci_l1c_nc.FCIL1cNCFileHandler
file_patterns:
- "{pflag}_{location_indicator},{data_designator},MTI{spacecraft_id:1d}+{data_source}-1C-RRAD-HRFI-{coverage}-{subsetting}-{component1}-BODY-{component3}-{purpose}-{format}_{oflag}_{originator}_{processing_time:%Y%m%d%H%M%S}_{facility_or_tool}_{environment}_{start_time:%Y%m%d%H%M%S}_{end_time:%Y%m%d%H%M%S}_{processing_mode}_{special_compression}_{disposition_mode}_{repeat_cycle_in_day:>04d}_{count_in_repeat_cycle:>04d}.nc"
expected_segments: 40
required_netcdf_variables: *required-variables
segment_tag: count_in_repeat_cycle
time_tags: *tt
variable_tags:
- repeat_cycle_in_day
variable_name_replacements:
channel_name:
- vis_06_hr
32 changes: 29 additions & 3 deletions satpy/readers/__init__.py
@@ -1,6 +1,4 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2015-2018 Satpy developers
# Copyright (c) 2015-2024 Satpy developers

Check warning on line 1 in satpy/readers/__init__.py
CodeScene Delta Analysis / CodeScene Cloud Delta Analysis (main)
❌ New issue: Lines of Code in a Single File
This module has 605 lines of code, improve code health by reducing it to 600. More lines of code in a single file lower the code health.

Check notice on line 1 in satpy/readers/__init__.py
CodeScene Delta Analysis / CodeScene Cloud Delta Analysis (main)
✅ Getting better: Overall Code Complexity
The mean cyclomatic complexity decreases from 4.88 to 4.85, threshold = 4. This file has many conditional statements (e.g. if, for, while) across its implementation, leading to lower code health. Avoid adding more conditionals.
#
# This file is part of satpy.
#
@@ -808,3 +806,31 @@
except AttributeError:
f_obj = unknown_file_thing
return f_obj


def create_preloadable_cache(reader_name, filenames):
    """Create on-disk cache for preloadables.

    Some readers allow on-disk caching of metadata. This can be used to
    preload data, creating file handlers and a scene object before data files
    are available. This utility function creates the associated cache for
    multiple filenames.

    Args:
        reader_name (str):
            Reader for which to create the cache.
        filenames (List[str]):
            Files for which to create the cache. Typically, this would be
            the same set of files to create a single scene.
    """
    reader_instances = load_readers(filenames, reader_name)
    for (nm, reader_inst) in reader_instances.items():
        for (tp, handlers) in reader_inst.file_handlers.items():
            for handler in handlers:
                filename = reader_inst._get_cache_filename(
                    handler.filename,
                    handler.filename_info,
                    handler)
                p = pathlib.Path(filename)
                p.parent.mkdir(exist_ok=True, parents=True)
                handler.store_cache(filename)
23 changes: 11 additions & 12 deletions satpy/readers/fci_l1c_nc.py
@@ -1,6 +1,4 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2017-2019 Satpy developers
# Copyright (c) 2017-2024 Satpy developers
#
# This file is part of satpy.
#
@@ -20,12 +18,10 @@

This module defines the :class:`FCIL1cNCFileHandler` file handler, to
be used for reading Meteosat Third Generation (MTG) Flexible Combined
Imager (FCI) Level-1c data. FCI flies
on the MTG Imager (MTG-I) series of satellites, with the first satellite (MTG-I1)
launched on the 13th of December 2022.
For more information about FCI, see `EUMETSAT`_.
Imager (FCI) Level-1c data. FCI flies on the MTG Imager (MTG-I) series
of satellites, with the first satellite (MTG-I1) launched on the 13th
of December 2022. For more information about FCI, see `EUMETSAT`_.

For simulated test data to be used with this reader, see `test data releases`_.
For the Product User Guide (PUG) of the FCI L1c data, see `PUG`_.

.. note::
@@ -130,7 +126,7 @@
from satpy.readers._geos_area import get_geos_area_naming
from satpy.readers.eum_base import get_service_mode

from .netcdf_utils import NetCDF4FsspecFileHandler
from .netcdf_utils import NetCDF4FsspecFileHandler, PreloadableSegments

logger = logging.getLogger(__name__)

@@ -183,7 +179,7 @@
return channel_name


class FCIL1cNCFileHandler(NetCDF4FsspecFileHandler):
class FCIL1cNCFileHandler(PreloadableSegments, NetCDF4FsspecFileHandler):
"""Class implementing the MTG FCI L1c Filehandler.

This class implements the Meteosat Third Generation (MTG) Flexible
@@ -208,12 +204,15 @@
"MTI3": "MTG-I3",
"MTI4": "MTG-I4"}

def __init__(self, filename, filename_info, filetype_info):
def __init__(self, filename, filename_info, filetype_info,
*args, **kwargs):
"""Initialize file handler."""
super().__init__(filename, filename_info,
filetype_info,
*args,
cache_var_size=0,
cache_handle=True)
cache_handle=True,
**kwargs)

Check warning on line 215 in satpy/readers/fci_l1c_nc.py
CodeScene Delta Analysis / CodeScene Cloud Delta Analysis (main)
❌ New issue: Excess Number of Function Arguments
FCIL1cNCFileHandler.__init__ has 5 arguments, threshold = 4. This function has too many arguments, indicating a lack of encapsulation. Avoid adding more arguments.
logger.debug("Reading: {}".format(self.filename))
logger.debug("Start: {}".format(self.start_time))
logger.debug("End: {}".format(self.end_time))
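
The change to ``FCIL1cNCFileHandler`` places ``PreloadableSegments`` before ``NetCDF4FsspecFileHandler`` in the base-class list and forwards ``*args`` and ``**kwargs``. That is the usual shape of a cooperative ``super().__init__`` chain in which a mixin can consume its own keyword arguments. A toy illustration of the pattern follows; all class names and the ``preload`` keyword are invented for this example and say nothing about the real signatures.

```python
class BaseHandler:
    """Stand-in for a regular file handler."""

    def __init__(self, filename, cache_handle=False):
        self.filename = filename
        self.cache_handle = cache_handle


class PreloadMixin:
    """Stand-in mixin that consumes one keyword and passes the rest on."""

    def __init__(self, *args, preload=False, **kwargs):
        self.preload = preload
        super().__init__(*args, **kwargs)


class Handler(PreloadMixin, BaseHandler):
    """The mixin comes first so that its __init__ runs before the base."""

    def __init__(self, filename, *args, **kwargs):
        super().__init__(filename, *args, cache_handle=True, **kwargs)


handler = Handler("segment.nc", preload=True)
mro_names = [cls.__name__ for cls in Handler.__mro__]
```

Without the ``*args``/``**kwargs`` forwarding in ``Handler.__init__``, the ``preload`` keyword would never reach the mixin, which is presumably why the signature was widened in this change.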