diff --git a/docs/source/index.rst b/docs/source/index.rst index 4572bd4..15a2589 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -68,6 +68,7 @@ so that blocks from one or more files can be arranged into aggregate datasets ac beyond nonzarr reference + reference_aggregation contributing advanced diff --git a/docs/source/reference_aggregation.rst b/docs/source/reference_aggregation.rst new file mode 100644 index 0000000..ee8e28f --- /dev/null +++ b/docs/source/reference_aggregation.rst @@ -0,0 +1,224 @@ +Aggregation special cases +============================= + +As we have already seen in this `page `_, +that the main purpose of ``kerchunk`` it to generate references, to view whole archive +of files like GRIB2, NetCDF etc, allowing us for direct access to the data. In +this part of the documentation, we will see some other efficient ways of +combining references. + +GRIB Aggregations +----------------- + +This reference aggregation method of GRIB files, developed by `Camus Energy `_, +and it functions if accompanying ``.idx`` files are present. It involves creating a reference index +for every GRIB message across the files that we want to aggregate. + +**But this procedure has some certain restrictions:** + + - GRIB files must paired with their ``.idx`` files + - The ``.idx`` file must be of *text* type. + - Only specialised for time-series data, where GRIB files + have *identical* structure. + - Each horizon(forecast time) must be indexed separately. + + +Utilizing this method can significantly reduce the time required to combine +references, cutting it down to a fraction of the previous duration. The original +idea was showcased in this `talk `_. +It follows a three step approach. + +**Three step approach:** + + 1. Extract and persist metadata directly from a few arbitrary grib + files for a given product such as HRRR SUBH, GEFS, GFS etc. + 2. Use the metadata mapping to build an index table of every grib + message from the ``.idx`` files + 3. Combine the index data with the metadata to build any FMRC + slice (Horizon, RunTime, ValidTime, BestAvailable) + + +*How is it faster* + +The ``.idx`` file otherwise known as an *index* file contains the key +metadata of the messages in the GRIB files. These metadata include `index`, `offset`, `datetime`, +`variable` and `forecast time` for their respective messages. This metadata +will be used to index every GRIB message. By following this approach, we only have to +``scan_grib`` a single GRIB file, not the whole archive. + +Building the index of a time horizon, first requires a single one-to-one mapping of GRIB/Zarr +metadata to the attributes in the idx file. Only constraint is the mapping needs to be +made from a single GRIB file, belonging to the *same time horizon*. The indexing process +primarily involves the `pandas `_ library. To confirm this, +see this `notebook `_. +After indexing a single time horizon, you can combine this index with indexes of +other time horizon and store it. + +.. note:: + The index in ``.idx`` file indexes the GRIB messages where as the ``k_index`` + (kerchunk index), index the variables + in those messages. + +The table mentioned below is a *k_index* made from a single GRIB file. + +.. list-table:: k_index for a single GRIB file + :header-rows: 1 + :widths: 5 10 15 10 20 15 10 20 20 30 10 10 10 + + * - + - varname + - typeOfLevel + - stepType + - name + - step + - level + - time + - valid_time + - uri + - offset + - length + - inline_value + * - 0 + - gh + - isobaricInhPa + - instant + - Geopotential height + - 0 days 06:00:00 + - 0.0 + - 2017-01-01 06:00:00 + - 2017-01-01 12:00:00 + - s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... + - 0 + - 47493 + - None + * - 1 + - t + - isobaricInhPa + - instant + - Temperature + - 0 days 06:00:00 + - 0.0 + - 2017-01-01 06:00:00 + - 2017-01-01 12:00:00 + - s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... + - 47493 + - 19438 + - None + * - 2 + - r + - isobaricInhPa + - instant + - Relative humidity + - 0 days 06:00:00 + - 0.0 + - 2017-01-01 06:00:00 + - 2017-01-01 12:00:00 + - s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... + - 66931 + - 10835 + - None + * - 3 + - u + - isobaricInhPa + - instant + - U component of wind + - 0 days 06:00:00 + - 0.0 + - 2017-01-01 06:00:00 + - 2017-01-01 12:00:00 + - s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... + - 77766 + - 22625 + - None + * - 4 + - v + - isobaricInhPa + - instant + - V component of wind + - 0 days 06:00:00 + - 0.0 + - 2017-01-01 06:00:00 + - 2017-01-01 12:00:00 + - s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... + - 100391 + - 20488 + - None + + +*What now* + +After creating the k_index as per the desired duration, we will use the ``DataTree`` model +from the `xarray-datatree `_ to view a +part(desired variables) or the whole of the aggregation, using the k_index. Below is a +tree model made from an aggregation of GRIB files produced from **GEFS** model hosted +in AWS S3 bucket. + +.. code-block:: bash + + DataTree('None', parent=None) + ├── DataTree('prmsl') + │ │ Dimensions: () + │ │ Data variables: + │ │ *empty* + │ │ Attributes: + │ │ name: Pressure reduced to MSL + │ └── DataTree('instant') + │ │ Dimensions: () + │ │ Data variables: + │ │ *empty* + │ │ Attributes: + │ │ stepType: instant + │ └── DataTree('meanSea') + │ Dimensions: (latitude: 181, longitude: 360, time: 1, step: 1, + │ model_horizons: 1, valid_times: 237) + │ Coordinates: + │ * latitude (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0 + │ * longitude (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0 + │ meanSea float64 8B ... + │ number (time, step) int64 8B ... + │ step (model_horizons, valid_times) timedelta64[ns] 2kB ... + │ time (model_horizons, valid_times) datetime64[ns] 2kB ... + │ valid_time (model_horizons, valid_times) datetime64[ns] 2kB ... + │ Dimensions without coordinates: model_horizons, valid_times + │ Data variables: + │ prmsl (model_horizons, valid_times, latitude, longitude) float64 124MB ... + │ Attributes: + │ typeOfLevel: meanSea + └── DataTree('ulwrf') + │ Dimensions: () + │ Data variables: + │ *empty* + │ Attributes: + │ name: Upward long-wave radiation flux + └── DataTree('avg') + │ Dimensions: () + │ Data variables: + │ *empty* + │ Attributes: + │ stepType: avg + └── DataTree('nominalTop') + Dimensions: (latitude: 181, longitude: 360, time: 1, step: 1, + model_horizons: 1, valid_times: 237) + Coordinates: + * latitude (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0 + * longitude (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0 + nominalTop float64 8B ... + number (time, step) int64 8B ... + step (model_horizons, valid_times) timedelta64[ns] 2kB ... + time (model_horizons, valid_times) datetime64[ns] 2kB ... + valid_time (model_horizons, valid_times) datetime64[ns] 2kB ... + Dimensions without coordinates: model_horizons, valid_times + Data variables: + ulwrf (model_horizons, valid_times, latitude, longitude) float64 124MB ... + Attributes: + typeOfLevel: nominalTop + + +.. tip:: + For a full tutorial on this workflow, refer this `kerchunk cookbook `_ + in `Project Pythia `_. + +.. raw:: html + +