Variable Translation using a DerivedVariableRegistry #70

Open
charles-turner-1 opened this issue Nov 13, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@charles-turner-1
Collaborator

Describe the issue

Intake-ESM supports a derived variable registry: a utility class that lets intake-ESM construct variables on the fly, at query time.

These registries need to be provided before the intake-ESM datastore is opened, so the updates will need to be made in intake-dataframe-catalog, approximately here. This would allow us to translate variables and add relevant metadata, potentially feeding into CMORisation efforts.
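To make the translation idea concrete, here is a minimal sketch of a translation table that generates one translator function per target variable. It uses a plain dict as a stand-in for a dataset and does not import intake-ESM; `TRANSLATIONS`, `make_translator` and the `SSS` entry are hypothetical names chosen for illustration, not part of any existing API.

```python
# Hypothetical translation table: standard name -> variable name used in the model output.
TRANSLATIONS = {
    "SST": "surface_temp",
    "SSS": "surface_salt",
}


def make_translator(target, source):
    """Return a function that exposes ds[source] under the name `target`.

    A factory is used so that each function binds its own target/source pair
    rather than sharing loop variables between closures.
    """
    def translate(ds):
        ds[target] = ds[source]
        return ds

    return translate


# One translator per entry. With a real registry, each function could instead be
# passed to the register decorator used elsewhere in this issue (an assumption, untested here).
translators = {
    target: make_translator(target, source)
    for target, source in TRANSLATIONS.items()
}

# Stand-in for a dataset: any mutable mapping works for this sketch.
ds = {"surface_temp": [290.0, 291.5], "surface_salt": [35.1, 34.9]}
for translate in translators.values():
    ds = translate(ds)

print(sorted(ds))  # ['SSS', 'SST', 'surface_salt', 'surface_temp']
```

Keeping the mapping as data rather than as hand-written functions would make it easier to maintain one table per model or convention.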

Properties of Derived Variable Registries

  1. Silent Failure
    • If we register a derived variable that depends on a variable which is unavailable in a dataset, the derived variable is simply not added to the xarray dataset when it is loaded, e.g.:
import intake
from intake_esm import DerivedVariableRegistry

dvr = DerivedVariableRegistry()

# Register a variable that should work
@dvr.register(variable='SST', query={'variable' : 'surface_temp'})
def SST(ds):
    ds['SST'] = ds['surface_temp'] - 273.15  # Kelvin to degrees Celsius
    ds['SST'].attrs = {'units' :'degC', 'long_name' : 'sea surface temperature', 'derived_by' : 'intake-esm'}
    return ds
    
# And one that won't - there is no 'surface_pCO2' variable in our dataset
@dvr.register(variable='SSpCO2', query={'variable' : 'surface_pCO2'})
def SSpCO2(ds):
    ds['SSpCO2'] = ds['surface_pCO2'] 
    ds['SSpCO2'].attrs = {'units' :'umol/kg', 'long_name' : 'sea surface partial pressure of CO2', 'derived_by' : 'intake-esm'}
    return ds

>>> dvr
DerivedVariableRegistry({'SST': DerivedVariable(func=<function SST at 0x146ee66d48b0>, variable='SST', query={'variable': ['surface_temp']}, prefer_derived=False), 'SSpCO2': DerivedVariable(func=<function SSpCO2 at 0x146ee5e3f250>, variable='SSpCO2', query={'variable': ['surface_pCO2']}, prefer_derived=False)})

dvr_cat = intake.open_esm_datastore(
    "/home/189/ct1163/derivedvar_test/derived_vars_datastore.json", 
    columns_with_iterables=["variable"], # This is important
    registry=dvr, 
)
fname = '/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output999/ocean/ocean_daily.nc'
>>> print(dvr_cat.search(path=fname).to_dask().data_vars)

Data variables:
    eta_t                  (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    surface_temp           (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    mld                    (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    sfc_hflux_from_runoff  (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    sfc_hflux_coupler      (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    sfc_hflux_pme          (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    frazil_3d_int_z        (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    pme_river              (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    surface_salt           (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    tau_x                  (time, yu_ocean, xu_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    tau_y                  (time, yu_ocean, xu_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    usurf                  (time, yu_ocean, xu_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    vsurf                  (time, yu_ocean, xu_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    average_T1             (time) datetime64[ns] 736B dask.array<chunksize=(92,), meta=np.ndarray>
    average_T2             (time) datetime64[ns] 736B dask.array<chunksize=(92,), meta=np.ndarray>
    average_DT             (time) timedelta64[ns] 736B dask.array<chunksize=(92,), meta=np.ndarray>
    time_bounds            (time, nv) timedelta64[ns] 1kB dask.array<chunksize=(1, 2), meta=np.ndarray>
    SST                    (time, yt_ocean, xt_ocean) float32 4GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
  • Unless this causes unanticipated performance issues, we should be able to pass in a comprehensive list of all variables without worrying about missing variables breaking the registry.
  2. Although derived variables are computed lazily, intake-ESM handles searching for derived variables smoothly, e.g.:
>>> dvr_cat.search(variable='SST', file_id='ocean_month').to_dask()
<xarray.Dataset> Size: 261GB
Dimensions:       (time: 3360, yt_ocean: 2700, xt_ocean: 3600)
Coordinates:
  * xt_ocean      (xt_ocean) float64 29kB -279.9 -279.8 -279.7 ... 79.85 79.95
  * yt_ocean      (yt_ocean) float64 22kB -81.11 -81.07 -81.02 ... 89.94 89.98
  * time          (time) object 27kB 1900-01-16 12:00:00 ... 2179-12-16 12:00:00
Data variables:
    surface_temp  (time, yt_ocean, xt_ocean) float32 131GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
    SST           (time, yt_ocean, xt_ocean) float32 131GB dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
Attributes:
    filename:                        ocean_month.nc
    title:                           ACCESS-OM2-01
    grid_type:                       mosaic
    grid_tile:                       1
    intake_esm_vars:                 ['surface_temp']
    intake_esm_attrs:filename:       ocean_month.nc
    intake_esm_attrs:file_id:        ocean_month
    intake_esm_attrs:frequency:      1mon
    intake_esm_attrs:realm:          ocean
    intake_esm_attrs:_data_format_:  netcdf
    intake_esm_dataset_key:          ocean_month.1mon

>>> dvr_cat.search(variable='SSpCO2', file_id='ocean_month') # Search the SSpCO2 variable, which we know won't work.
derived_vars_datastore catalog with 0 dataset(s) from 0 asset(s):
...

NB. You still search on variable, not on a separate derived-variable field, and intake-ESM also returns the original variable from which the derived variable was computed.

  3. Derived variable registries can be used to create user-defined variables as well as for translation: see docs. I think it should be possible to let users hook into this and register their own derived variables, so that they can search for the subset of catalogued datasets from which that derived variable can be obtained.
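As a rough illustration of that last point, the sketch below finds which catalogued datasets could supply a user-defined derived variable, given the variables it depends on. It works on plain dictionaries rather than a real catalogue; `datasets_supporting` and the example catalogue contents are hypothetical and only mirror the behaviour shown above, where a derived variable with a missing dependency matches nothing.

```python
def datasets_supporting(required, catalogue):
    """Return the keys of datasets that contain every required variable.

    required:  iterable of variable names the derived variable depends on
    catalogue: dict mapping dataset key -> list of variable names it holds
    """
    required = set(required)
    return sorted(
        key for key, variables in catalogue.items()
        if required <= set(variables)
    )


# Hypothetical catalogue contents, for illustration only.
catalogue = {
    "ocean_daily.1day": ["eta_t", "surface_temp", "mld"],
    "ocean_month.1mon": ["surface_temp", "surface_salt"],
    "ice_month.1mon": ["aice", "hi"],
}

print(datasets_supporting(["surface_temp"], catalogue))
# ['ocean_daily.1day', 'ocean_month.1mon']
print(datasets_supporting(["surface_pCO2"], catalogue))
# []
```

An empty result here corresponds to the empty search result for SSpCO2 above: nothing fails, the derived variable is just unavailable.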
Projects
Status: Backlog
Development

No branches or pull requests

1 participant