Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Petition to make xscen a dependency of miranda #128

Open
aulemahal opened this issue May 15, 2023 · 17 comments
Open

Petition to make xscen a dependency of miranda #128

aulemahal opened this issue May 15, 2023 · 17 comments

Comments

@aulemahal
Copy link
Collaborator

Sorry everybody, I am again advocating for the integration of xscen into miranda. This came while working on the new schema...

Here are my arguments.

  1. Both packages need to be able to manage catalogues of datasets, parse directories and build directories. They both must share the same column definitions for our work to work. Why split the definitions ? Currently, the list of available columns is defined n both places. While miranda has an automatic validator, xscen has a human-readable documentation of them. Both have mappings to translate common frequency names unto standardized vocab, while only xscen has mappings to the "xrfreq" codes.

  2. They both have a pretty large "common scope". Miranda takes raw data from different sources and creates nicely formatted files. In theory, xscen takes up from that point. However, in practice most operations miranda does once the files are opened kinda falls in xscen's scope too.

  3. Earlier, an very good argument against was that xscen was private and internal. It isn't anymore as it's now on anaconda and pypi. And work is already been done to add testing, our criterion before moving xscen to conda-forge.

  4. Another good argument was that the dependencies were too different, installing both would result in many useless packages for the other. Since a few months back, miranda now uses xESMF. I installed xscen from anaconda in an empty env and looked at which dependencies it pulled. The packages that xscen pulled that are not in miranda's env (from environment.yml) are:

{'cartopy',  'flox',  'libarchive',  'lzo',  'numpy_groupies',  'pyshp',  'python-tzdata',  'rechunker',  'xscen'}

where libarchive, lzo and python-tzdata are low dependencies that were not in miranda's env because of older packages versions for gdal and pandas.

This means the real extras are cartopy, flox and rechunker. The first is there because of the temporary solution to compute grid cell bounds on rotated pole data, waiting for a fix in cf-xarray, maybe miranda is interested in this function ? I would argue that flox should be added to miranda's env for performance improvements. And I'm pretty sure miranda could make use of rechunker.

  1. They are both developed by the same team. Why duplicate our efforts ?
@aulemahal
Copy link
Collaborator Author

In practice, I think the first step here would be to:

  • Move all frequency word handling to xscen.
  • Merge both date_parser functions.
  • Move path building stuff to xscen.

Afterwards, we could think of using more xscen's stuff in miranda's data corrections, file saving, etc. Merging the common functions along the way.

May be that one day, a "data ingestion pipeline" would be some kind of xscen workflow, making use of the yaml config and catalogs.

@Zeitsperre
Copy link
Collaborator

Thanks for summarizing these justifications. I don't know much of the internals of xscen, so getting a sense of how they work together now that both projects are beginning to mature (adolescent phase?), this is useful knowledge to have on hand.

The common scope for both projects does make it enticing to make one a dependency of the other. While a full installation of Miranda shares a lot in common with the base installation for xscen, there was a lot of work put in to move a lot of these dependencies into separate install recipes. Miranda is light enough to work in a pure python environment, and a lot of effort was made to move the external API-handling functionality (intake, intake-esm, fabric, cdsapi) to being completely optional. I'm very hesitant to call xesmf a dependency of miranda as it only appears in one function and is handled in the case that it is not installed.

The environment.yml encompasses everything possible at the moment, but there are some modules that still need to be pruned/optimized, so there's a possibility that even more will be removed/made optional. Making xscen a dependency goes against this idea.

The primary machinery in Miranda is much lower-level than an all-encompassing workflow system like xscen, and I would counter-propose that Miranda should be seen as a dependency for xscen, or that we should be creating a utility library to synchronize our templates instead

  1. The structural needs for xscen benefit running workflows for climate scenario development. I agree that miranda and xscen should be developed, but the (hypothetical) user of miranda is only interested in gathering (optionally), cleaning up, and structuring data for databases. Making xscen a core dependency adds a bunch of functionality that is not useful to the general user.
  2. A base installation of miranda does not require half as many dependencies as xscen (and many of those listed under requirements.txt, particularly the plotting and GIS libraries, are holdovers from previous versions, i.e. many can be removed). Xscen has many complex-to-install dependencies and can only be installed in a conda-forge environment.
  3. The problem that we are trying to solve is that there are common configurations (file structure, facet dictionaries) that are specific to Ouranos that both projects rely upon. We have the tooling, knowledge and resources to solve this problem specifically. Why not tackle this problem directly by creating a handful of common dictionaries/configurations for both?
  4. Both of these libraries have been undergoing massive changes in API and structure in the last few months, so there's also something to be said about needing to be wary of changes in both when developing.

It likely comes as no surprise that I'm really not convinced that this is a good idea. Yes, there are commonalities for both, but I think there's a much better way (and would be willing to place effort in solving this problem more directly).

@juliettelavoie
Copy link
Collaborator

juliettelavoie commented May 15, 2023

This is just my opinion but I think Ouranos users should be priotitize over future external users of Miranda (or xscen for that matter). When deciding what to do, we should keep in mind the non-developper current xscen users. It is not ideal that they should have to go dig into miranda to make xscen work.

Also, I think we might be mixing up 2 things:

  1. The Ouranos specific configuration, i.e., the structure
  2. the utils that are useful for both projects, i.e., date_parser, handling frequencies, handling chunks

We might have 2 solutions for those 2 things.

@aulemahal
Copy link
Collaborator Author

(@Zeitsperre maybe we should stick to the issue for the discussion, so everyone is able to participate)

For the "configuration" issue, one thought and one idea:

  • Both miranda and xscen make use of hardcoded columns names with logic based on the expected values from those columns. Pulling out the "configuration" out might not be easy.
  • ouranos_data_catalog might be the best place for this. It would store the column names, definitions, validators, the schema and etc.

As for the "common tooling" issue, my fear is that the solution for duplicate code is to add one more package to maintain. Not sure its very time efficient for us. Also, it adds a layer for the user, documentation is spread over one more package.

And a final thought, mainly about the dependencies issue. Who works with non-conda virtual environments ? Personally, since the arrival of Mamba, I don't think I'll suggest anything else to users, at least in the near and mid future. And conda makes the usage of "optional" dependencies kinda irrelevant : the standard way to doing things is installing everything. Which is to say that it is normal under this convention to have packages that you only use in edge cases, and to have packages with a much larger or barely-overlapping scope.

This said, I'll continue thinking about other solutions. To be clear : my primary goal is reducing the workload.

@juliettelavoie
Copy link
Collaborator

A potentiel con for putting the structure in ouranos-data-catalog: Harder to share our workflows with partners. Real-life example: Hydro-Québec might use the info-crue-cmip6 workflow as a base for their workflow. That code expects our columns. If this info is hidden, it will be harder for them to create catalogs in the right format.

Are we ready to share access to ouranos-data-catalog to everyone we want to share our workflows with ?

@aulemahal
Copy link
Collaborator Author

The very point of ouranos_data_catalog is to be private. If something is to be shareable, it shouldn't be there.

Maybe we are aiming to high with getting the list of columns outside xscen. Only the folder schema could be enough. We'd still have an example one in xscen.

@aulemahal
Copy link
Collaborator Author

aulemahal commented May 16, 2023

I thought about it and my preferred solution for the moment is quite status quo-like.

  • Copy paste the schema building code from miranda to xscen
  • Copy paste what's missing for xrfreq handling from xscen to miranda (No need as long as miranda doesn't handle derived data)
  • Put the "ouranos_schema.yml" file in ouranos_data_catalog. Move that repo to the gitlab.

(Moving to gitlab will allow simple "wget" requests to get the file from within the Ouranos network. You can't do that with a private github repo (needs authentication). )


And because I can't stop, another argument I thought of this morning. In the future, when we will want to upgrade code relating to the common tooling of xscen and miranda, chances are that the first implementation will happen in xscen. Implementation in miranda will usually wait a immediate need for that change, or depend on @Zeitsperre's energy and available time. In this aspect, xscen as a dependency of miranda is a way to lower the workload for miranda because most development effort will be made in xscen in any case.

@RondeauG
Copy link
Collaborator

RondeauG commented May 18, 2023

Sorry for the delay in this discussion. I don't know enough about miranda (especially since the refactoring) to be of much use in discussing details, but if we don't want to create a 3rd separate package, I would advocate for some kind of lightweight submodule in xscen.

I'm specifically thinking of catalog, io, config, and scripting, with potentially some general functions in utils mixed in that too.

  • import xscen would still import everything as is.
  • import xscen.utils_submodule_name would only import that subset. We could even include an environment file specific to it, without dependencies for cartopy or xesmf.
    • Even better if we find a way to enable pip/mamba install xscen-light

@aulemahal
Copy link
Collaborator Author

I like the idea, but I see some caveats that might influence the choice:

I'm not sure import xscen.utils would work though, AFAIU this would trigger import xscen anyway. I think the same idea should be implemented by isolating each "complex" parts into submodules. For example, this code :

import xscen as xs

xs.load_config(...)
xs.regrid_dataset(...)
xs.spatial_mean(...)

would become:

import xscen as xs
from xscen.regrid import regrid_dataset
from xscen.aggregate import spatial_mean

xs.load_config(...)
regrid_dataset(...)
spatial_mean(...)

with this approach we would be able to make many dependencies optional, the same way miranda does.

Note that this would make most sense in a non-conda venv. AFAIK, conda has no notion of optional dependencies installs. However, we could "hack" a xscen-light package in our own anaconda channel.


All that said, I realized yesterday that as long as "frequency" is the field we use in the structure of monthly/daily/hourly paths, miranda has no immediate need for handling "xrfreq". This way, only the path building code needs to be duplicated.

@Zeitsperre
Copy link
Collaborator

Note that this would make most sense in a non-conda venv. AFAIK, conda has no notion of optional dependencies installs. However, we could "hack" a xscen-light package in our own anaconda channel.

Innovative solution, but I kind of hate it, haha. I would rather see useful functions migrate to another library than rely on a pseudo-offical conda-forge package hack.

Here's an alternative: We make a third helper library with all the io, catalog, validation functions, then we perform an import like so:

In miranda.io or xscen.io

from helper_func.io import *
from helper_func.io import __all__ as __all_io__

__all__ = __all_io__

I just performed a cursory review of the dependencies in miranda and there's likely more things that can be safely removed. Having a common package that doesn't move once we have a common set of functions is much easier to manage.

For common configurations, I want to propose an approach in spirograph and see whether we can port it to other libraries as well.

@juliettelavoie
Copy link
Collaborator

juliettelavoie commented May 18, 2023

  1. Is it really realistic that this 3rd library would not move ?
  2. Would there be a way to have this 3rd library be integrated in the xscen docs, in order to make it invisible to xscen users ? I feel like people are already mixed up between xclim/xscen and where to find the right info haha
  3. I really don't like
from xscen.aggregate import spatial_mean
spatial_mean(...)

that makes it way harder to read. This is what the beginning of ESPO-G is like right now, because the imports were not working when I wrote it. It caused a lot of confusion for Marco.

@RondeauG
Copy link
Collaborator

RondeauG commented May 18, 2023

  1. I really don't like
from xscen.aggregate import spatial_mean
spatial_mean(...)

Trevor's solution would prevent that. Given that xhydro will probably also want a bunch of functions in io, scripting, config, I would advocate for that separate library.

(Trevor also tells me that 2. is possible)

@juliettelavoie
Copy link
Collaborator

Doesn't xhydro need way more than potential_3rd_lib ? On your figure, ensemble, aggregate and indicators are in green.

@RondeauG
Copy link
Collaborator

RondeauG commented May 18, 2023

Maybe not. It'll be discussed in 2 weeks during an xhydro meeting, but instead of having a wrapper of a wrapper, I think those can be kept within xscen only, with an hydroclimatological workflow simply importing both xhydro (to run the model) and xscen (to make ensemble stats). The only one that I'm not sure about is indicators.

@aulemahal
Copy link
Collaborator Author

I get anxiety when thinking of adding another package to our workload... I can already barely keep up with xscen and xclim, and I have difficulties to give miranda the attention it deserves. I personally prefer the status quo to this.

@RondeauG
Copy link
Collaborator

I don't see that library as changing too often; it would be pretty self-contained and stable.

  • io, scripting were last modified in October 2022
  • config had minor changes fairly recently, but otherwise hasn't changed in a long while.
  • catalog might be the one with the most changes, but also the most links between xscen and miranda.

@Zeitsperre
Copy link
Collaborator

Zeitsperre commented May 18, 2023

@aulemahal

The beauty of this is that Miranda can continue to exist as a library that is built for one or two things (optionally, 3). The functions that relate to xscen that you'll be focusing on won't exist in Miranda any more (and I would be proposing enhancements that would affect both).

  1. Is it really realistic that this 3rd library would not move ?

Even if it does move, the functions that xscen and miranda (and xhydro) require would not. If we have very clean call signatures and returns on functions, then they won't change. New functions can be added to the third party library without impacting existing functions.

  1. Would there be a way to have this 3rd library be integrated in the xscen docs, in order to make it invisible to xscen users ? I feel like people are already mixed up between xclim/xscen and where to find the right info haha

We make use of intersphinx in Birdhouse to link documentation pages between projects. We already have experience using this plugin. Very easily performed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

4 participants