Using intake-esm to load an ensemble: real-world problem that might lead to a tutorial/example #444
[tip: when adding code within triple backticks, if you add the language name after the opening backticks you get syntax highlighting. I edited the post above to add that ;) ] |
Thanks for documenting this use-case @jemmajeffree. Can we specify what the "typical" cluster size / resources should be to constrain this problem? For example "how much compute resource does the typical COSIMA affiliated student or post-doc have"? |
Thanks @navidcy @Thomas-Moore-Creative, I was running this on an XL ARE job, which is a bit more than I'd usually be using for data crunching, but I was feeling impatient. Most of the work I do is piggybacking pre-existing model runs, which is fairly low compute — as a ballpark estimate, I've used 10 kSU for my PhD so far, which involved analysing 9 different CMIP runs, but I understand that this is almost nothing for anyone running models |
@Thomas-Moore-Creative I'm not sure there's an easy answer to that question. I'd say most people within COSIMA have access to pretty generous compute, and because the cost of running the models far exceeds the cost of the analysis, we tend not to spend a lot of time thinking about the compute resources for analysis. A modelling based PhD student would likely use hundreds of kSU over the course of their degree. As an example, running the 0.25° version of ACCESS-OM2 with BGC consumes 13.1 kSU/year. So a 60 year run is ~0.8 MSU. The 0.1° version is much more expensive. Whereas a 24 hour session of the XL ARE cluster that @jemmajeffree mentioned is 420 SU. In general, the problem for analysis is human time, not compute time. |
Thanks Anton

With intake loading data now, I have a couple more comments on efficiency. I can get xr.open_mfdataset to work more than twice as fast as intake. To my understanding, intake wraps around open_dataset not open_mfdataset, so the optimisations for open_mfdataset don't transfer. Is that correct? In which case, the tiny chunks are a problem. I'd rather not pull everything into memory, though I agree that in the small example I've provided here it is a solution. Even doing that, the data loads faster with open_mfdataset and different chunking than intake.

```python
pr_daily = xr.open_mfdataset(
    [[filepath(str(m), 'pr', 'day', time) for m in range(1, 41)]
     for time in ('18500101-18991231', '19000101-19491231', '19500101-19991231', '20000101-20141231')],
    coords='minimal',    # Grab medium speed-up by not checking coordinates align
    compat='override',   # Grab medium speed-up by not checking coordinates align
    combine='nested',
    concat_dim=('time', 'SMILE_M'),
    preprocess=lambda x: x['pr'],  # Often doing something more complicated in here that gets bonus efficiency
    parallel=True,
    chunks={'SMILE_M': 1, 'time': 365, 'lat': -1, 'lon': -1},  # Chunks are big enough that dask scheduling overhead goes away
).assign_coords(SMILE_M=np.array([str(m) for m in range(1, 41)]))
```

Am I missing any intake optimisations to achieve comparable performance?

Direct comparisons of speed for loading and example computations are here: /scratch/nf33/jj8842/intake comparison.ipynb

I can see that I should have put this on hive to begin with, and it's probably more valuable there digging into efficiencies with intake - would you recommend moving it or is that overcomplicating things? |
SIDEWAYS question, @anton-seaice et al. I'm assuming "l" stands for "link" and "f" for "file"; my anecdotal experience is that results can be different, and that not specifying leads to duplication. Does anyone have more insight, or a link to authoritative NCI documentation? |
@Thomas-Moore-Creative, empirically, using "l" gets you just the most recent ACCESS-ESM1-5 data files (as stored under the version "latest" symlink, not necessarily the actual most recent version) rather than other bits and pieces, but I couldn't tell you what it means |
Thanks for your experience on this. Sounds like you've tested this? Do we know if NCI documents this anywhere? The very useful ACCESS-NRI intake catalog docs from @dougiesquire use "f" in the example here,
but that example might not be best practice if you want the latest CMIP file versions? Also, on the ...
and that this approach on chunking could be used: ...
I use this with ACCESS-ESM1.5 data in ways like this: ...
I will find some time to run your example. Without looking at it, I wonder if the list of file paths you are sending to ... I also note that if you use ... Apologies if these comments are obvious or unhelpful given the work you've already done. |
Starting from the bottom: |
Depending on your needs, it might be. I probably do kinda stupid things all the time. 😄 |
Not that I know of, you could use something like this:
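The snippet that was here did not survive this copy. Purely as an illustrative, hedged sketch (the catalogue path is a placeholder, and the `path` and `file_type` column names are assumptions about the NCI datastore schema), a crude pandas filter on the datastore's dataframe might look like:

```python
import intake

# Placeholder path: point this at the actual NCI CMIP6 catalogue JSON
cat = intake.open_esm_datastore("/path/to/cmip6_catalog.json")

# Crude match: keep rows whose path goes through a "latest" version symlink
latest = cat.df[cat.df["path"].str.contains("/latest/")]

# Stricter alternative: match explicit version directories like .../v20210318/...
versioned = cat.df[cat.df["path"].str.contains(r"/v\d{8}/", regex=True)]
```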
Or write some better regex to get a better match :) (Interestingly ...)
It looks like they are read in alphabetical order. It's safe to assume sortby will just change the order of the index, and not move anything else around in memory.
Sorry - I missed this the first time. This dataset has more than one timestep per file! So you can supply a time chunksize, which reduces the number of times the file has to be opened/read and improves caching efficiency, e.g. for the monthly data:
For the daily data those chunks would be too big, but a size of 365 could be good (see the sketch below):
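The monthly and daily examples referenced above are missing from this copy. As a hedged sketch of the idea (the chunk sizes and dimension names are illustrative assumptions, `cat_monthly` / `cat_daily` stand for already-searched datastore subsets, and the keyword follows recent intake-esm releases):

```python
# Monthly data: read each file in year-sized time chunks (sizes are illustrative)
monthly_dict = cat_monthly.to_dataset_dict(
    xarray_open_kwargs={"chunks": {"time": 12, "lat": -1, "lon": -1}},
)

# Daily data: the same chunks would be too big, so ~365 timesteps per chunk instead
daily_dict = cat_daily.to_dataset_dict(
    xarray_open_kwargs={"chunks": {"time": 365, "lat": -1, "lon": -1}},
)
```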
The training fairies have been cleaning up nf33, so I can't see this! Note that caching makes good time comparisons hard!
The only time I looked, the links just pointed to the files and they were the same. But I didn't look exhaustively at all of them.
My only suggestion would be the NCI helpdesk |
@anton-seaice Can you see x77? g40? v45? I put it in nf33 because it was the only directory I knew you would have access to |
v45 is ok, but I think you can use ... |
It's now in /scratch/public/jj8842 |
The timing is not really reliable for performance comparisons; it's an indication, but it changed wildly. I.e. I ran this notebook with the cells in the opposite order, and open_mfdataset was slower. Adding ...
as an argument in ... You also could supply xarray_combine_by_kwargs (e.g. ...), but I don't think it's necessary |
With those additional keywords, to_dataset_dict is working faster than xr.open_mfdataset. Having to concatenate a dictionary and then reorder dimensions is annoying, but I can work around it. For any future readers of this thread: the documentation for the arguments to_dataset_dict takes is here: https://intake-esm.readthedocs.io/en/stable/reference/api.html (I couldn't find it at the time I was originally trying to read in data, and had no idea xarray_combine_by_kwargs existed) |
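For those future readers, a minimal hedged sketch of passing both sets of keyword arguments through to_dataset_dict (in the current intake-esm API reference the combine argument is spelled xarray_combine_by_coords_kwargs; the `cat_daily` catalogue object and chunk sizes below are placeholders):

```python
dset_dict = cat_daily.to_dataset_dict(
    # Larger time chunks so dask scheduling overhead doesn't dominate (illustrative sizes)
    xarray_open_kwargs={"chunks": {"time": 365, "lat": -1, "lon": -1}},
    # Mirror the open_mfdataset settings above: skip coordinate alignment checks
    xarray_combine_by_coords_kwargs={"coords": "minimal", "compat": "override"},
)
```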
@anton-seaice - re: using ... |
@Thomas-Moore-Creative I found ... |
@jemmajeffree are you happy with where you landed in your "intake comparison" example? I'm going to compare it to my current approach for ACCESS-ESM1.5, see what I can learn from it, and possibly add to it. |
😃 |
I'm still not really happy with the way the ensemble members are combined, but I've ceased messing with it or trying to improve the approach. |
I don't really have an informed view ... I've found it can be totally arbitrary. One day it's 20 seconds, the next day it's 2 minutes for no apparent reason. I suspect restarting the kernel is a reasonable approach; I don't know to what extent / where data may be cached at a more fundamental level. |
I do think my "view" is more a "feeling" and "opinion" ( feelpinion? ) from trial and error rather than a fundamental understanding of how things like ... |
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: |
@jemmajeffree, wondering if you settled on how best to retain the ensemble member name labels on loading, and whether you feel comfortable with how robust / reliable it is? I am trying the below approach ( which is likely very similar to @anton-seaice's suggestion above ) based on the ...
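The code block referenced here is missing from this copy. As a hedged sketch of one way to keep member labels when combining the dictionary returned by to_dataset_dict (the position of member_id within the key, and the "member" dimension name, are assumptions):

```python
import xarray as xr

# Sort the keys so the member order is deterministic, pull a member label out of
# each key, then concatenate along a new "member" dimension.
keys = sorted(dset_dict)
members = [key.split(".")[-2] for key in keys]  # assumed position of member_id in the key
ens = xr.concat([dset_dict[key] for key in keys], dim="member").assign_coords(member=members)
```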
|
@Thomas-Moore-Creative I'm afraid not; I went back to what I was originally doing reading from individual filepaths |
@anton-seaice after some head-banging I did reach out to NCI about "f" vs "l". Apparently "f" means ... @jemmajeffree, if you or others are still using and relying on ... |
What I want to do is this:
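The code block from the original post is missing in this copy. Judging from the rest of the thread, it hard-codes file paths for the 40 ensemble members and stacks them along a new dimension; a hedged sketch of that idea (`filepath` is the author's path-building helper, not shown here, and the "*" time glob is an assumption):

```python
import xarray as xr

# Open each ensemble member's daily precipitation files and stack them along a
# new ensemble dimension, keeping the member number as a coordinate label.
members = [str(m) for m in range(1, 41)]
pr = xr.concat(
    [xr.open_mfdataset(filepath(m, "pr", "day", "*"))["pr"] for m in members],
    dim="SMILE_M",
).assign_coords(SMILE_M=members)
```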
but I understand that intake-esm should make these types of things easier and less hard-coded. So I tried to do this:
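Again the snippet is missing here; a hedged illustration of the kind of intake-esm call being attempted (the catalogue path and search terms are assumptions):

```python
import intake

# Placeholder: open the site's CMIP6 datastore and narrow to daily ACCESS-ESM1-5 pr
cat = intake.open_esm_datastore("/path/to/cmip6_catalog.json")
subset = cat.search(source_id="ACCESS-ESM1-5", variable_id="pr", table_id="day")

# to_dask needs the result to combine into a single dataset
pr = subset.to_dask()
```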
Which threw me errors, as did any variant on to_dask I tried (mostly about being unsure of how to combine the ensemble members together)