Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to MAPL 2.23.1 and ESMA_env v4.2.0 #568

Merged
merged 4 commits into from
Jul 16, 2022

Conversation

mathomp4
Copy link
Member

@mathomp4 mathomp4 commented Jul 6, 2022

This is a PR to update GEOSldas to use MAPL v2.23.0 v2.23.1. Note that this update also requires a newer ESMA_env (aka Baselibs). This was due to needing some new interfaces in yaFyaml for use with ExtData2G.

All testing with GEOSgcm is zero-diff, so I'm labeling as such.

The equivalent GEOSgcm PR is GEOS-ESM/GEOSgcm#428

NOTE: For now I'll keep this draft because I want to make sure the CI works. This is not anything crucial for the LDAS so it can wait until @biljanaorescanin is back from her leave so she can do full testing.

@mathomp4 mathomp4 added the 0-diff label Jul 6, 2022
@mathomp4 mathomp4 self-assigned this Jul 6, 2022
@gmao-rreichle
Copy link
Contributor

@mathomp4, thanks for keeping us current. I went ahead and ran the tests, in the hope that it would work out and we can just merge the PR. Alas, I'm getting somewhat odd results. All Intel "model" tests pass, but the Intel "assim" tests fail to run. The error messages are as follows:
pe=00017 FAIL at line=03610 MAPL_HistoryGridComp.F90 <CURRENT.SMAP_L4_SM_gph.20171015_0130z.bin being created for History output already exists>
I started with a perfectly clean $NOBACK/Systemtests directory, so not sure why MAPL would complain about a file already existing.
As for the GNU tests, the build failed for me. Is there something that needs to be tweaked with the test script?
My tests are in /discover/nobackup/rreichle/Systemtests
I ran the full suite last night, and just the GNU tests again this morning. I kept messing up the command line args of my ./LDAS_Nightly Cron.csh submission for the last ("gnudebugconus") test, but the outcome is still that GNU failed to build for me.
We might have to wait until @biljanaorescanin returns, after all.

@mathomp4
Copy link
Member Author

mathomp4 commented Jul 7, 2022

@mathomp4, thanks for keeping us current. I went ahead and ran the tests, in the hope that it would work out and we can just merge the PR. Alas, I'm getting somewhat odd results. All Intel "model" tests pass, but the Intel "assim" tests fail to run. The error messages are as follows: pe=00017 FAIL at line=03610 MAPL_HistoryGridComp.F90 <CURRENT.SMAP_L4_SM_gph.20171015_0130z.bin being created for History output already exists> I started with a perfectly clean $NOBACK/Systemtests directory, so not sure why MAPL would complain about a file already existing.

Interesting. This is due to a new fix in MAPL from @bena-nasa detailed in:

GEOS-ESM/MAPL#1565

based on tests with the ADAS:

GEOS-ESM/GEOSadas#188

where Ben was trying to run the ADAS and found that the scripting wasn't cleaning up background files in a good way and History was essentially trying to write to a file that already existed.

Ben is on leave until next week, but maybe you can read through the GEOSadas issue and see if you can figure out what's happening? He was pretty confident that if History ever hit this, something was wrong.

As for the GNU tests, the build failed for me. Is there something that needs to be tweaked with the test script? My tests are in /discover/nobackup/rreichle/Systemtests I ran the full suite last night, and just the GNU tests again this morning. I kept messing up the command line args of my ./LDAS_Nightly Cron.csh submission for the last ("gnudebugconus") test, but the outcome is still that GNU failed to build for me. We might have to wait until @biljanaorescanin returns, after all.

Ah. These will fail for sure as the nightly test scripts are still pointing to an older-Baselibs g5_modules for GNU. It's due to the clunkiness of GEOS' handling of modules. I'll see if maybe there is a way I can have an option to "pass in" a different Baselibs version to the script so one can override the default.

@gmao-rreichle
Copy link
Contributor

@mathomp4: Thanks for the info. I think I sort of understand the ADAS issue reported by @bena-nasa, but I don't quite see how it would apply to the GEOSldas test case. Here, we're just writing a regular, 3-hr avg output file. There's no 2-step dance between the predictor and corrector segments, so this particular time-stamped file is only written once.
The binary, tile-space file written here is not unique to the "assim" test cases. For example, LDAS_GLOBAL/model case also writes binary, tile-space output but does not throw this error. Also, the "assim" cases are not the only cases that use more than one compute node. Some of the "model" cases do as well. So I'm not seeing anything obvious in the HISTORY or layout config that sets the "assim" cases apart from the "model" cases.
The only thing I can think of is that there's something unique to the "assim" cases about the sequencing of the GridComps that make up GEOSldas. In the "assim" cases, we have the ensemble GridComp (GEOSens_GridComp), which isn't engaged in the "model" cases. Perhaps Ben's check whether the file exists comes too late, and the file has already been initialized somehow? But now I'm just throwing out wild guesses.
If someone can reproduce my error, then fixing it likely requires both @bena-nasa and @weiyuan-jiang, which means later this month at the earliest.
Maybe @biljanaorescanin can hack the GNU tests next week, which would show us if the HISTORY error also happens in GNU.

@bena-nasa
Copy link
Contributor

@gmao-rreichle
Can you post or point me to the History.rc file and the collection that is causing the problem. I'm not seeing any logic problem here. My only two thoughts are

  1. It looks like this is still the binary output, maybe there's some corner case with that which is not apparent from looking at the code. If that is the case once I can see the History.rc and collection the run is using maybe I can reproduce.
  2. Occam's razor, if it is says the file exists, it must exist despite thinking it should not

@gmao-rreichle
Copy link
Contributor

@bena-nasa, the problem occurs with the standard LDAS Nightly test suite. For reference, my most recent (failed) runs are still in
/discover/nobackup/rreichle/SystemTests/
Specifically, the problem occurs with the "LDAS_GLOBAL/assim" test. The run dir (incl HISTORY.rc) for this test is here:
/discover/nobackup/rreichle/SystemTests/runs/LDAS_GLOBAL/assim/CURRENT/run/
You can run the tests yourself following the instructions here:
https://github.com/GEOS-ESM/GEOSldas/wiki/LDAS-Nightly-Tests
You can skip Step 1. Note that you must run two tests: "-t conus -t global" (the build only happens with "conus").
Note also that LDAS_Nightly_Cron.csh has some GNU settings hardwired, so the GNU tests won't build until the script is updated.
Let me know if you have more questions.

@bena-nasa
Copy link
Contributor

@gmao-rreichle @biljanaorescanin
I tried running the model with a collection just like the collection you are having problems with (/gpfsm/dnb44/mathomp4/SystemTests/runs/LDAS_GLOBAL/assim/CURRENT/run/HISTORY.rc) like so:

  goswim_catch.template:       '%y4%m2%d2_%h2%n2z.nc4',
  goswim_catch.mode:           'time-averaged',
  goswim_catch.frequency:       030000,
  goswim_catch.ref_time:        000000,
  goswim_catch.fields:         'SNDZN1'    , 'CATCH'      ,

It had no problem writing this.
Therefore I have to conclude that the answer is Occam's razor, it is failing because prior to the execution of the GEOSldas.x the file exists. Indeed, when Matt did an ls in the scratch directory for his test the file exists and has a size of 0.

Is the scripting maybe touching this file? I really don't think there is any logic problem with MAPL but it really seems like somehow something touched this file before GEOSldas.x executed.

@gmao-rreichle
Copy link
Contributor

@bena-nasa : Thanks. What was the config of your model run? Single-member or ensemble? I'm asking because my only hunch is that the scripting and model execution differs between single-member, no perturbation simulations and ensemble simulations with perturbations. Here's my earlier comment on this:

The only thing I can think of is that there's something unique to the "assim" cases about the sequencing of the GridComps that make up GEOSldas. In the "assim" cases, we have the ensemble GridComp (GEOSens_GridComp), which isn't engaged in the "model" cases.

@biljanaorescanin
Copy link
Contributor

Here are run scripts for tests: /home/mathomp4/SI_Team/scripts/systest/input/LDAS
One ensemble for global model case and 24 for assim case.

@bena-nasa
Copy link
Contributor

bena-nasa commented Jul 11, 2022

@gmao-rreichle @biljanaorescanin This was a stock GESOgcm.x run which can not run ensembles but I don't see how that would matter once the GEOSldas.x executes. When you run ensembles you still run a single executable that runs via MAPL_Cap? Does each ensemble have it's own MAPL_CapGridComp (so would have its own history instantiation) or is it an "ensemble" in the sense that below the single capgridcomp, under the root gridcoomp you have multiple children of somebody? I assume it is the latter.

Still would bet 4 figures this is a scripting problem. @biljanaorescanin why don't you try putting an exit in the script right before (and by that I literally mean the line before the GEOSldas.x executes). Let's see where the problem lies definitively, either the file will be there or it won't in the scratch or temporary directory GEOSldas.x will be run from.

@gmao-rreichle
Copy link
Contributor

or is it an "ensemble" in the sense that below the single capgridcomp, under the root gridcoomp you have multiple children of somebody? I assume it is the latter.

Yes, it's the latter.

Still would bet 4 figures this is a scripting problem. @biljanaorescanin why don't you try...

It's worth a try for @biljanaorescanin to do this, but if this doesn't point to the solution easily, it's probably best to wait for @weiyuan-jiang to be back in the office.

Also, it's still only an assumption that I did not mess up somewhere. Perhaps the most important thing for @biljanaorescanin would be to confirm that she can reproduce my problem with the LDAS_GLOBAL/assim test case.

@gmao-rreichle
Copy link
Contributor

@bena-nasa, @mathomp4, here's an update. Both @biljanaorescanin and I ran more tests using Matt's LDAS_Nightly_Cron.csh script and noticed the following:

  1. Good news first: As @mathomp4 anticipated, GNU builds just fine when @biljanaorescanin updated GIT_G5FILE in LDAS_CUR.csh. When the tests do run (see below), the results are 0-diff.

  2. FWIW, I successfully ran the "conus" and "global" (Intel) tests for the develop branch, just to see if something was wrong with my environment or modules. All seems fine here.

  3. Now the bad news: @biljanaorescanin encountered my original "file exists" runtime error for the (GNU!) LDAS_GNUGLOBAL/assim case and also for the (Intel) LDAS_AGGCONUS/model case. But @biljanaorescanin was able to successfully run the (Intel) LDAS_GLOBAL/assim case, which had failed repeatedly for me. In my latest try, I encountered the "file exists" error in the (Intel) LDAS_CONUS/model case. This suggests that the bug is not specific to the assim (ie, ensemble/perturbations) configuration or the compiler. Rather, the "file exists" error appears to be intermittent, and -- I'm guessing -- may depend on the performance of the I/O system at runtime.

So something in the new infrastructure doesn't seem to want to work with GEOSldas. Could still be a GEOSldas scripting error, or perhaps a problem with how MAPL is used in GEOSldas. Or it's possibly a problem with MAPL HISTORY that is uncovered by GEOSldas but not the GCM. I can only hope it's not too hard to figure out.

@mathomp4
Copy link
Member Author

@gmao-rreichle I can trigger the same error in the GCM. I have a theory I need to discuss with @bena-nasa, but hopefully we can figure out a solution (might be simple, might be complex).

@bena-nasa
Copy link
Contributor

@biljanaorescanin can you try updating MAPL this this branch?
https://github.com/GEOS-ESM/MAPL/tree/hotfix/bmauer/fixes-ldas-pr-%23568
I think it is a trivial fix so if you can please give this hotfix a try and if it works we can patch MAPL.
Thanks

@mathomp4
Copy link
Member Author

I've pushed my components.yaml here to use the branch from @bena-nasa so @biljanaorescanin can easily test. 🤞🏼

@mathomp4
Copy link
Member Author

Why is the CI failing here? It should not be doing so.

@mathomp4
Copy link
Member Author

Okay. I think I figured out the reason the CI was failing. The hotfix branch was not up to date with MAPL main. I've merged main in and pushed.

I've also made a useless comment to re-trigger the CI

@biljanaorescanin
Copy link
Contributor

biljanaorescanin commented Jul 14, 2022

All except one assim test are now failing for the same error for both gnu and intel.

One that passes is for aggglobalcs assim case.

All model tests passed now as you can see below.

Error all failed tests have:
pe=00065 FAIL at line=03610 MAPL_HistoryGridComp.F90 <CURRENT.SMAP_L4_SM_gph.20170422_2230z.bin being created for History output already exists>

Screen Shot 2022-07-14 at 11 02 28 AM

@bena-nasa
Copy link
Contributor

@biljanaorescanin @gmao-rreichle this is baffling, well, I reverted all_pes to .true. and moved the file inquiring into the netcdf block of that if/else. Can you pull the hotfix branch and try again.

@biljanaorescanin
Copy link
Contributor

@bena-nasa @mathomp4 @gmao-rreichle
All tests passed! ( I've run my copy of systest so we know what we get for GNU as well)

@gmao-rreichle
Copy link
Contributor

Thanks, @biljanaorescanin. Before we can merge this, we need a new release of MAPL that includes @bena-nasa's hotfix, and once we have that, we'll need to run the tests one more time to be safe.

@mathomp4 mathomp4 changed the title Update to MAPL 2.23.0 Update to MAPL 2.23.1 Jul 15, 2022
@mathomp4
Copy link
Member Author

@biljanaorescanin @gmao-rreichle We have released MAPL 2.23.1 and I've updated this PR accordingly

@biljanaorescanin
Copy link
Contributor

I've retested from Matt's systests, all tests except GNU passed as we expected.
To pass GNU tests we need to point to the new baselibs - 7.5.0 ( this was tested from my sandbox -> all tests pass then)

@gmao-rreichle gmao-rreichle marked this pull request as ready for review July 16, 2022 10:58
@gmao-rreichle gmao-rreichle requested review from a team as code owners July 16, 2022 10:58
@gmao-rreichle gmao-rreichle changed the title Update to MAPL 2.23.1 Update to MAPL 2.23.1 and ESMA_env v4.2.0 Jul 16, 2022
@gmao-rreichle gmao-rreichle merged commit 9264e07 into develop Jul 16, 2022
@gmao-rreichle gmao-rreichle deleted the feature/mathomp4/update-ldas-mapl-2230 branch July 16, 2022 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants