Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove multiply-defined environment variables #52

Merged
merged 6 commits into from
Jul 18, 2024
Merged

Conversation

weaverba137
Copy link
Member

This PR closes #49.

Important question for @sbailey, @marcelo-alvarez, @akremin, @tskisner: we've decided to consolidate the definition of various NERSC-performance environment variables, e.g. KMP_INIT_AT_FORK. Was there a specific reason that the environment variables were defined in several different places in the module file, i.e., the definitions have to be made before at least some DESI modules are loaded?

@marcelo-alvarez
Copy link
Contributor

This PR closes #49.

Important question for @sbailey, @marcelo-alvarez, @akremin, @tskisner: we've decided to consolidate the definition of various NERSC-performance environment variables, e.g. KMP_INIT_AT_FORK. Was there a specific reason that the environment variables were defined in several different places in the module file, i.e., the definitions have to be made before at least some DESI modules are loaded?

I can't remember if there was a specific reason to define the variables in several different places. However, I don't think the definitions have to be made before any DESI module is loaded, and don't anticipate any problems in the environment caused by defining them in the same place as you've done in this branch.

As far as I can tell, the comment

# work around
# https://docs.nersc.gov/development/languages/python/using-python-perlmutter/#mpi-issues

applies to the following lines

setenv CXI_FORK_SAFE 1
setenv CXI_FORK_SAFE_HP 1
module load evp-patch

so you probably want to group those together just below the comment.

@weaverba137
Copy link
Member Author

Thank you @marcelo-alvarez, I've actually subdivided those two by the subsection of the #mpi-issues link.

I also notice that we do

module unload cudatoolkit
module load cudatoolkit/12.2

But 12.2 is the default, so why specifically is this done?

@weaverba137
Copy link
Member Author

PS, it would also be nice to have specific documentation for KMP_INIT_AT_FORK (does not appear at all in docs.nersc.gov) and MPI4PY_RC_RECV_MPROBE, which supposedly is described at https://docs.nersc.gov/current/#ongoing-issues, but in fact does not appear at all.

@marcelo-alvarez
Copy link
Contributor

I agree it doesn't make sense as is. The module unload cudatoolkit is a remnant of when desiconda had its own cudatoolkit installed as a dependency, in order to avoid conflicts between that version and the NERSC system-wide version in the module.

@sbailey would know for sure, but it looks like desihub/desiconda#68 removed the desiconda cudatoolkit, and so all that needs to be done in desimodules is to load the NERSC system-wide version via module load cudatoolkit (with no unloading), with the caveat that a future change to the default cudatoolkit module may break the environment for unforeseen reasons. Such a caveat applies for other default modules as well, so that would argue in favor of just using the default. I think either the module load cudatoolkit or module load cudatoolkit/12.2 would be fine.

As for KMP_INIT_AT_FORK things might break if you remove it, see e.g. desihub/nightwatch#335 (comment). It would be great if you could get to the bottom of what cases this is actually needed for and remove it if there are better solutions, but otherwise you might just want to leave it in, as inelegant as it is.

With regard to MPI4PY_RC_RECV_MPROBE='False' there was an ongoing issue at https://docs.nersc.gov/current/ circa August 2022 which was described there as follows:

Our slingshot 11 libfabric (Perlmutter CPU nodes) is currently missing the functions MPI_Mprobe()/MPI_Improbe() and MPI_Mrecv()/MPI_Imrecv(). This may especially impact mpi4py users attempting to send/receive pickled objects. One workaround may be to set export MPI4PY_RC_RECV_MPROBE='False'. The current estimated timeframe for a fix is January 2023. If these missing functions are impacting your workload, please open a ticket to let us know.

Given it is now July 2024, and there is no mention of this at the same page, you might assume it's fixed and remove it, though you may want to test it with a test-prod just to make sure. @sbailey should probably make this call.

@sbailey
Copy link
Contributor

sbailey commented Jul 17, 2024

Thanks for digging up the history, @marcelo-alvarez . For the purposes of this cleanup update, let's aim for achieving the exact same environment as we currently have, since we know that it works with the current pipeline. i.e. this is about making the modules more maintainable in the future, not about changing the environment now (don't fix what isn't broken).

We probably don't need MPI4PY_RC_RECV_MPROBE='False' anymore, but it works as-is, it was put in place to solve an intermittent instability, and it would be difficult to prove that removing it doesn't re-introduce a rare crash. Let's not open that can of worms now.

@weaverba137
Copy link
Member Author

I'm OK with leaving the environment variables in place, but I'm still not sure about:

module unload cudatoolkit
module load cudatoolkit/12.2

I verified that cudatoolkit/12.2 is loaded by default when logging in to NERSC, without invoking any DESI environment.

@weaverba137
Copy link
Member Author

Another question:

This version is set as the "default" desiconda version before being overridden on perlmutter. However, this version does not exist at all. Should the default be a version that actually exists?

set desiconda_version 20211217-2.0.0

@weaverba137 weaverba137 marked this pull request as ready for review July 17, 2024 21:26
@weaverba137
Copy link
Member Author

Marking this as ready for review. I've added a few new questions above already.

@weaverba137 weaverba137 changed the title [WIP] Remove multiply-defined environment variables Remove multiply-defined environment variables Jul 17, 2024
Copy link
Contributor

@sbailey sbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this cleanup. I verified that at NERSC it produces the same environment as current main, except for

  • adding $DESI_TILES (expected)
  • PATH has different order due to module unload/load cudatoolkit at the end instead of the beginning (I don't think this matters)
  • Miscellaneous module bookkeeping vars like _ModuleTable011_ and _LMFILES_ (looks ok)

I put several comments inline requesting changes to continue supporting non-NERSC locations like KPNO. I'd also like to confirm with @schlafly and/or @araichoor about how the equivalent of $DESI_TILES is tracked in surveyops / afternoonplanning to know where the fiberassign/tiles/trunk checkout is to make sure we don't re-invent an arbitrarily different wheel.

main Outdated Show resolved Hide resolved
main Outdated Show resolved Hide resolved
main Show resolved Hide resolved
main Show resolved Hide resolved
@weaverba137
Copy link
Member Author

@sbailey, there's still the question about the default value of desiconda_version, but otherwise, I've applied all requested changes.

@sbailey
Copy link
Contributor

sbailey commented Jul 18, 2024

From Slack #survey-ops discussions, online surveyops has been using $FIBER_ASSIGN_DIR for what we called $DESI_TILES here. I suggest that we switch to using $FIBER_ASSIGN_DIR here (and desihub/desispec#2294) to match that prior usage. @weaverba137 if you feel strongly that we should stick with $DESI_TILES, please document your thoughts here. I could be ok with that, but that doesn't necessarily imply that we would update surveyops to switch to $DESI_TILES instead. i.e. if we go with $DESI_TILES here, it might be better to live with the duplicate names rather than risk breaking operations even briefly for a cosmetic change.

@weaverba137
Copy link
Member Author

I don't have any issue with using FIBER_ASSIGN_DIR. I've already committed the change on this PR.

And oops, I forgot to push my previous commit!

@weaverba137
Copy link
Member Author

I've also changed to FIBER_ASSIGN_DIR in desihub/desispec#2294.

@sbailey
Copy link
Contributor

sbailey commented Jul 18, 2024

I did some digging on desiconda_version: the current structure is mostly a leftover from supporting both cori and perlmutter simultaneously, i.e. doesn't apply now but could apply again in the future. At KPNO, we set DESI_PRODUCT_ROOT and "module load desiconda/some-version" ahead of time, so desiconda_version is never used. i.e. there is opportunity for conceptual cleanup here as well, but for now I've made the minimal change of making the default a version that exists (even if that default is never used), and adding some comments.

I re-verified that the environment is the same as current main plus $FIBER_ASSIGN_DIR, i.e. looks good. I will merge this now but will do separate testing of this plus desihub/desispec#2294 (and similar other repos) before merging those.

@sbailey sbailey merged commit e58ff93 into main Jul 18, 2024
@sbailey sbailey deleted the multiple-env-var branch July 18, 2024 23:37
@weaverba137
Copy link
Member Author

@sbailey, thank you, I agree with your documentation on desiconda_version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Environment variables are multiply-defined in several packages including desimodules itself
3 participants